Finding bugs
============

Goal: how do we build secure systems if we have buggy software?
Today's approach: find bugs (and fix them)!

How do we find bugs?  Two basic steps.

  Define what constitutes a bug (often called an "assertion" or "specification").
    Crash.
    Out-of-bounds memory access.
    Overwriting the return address.
    Jumping to non-executable code.
    Executing arbitrary code.
    Returning unsanitized / unchecked data.
    .. potentially other application-specific notions ..
    Tension: a broad definition catches more bugs but is harder to check;
      a narrow, more low-level definition is easier to check and find violations of.
    Many bug-finding tools tend towards easily decidable "local" assertions.
      Roughly, an assert() statement that ideally should never run.
      Assertions typically in the form "if (something) assert();"

  Consider how the code could run, and check if it satisfies the specs / assertions.
    Many things go into determining "how code runs".
      Inputs.
      Non-determinism.
      Scheduling.
    Different ways to reason about how code might run.
      Run the code with specific inputs.
      Run the code in some alternate execution model (maybe even without concrete inputs).
      Logically reason about how the code would run.

Example: service that increments counters.

  Expected input format:

    +-------+-------+-----+-----+-----+-----+
    |  hdr  |  len  | ctr | ctr | ... | ctr |
    +-------+-------+-----+-----+-----+-----+
     1 byte  1 byte    len bytes of ctrs

  Example code (buggy.c):

    char input[64];
    int hdr = input[0];
    if (hdr != 5) return;
    int cnt = input[1];
    if (cnt > 64) return;
    int counters[32];
    char *nextctr = &input[2];
    for (int i = 0; i < cnt; i++) {
      counters[*nextctr]++;
      nextctr++;
    }

Testing: hand-constructed cases.
  Important: write tests early in development; they document expected behavior,
    document expected corner cases, and enable regression testing.
  Big benefit: can capture sophisticated assertions.
  Incomplete: the developer might forget about some important corner cases.
    Impractical to write tests for every possible case.
  What tests could we write for our example snippet?
    Input bytes and how they affect counters.
    Input with an incorrect first byte should not affect counters.
    Input that asks to increment more than 32 counters should do nothing.
    Input that increments a particular counter should bump that counter's value.
  What's a bug we might want to find?
    cnt can be 64: passes the "> 64" check but is still too large, since we then
      read up to input[2+cnt-1] = input[65] via nextctr, past the end of input[].
    No check on which counter we are incrementing: e.g., input[2] = 100
      makes us increment counters[100], past the end of counters[32].

Fuzzing: random cases.
  Typically focused on relatively simple assertions like "no crashes".
  Much of the benefit of fuzzing comes from achieving higher coverage.
  Starting point: just feed in random inputs.
    Can be effective at triggering corner cases developers didn't consider.
    Often hard to construct inputs that trigger bugs "deeper" in the code.
      Need a lot of structure to get past initial parsing, checking, etc.
  Big idea in fuzzing: evolve inputs to improve coverage (see the sketch below).
    Keep track of a set of inputs to the program (e.g., PDF files).
    Make random mutations to an input (e.g., change or insert a few bytes in a file).
    Use some metric that indicates coverage (e.g., branches executed in the code).
    The fuzzer keeps applying mutations and tracks whether any of them improve coverage.
      If coverage goes up, keep the mutation.
  Effective at discovering bugs in input handling / parsing code.
  Widely used in practice.
    E.g., Google Chrome (https://www.chromium.org/Home/chromium-security/bugs)
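  A minimal sketch of the evolve-inputs loop described above (illustrative Python,
  not any particular tool's algorithm); run_with_coverage() is a hypothetical
  stand-in for the instrumentation that real fuzzers compile into the target to
  observe which branches each input covers:

    import random

    def mutate(data):
        # Replace one random byte; real fuzzers use many more mutation strategies
        # (bit flips, splicing inputs together, inserting "interesting" values).
        i = random.randrange(len(data))
        return data[:i] + bytes([random.randrange(256)]) + data[i+1:]

    def fuzz(corpus, run_with_coverage, rounds=100000):
        # corpus: list of non-empty byte-string seed inputs.
        # run_with_coverage(inp) -> (crashed, set of covered branches)  [hypothetical]
        seen = set()
        for inp in corpus:
            crashed, cov = run_with_coverage(inp)
            seen |= cov
        for _ in range(rounds):
            candidate = mutate(random.choice(corpus))
            crashed, cov = run_with_coverage(candidate)
            if crashed:
                return candidate          # found an input violating "no crashes"
            if not cov <= seen:           # new coverage: keep this mutation
                corpus.append(candidate)
                seen |= cov
        return None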
  Many fuzzing tools.
    [[ Ref: https://lcamtuf.coredump.cx/afl/ ]]
    [[ Ref: https://github.com/dvyukov/go-fuzz ]]
    [[ Ref: https://rust-fuzz.github.io/book/cargo-fuzz.html ]]

  Demo: fuzzing buggy.c.
    afl-gcc buggy.c -o buggy
    mkdir testcase findings
    echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa > testcase/init
    afl-fuzz -i testcase -o findings ./buggy
    Uncomment the assert to make the program crash.

  How might the fuzzer find the bugs in our example code snippet?
    Would need to randomly generate inputs until one passes the first "!= 5" check.
    Would then generate random inputs until they also pass the "> 64" check.
    After that, coverage looks about the same, so it just keeps trying random inputs.

Different ways to build a fuzzer.
  Generating inputs.
    Byte-level fuzzing: just byte sequences.
      Good for parsers, network protocols, etc.
      Not as effective at finding interesting bugs at higher levels.
    Structured inputs: e.g., syntactically valid files, XML docs, messages, etc.
      Better at quickly exploring interesting bugs in higher-level code.
      E.g., for JSON inputs, better to generate JSON than to guess byte-by-byte.
  Coverage metrics, or more generally some gradient for the fuzzer to climb.
    Typical code coverage: good for exploring branches.
    Might augment this with important values (e.g., "is_valid", "logged_in", etc).
      Coverage looks the same, but it's important that we got a specific value!
    Might want to look for particularly problematic arguments.
      E.g., allocating lots of memory: bigger malloc() calls are more interesting.
    Works well when climbing the optimization metric is sufficiently "smooth".
      Unlikely to discover a random input that matches some SHA-256 hash.

Symbolic execution.
  A more sophisticated way to come up with inputs than random mutation.
  The program "runs" in a symbolic world where data can be a symbolic expression.
    Typically, the program input is an arbitrary symbolic expression (say, "x").
    Memory values can be either "concrete" (real bytes) or "symbolic" (expressions).
    Operations produce the appropriate expressions.
      E.g., adding 5 and 6 produces 11.
      E.g., adding 2 and x produces (x+2).
      E.g., comparing 11 and (x+2) produces (11 =? x+2).
    Of course, this execution is done by some interpreter, not on real hardware.
  Core challenge in symbolic execution: what to do with control flow?

    if (condition) {
      foo();
    } else {
      bar();
    }

    What happens if condition is a symbolic expression?
    This is actually the core challenge of testing and fuzzing: exploring paths.
  Symbolic execution idea: use a SAT solver to decide if either path is possible.
    SAT solver black-box API:
      Input is a symbolic formula that evaluates to a bool (true or false).
      Output is a set of values for the symbolic variables that makes the formula
        true, or "unsat" if it is impossible to make the formula true, or "unknown".
      "Unknown" is practically always a possibility: SAT is not known to be
        efficiently solvable in general (it is NP-complete).
      Turns out that, in practice, SAT solvers are really good at solving many formulas.
  Symbolic execution engine: give the control-flow constraint to the SAT solver.
    If satisfiable, keep executing down that path.
      Need to remember what constraint took us down that path, though.
    E.g.:

      int a = symbolic_input();
      if (a % 2 == 0) {   // even
        if (a == 5) {
          assert();
        }
      }

    Check both sides of the "if" statement.
    Demo: python
      import z3
      a = z3.BitVec('a', 32)
      z3.solve(a % 2 == 0)
      z3.solve(a % 2 == 0, a == 5)
  If we get to an assert(), ask the SAT solver for a satisfying assignment.
    Constraint: the current path condition (all branches that got us to the assert).
    The satisfying assignment tells us what input triggers the bug.
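  Continuing the z3 demo above: a sketch of how an engine can ask for that
  satisfying assignment explicitly, using z3's Solver/model interface (for this
  particular path the answer is "unsat", i.e., the assert() is unreachable):

    import z3

    a = z3.BitVec('a', 32)
    path = [a % 2 == 0, a == 5]        # path condition leading to the assert()
    s = z3.Solver()
    s.add(*path)
    if s.check() == z3.sat:
        print("assert() reachable with input a =", s.model()[a])
    else:
        print("path infeasible:", s.check())   # unsat here: 5 is odd

  The same kind of query, applied path by path to the counter-service snippet,
  is what the walkthrough below steps through with z3.solve().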
How would symbolic execution find the bugs in our example snippet?

  (i_0 != 5)
    i0 = z3.BitVec('i0', 8)
    z3.solve(i0 != 5)
    z3.solve(i0 == 5)
  (i_1 > 64)
    i1 = z3.BitVec('i1', 8)
    z3.solve(i0 == 5, i1 > 64)
    z3.solve(i0 == 5, i1 <= 64)
  (0 < i_1)
    Memory accesses:
      *nextctr reads input[2] -> need 0 <= 2 < 64
        z3.solve(i0 == 5, i1 <= 64, 0 < i1, z3.And(0 <= 2, 2 < 64))
        z3.solve(i0 == 5, i1 <= 64, 0 < i1, z3.Not(z3.And(0 <= 2, 2 < 64)))
      counters[i_2] -> need 0 <= i_2 < 32
        i2 = z3.BitVec('i2', 8)
        z3.solve(i0 == 5, i1 <= 64, 0 < i1, z3.And(0 <= i2, i2 < 32))
        z3.solve(i0 == 5, i1 <= 64, 0 < i1, z3.Not(z3.And(0 <= i2, i2 < 32)))
  (1 < i_1)
    *nextctr reads input[3] -> need 0 <= 3 < 64
    counters[i_3] -> need 0 <= i_3 < 32
  ..
  Quickly discovers a violation of 0 <= i_2 < 32.
  Takes more loop iterations to discover the out-of-bounds read past the end of input[].

Symbolic execution explores all possible paths.
  In practice there are too many paths to explore.
  Apply similar kinds of heuristics as in fuzzing.
    Which paths seem to be exploring interesting executions?
    Typically coverage-based.
  But symbolic execution has a more systematic plan for how to evolve the input.
    Instead of random mutation, use the SAT solver to find an input that triggers a branch.
  Many other practical challenges, e.g., symbolic indexing into an array.
    arr[b] where b is a symbolic expression.
    Could branch out into many possible control flows!
      Naively translate this into something like:
        if (b == 0) then arr[0] else if (b == 1) then arr[1] else ...
    Could be compact if our symbolic language can refer to array lookup/update.
      arr[b] is just Lookup(arr, b).
      arr[b] = c means arr is now Update(arr, b, c).

Symbolic execution in practice.
  Researchers have shown it can find many bugs in all kinds of systems.
  Reasonably nice tools that can be used by non-experts.
    [[ Ref: https://klee.github.io/ ]]
    [[ Ref: https://s2e.systems/ ]]
  In principle, symbolic execution can cover all possible inputs.
    But limited in practice because we need to avoid long loops, recursion, etc.
    [[ Ref: https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-dobrescu.pdf ]]
  Some use in practice, but requires more elaborate setup than fuzzing.

  Demo: using klee with buggy.c.
    make buggy.bc
    klee buggy.bc
    Quickly finds the out-of-bounds counter value.
    Takes longer to find the read past the end of the 64-byte input.
    Completes: checked every possible path!

Can we analyze all possible cases of a program's execution?
  Symbolic execution works for simple examples.
  Looping and recursion require more sophistication.
  Generally called formal verification.
    Broad area, many point solutions in this overall space.
  Main idea: reason about code execution instead of "running" the code.
    We already saw that "running" the code can be flexible, as in symbolic execution.
    But we want ways to reason about arbitrary inputs without running an
      infinite number of executions.
    Able to prove correctness (and security) for all possible inputs.
  Big idea: specify loop invariants.
    Symbolic system state at the beginning of each loop iteration.
    Usually depends on the loop variable.
    Three things that we need to check (see the z3 sketch below):
      1. The symbolic state when we first reach the loop matches the loop invariant
         for the initial value of the loop variable (e.g., i=0).
      2. Starting with the loop invariant state for some variable i (0 <= i < N),
         executing the loop body produces state matching the loop invariant for i+1.
      3. The loop invariant at loop exit (e.g., i = N) gives us what we need to
         reason about the code after the loop.

  General plan for verification:

    Code --> Model of code --> Verification system --> OK?
                                       ^
                                       |
                                Proof/hints (maybe)
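  As a concrete illustration of the three checks (a generic example, not the
  counter snippet): take the loop "for (i = 0; i < n; i++) sum += i;" with loop
  invariant 2*sum == i*(i-1).  Each check can be phrased as asking z3 whether a
  violating state exists; every query should answer "unsat":

    import z3

    i, s_, n = z3.Ints('i sum n')
    def inv(i, s_):                    # invariant: 2*sum == i*(i-1)
        return 2*s_ == i*(i-1)

    def check(*claim):                 # "unsat" = the claimed violation is impossible
        slv = z3.Solver()
        slv.add(*claim)
        print(slv.check())

    # 1. Loop entry (i=0, sum=0) satisfies the invariant.
    check(i == 0, s_ == 0, z3.Not(inv(i, s_)))
    # 2. Invariant at i (with i < n), plus one loop body (sum += i; i++),
    #    re-establishes the invariant at i+1.
    check(inv(i, s_), i < n, z3.Not(inv(i + 1, s_ + i)))
    # 3. Invariant at exit (i == n) implies the postcondition 2*sum == n*(n-1).
    check(inv(n, s_), z3.Not(2*s_ == n*(n-1)))

  A verification system automates generating (and discharging) queries like these
  from the code and the stated invariant.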
How to prove properties of code?  Hoare logic.

  { PRECONDITION } code { POSTCONDITION }

  Meaning: if PRECONDITION holds, then running code will make POSTCONDITION hold.
  PRE and POST are predicates about the state of the system.

  E.g., suppose we have some code snippet like:

    if (x < 32) {
      counters[x] = 1;
    }

  Hoare logic rules for the array store, actually two steps:
    Reading the x variable:
      { state.vars["x"] = x }  x  { RET x; state' = state }
    Storing into an array:
      { 0 <= x < state.vars["counters"].size }
        counters[x] = y
      { RET (); state'.vars["counters"].elem[x] = y }

  Hoare logic rule for our example "if" statement:

    { P /\ pred } code1; { Q }   ->
    { P /\ ~pred } code2; { Q }  ->
    { P } if (pred) code1; else code2; { Q }

  Enables verifying how our snippet would execute.
    Assuming unsigned x and a 32-element counters array.
    (A small z3 sketch of this bounds check appears at the end of these notes.)
  Can tie into some larger spec for the overall procedure.

How much should we believe a verified program?
  Wrong spec?  Maybe a problem, though it might not matter a huge amount.
  Wrong model of the code?  Could be, e.g., failing to model some C weirdness.
    What matters is whether the model of the code matches how the code actually
      runs (compiler, hardware, ..).
  Bugs in the verification system?  Maybe; depends on the verification style.
    Some verifiers are "privilege-separated" in the sense of having a small proof checker.
    Other verifiers are large trusted code bases that trust an entire SAT solver.

Some initial successes for verification.
  Verified crypto code in Chrome, Firefox, WireGuard, ...
  Verified BPF engine (JIT) in the Linux kernel on RISC-V.
  Still a fairly heavyweight effort requiring experts, but an active research area.
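Referenced from the Hoare-logic example above: a minimal z3 sketch (assuming
unsigned 32-bit x and a 32-element counters array, as in that snippet) checking
that the taken branch of "if (x < 32)" establishes the array store's
precondition 0 <= x < 32:

  import z3

  x = z3.BitVec('x', 32)                           # unsigned 32-bit x
  pred = z3.ULT(x, 32)                             # branch condition (unsigned <)
  store_pre = z3.And(z3.UGE(x, 0), z3.ULT(x, 32))  # array-store precondition

  s = z3.Solver()
  s.add(pred, z3.Not(store_pre))   # any x that takes the branch but violates
  print(s.check())                 #   the store precondition?  expect: unsat

"unsat" means no such x exists: the store in the taken branch is always in bounds.
This is the kind of query a verification system generates from the Hoare rules above.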