==Phrack Inc.==
Volume 0x10, Issue 0x46, Phile #0x04 of 0x0f
|=-----------------------------------------------------------------------=|
|=---------------------=[ Cyber Grand Shellphish ]=----------------------=|
|=-----------------------------------------------------------------------=|
|=------------------------=[ Team Shellphish ]=--------------------------=|
|=----------------------=[ team@shellphish.net ]=------------------------=|
|=----------------=[ http://shellphish.net/cgc#team ]=-------------------=|
|=-----------------------------------------------------------------------=|
#####
# # # # ##### ###### #####
# # # # # # # #
# # ##### ##### # #
# # # # # #####
# # # # # # # #
##### # ##### ###### # #
#####
# # ##### ## # # #####
# # # # # ## # # #
# #### # # # # # # # # #
# # ##### ###### # # # # #
# # # # # # # ## # #
##### # # # # # # #####
MM
. MM
= M
M MM
MM M
MM MM
,M M
MMM MM+ MM MM M
MM MM MM MM MM MMMMD MMM
MM+ MM MM MM MM +M MM :M MM MMM:
MMMM MM MM MM MM MM MMD M. MM MMM M MMM
MMMMM =M~ MM MM MM MM MM M~MM MM ~MM MM MM
,MMM OM MM MM MM MM MM M.MM MM MMM MMMMMM+NM MM MM
MMMMMMMMMM +MMMMMM MMMN MM MM MMMMMM MM MM MM MM MMO MM
MMMMMMMMMMMMM MMMMMM MMMN MM MM MMMMMN MMMMMM MM MMMMMMMMM MMMMMMM
MMMM: MM MM MM MM NM: MM ?M MM MM MMM MMM MM MM
,M MMM MM MM MM MM MM? MM ~MM ~M MM :MMMMMMMMM MM MMM
M M~MMMMMM MM~ MM MM MM MM NM+ MM MM NMM ~N MM
,MMMMMMM8 MM MM :MN MM MM+ MMI NMM M
MMI MM MMM+ MM~ MMD~ MM
|=-----------------------------------------------------------------------=|
|=----------------------------=[ The Team ]=-----------------------------=|
|=-----------------------------------------------------------------------=|
|=---=[ zardus ]=-----=[ mike_pizza ]=---=[ anton00b ]=----=[ salls ]=---=|
|=-----------------------------------------------------------------------=|
|=---=[ fish ]=---------=[ nebhiros ]=---=[ cao ]=--------=[ donfos ]=---=|
|=-----------------------------------------------------------------------=|
|=---=[ hacopo ]=---------=[ nezorg ]=---=[ rhelmot ]=------=[ paul ]=---=|
|=-----------------------------------------------------------------------=|
|=----------------------------=[ zanardi ]=------------------------------=|
|=-----------------------------------------------------------------------=|
Hacking is often considered more than a skill. In popular culture, hackers
are seen as wizards of sorts, artists with powers to access the
inaccessible, or perform acts that seem impossible. Hacking, like art, has
great people, who are recognized for their skills, but whose abilities
cannot be captured or reproduced. In fact, a single great hacker on a team
is worth more than a hundred mediocre ones, just as none of the paintings
from a hundred mediocre artists can match a painting by van Gogh.
Vulnerability analysis is the science of capturing and reproducing what
some hackers do. Vulnerability analysis studies how one can reason, in a
principled way, about finding and exploiting vulnerabilities in all types
of software or hardware. By developing algorithms and tools to help humans
identify flaws in software, researchers "codify" the knowledge that hackers
use, in an organic way, to analyze systems and find their flaws. The
resulting tools can then be used at scale, and composed to create new
analysis systems.
This scientific process has generated a number of useful tools, such as
static analysis tools, fuzzers, and symbolic execution frameworks. However,
these tools codify only a subset of the skills of a hacker, and they are
still used only to augment the abilities of humans.
One approach to pushing the codification of what human hackers do is to
take the hackers out of the equation. This is precisely what the DARPA
Cyber Grand Challenge set out to do.
The DARPA Cyber Grand Challenge (CGC) was designed as a Capture The Flag
(CTF) competition among autonomous systems without any humans being
involved. During the competition, Cyber Reasoning Systems (CRSs) would find
vulnerabilities in binaries, exploit them, and generate patches to protect
them from attacks, without any human involvement at all.
The separation between human and machine is key, as it forces the
participants to codify, in algorithms, the techniques used for both attack
and defense. Although the competition was only a first step toward
capturing the art of hacking, it was an important one: for the first time,
completely autonomous systems were hacking one another with code, and not
human intuition, driving the discovery of flaws in complex software
systems.
Shellphish is a team that was founded by Professor Giovanni Vigna at UC
Santa Barbara in 2005 to participate in the DEF CON CTF with his graduate
students. Since then, Shellphish has evolved to include dozens of
individuals (graduate students - now professors elsewhere, undergraduate
students, visitors, their friends, etc.) who are somewhat connected by the
Security Lab at UC Santa Barbara, but who are now spread all across the
world. Nonetheless, Shellphish has never lost its "hackademic" background
and its interest in the science behind hacking. Participation in many CTF
competitions sparked novel research ideas, which, in addition to
publications, resulted in tools, which, in turn, were put to good use
during CTF competitions.
Given the academic focus of Shellphish, it is no surprise that the DARPA
Cyber Grand Challenge seemed like a great opportunity to put the research
carried out at the UC Santa Barbara SecLab to work. Unfortunately, when the
call for participation for the funded track came out, the lab was (as
usual) busy with a great number of research projects and research
endeavors, and there were simply no cycles left to dedicate to this effort.
However, when DARPA announced a qualification round open to anybody who
wanted to throw their hat in the ring, a group of dedicated students from
the SecLab decided to participate, entering the competition as team
Shellphish.
With only a few weeks to spare, the Shellphish team put together a
prototype of a system that automatically identifies crashes in binaries
using a novel composition of fuzzing and symbolic execution.
Unsurprisingly, the system was largely unstable and crashed more than the
binaries it was supposed to crash. Yet, it performed well and Shellphish
was one of the seven teams (out of more than a hundred participants) that
qualified for the final event. Because Shellphish had not been funded by
DARPA through the funded track, qualifying earned the team a $750,000
award to fund the creation of the autonomous system that would participate
in the final competition.
The following few months focused mostly on basic research, which resulted
in a number of interesting scientific results in the field of binary
analysis [Driller16, ArtOfWar16], and in the dramatic improvement of angr
[angr], an open-source framework created at the UC Santa Barbara SecLab to
support the analysis of binaries.
Eventually, the pressure to create a fully autonomous system increased to
the point that academic research had to be traded for system-building.
During the several months preceding the final competition event, all the
energy of the Shellphish team focused on creating a solid system that could
be resilient to failure, perform at scale, and be able not only to crash
binaries, but also to generate reliable exploits and patches.
After grueling months of sushi-fueled work that led to severe circadian
rhythm sleep disorders [Inversion] in many of the team's members, the
Mechanical Phish Cyber Reasoning System was born. Mechanical Phish is a
highly-available, distributed system that can identify flaws in DECREE
binaries, generate exploits (called Proofs Of Vulnerability, or POVs), and
patched binaries, without human intervention. In a way, Mechanical Phish
represents a codification of some of the hacking skills of Shellphish.
Mechanical Phish participated in the final event, held in Las Vegas on
August 4th, in conjunction with DEF CON, and placed third, winning a
$750,000 prize. As a team, we were ecstatic: our system performed more
successful exploits than any other CRS, and it was exploited on fewer
challenges than any other CRS. Of course, in typical Shellphish style, the
game strategy is where we lost points, but the technical aspects of our CRS
were some of the best.
We decided to make our system completely open-source (at the time of
writing, Shellphish is the only team that decided to do so), so that others
can build upon and improve what we put together.
The rest of this article describes the design of our system, how it
performed, and the many lessons learned in designing, implementing, and
deploying Mechanical Phish.
--[ Contents
1 - The Cyber Grand Challenge
2 - Finding Bugs
3 - Exploiting
4 - Patching
5 - Orchestration
6 - Strategy
7 - Results
8 - Warez
9 - Looking Forward
--[ 001 - The Cyber Grand Challenge
The Cyber Grand Challenge was run by DARPA, and DARPA is a government
agency. As such, the number of rules, regulations, errata, and so forth
was considerably beyond anything we were familiar with. For example, the
Frequently Asked Questions document alone, which became the CGC Bible of
sorts, reached 68 dense pages by the time the final event came around;
roughly the size of a small novella. This novella held much crucial
information, and this crucial information had to be absorbed by the team,
digested, and regurgitated into the squawking maw of the Mechanical Phish.
--[ 001.001 - Game Format
The CGC Final Event took the form of a more-or-less traditional
attack-defense CTF. This means that each team had to attack, and defend
against, each other team. Counter to A&D CTF tradition, and similar to the
setup of several editions of the UCSB iCTF, exploits could not be run
directly against opponents, but had to be submitted to the organizers.
Likewise, patches could not be directly installed (i.e., with something
like scp), but had to be submitted through the central API, termed the
"Team Interface" (TI) by DARPA. This allowed DARPA to maintain full control
over when and how often attacks were launched, and how patches were
evaluated.
The CGC was divided into rounds (specifically, there were 95 rounds in the
final event) of at least 5 minutes each. Each round, the TI specified the
currently-active challenges. A challenge could be introduced at any point,
up to a maximum of 30 concurrently active challenges, and was guaranteed to
remain live for a minimum of 10 rounds.
All teams would start with the same binaries for a challenge. Each round,
teams could submit patches for their instances of the challenge binaries,
exploits for the opponents' instances, and network rules to filter traffic.
Submitted exploits would go "live" on the round *after* the one during
which they were submitted, while submitted patches and network rules would
go live 2 rounds after. Submitting a patch caused that team to incur one
round of downtime for that challenge, during which opponents could download
the patched binary. Patched binaries were visible to opponents to allow
them to exploit incomplete patches.
--[ 001.002 - Game Scoring
The CGC employed a scoring algorithm that had serious implications for the
strategies that were viable for teams to adopt. Each team would be scored
on a per-round basis, and a team's final score was the sum of all of their
round scores. A team's round score was, in turn, the sum of their
"Challenge Binary Round Scores" for that round. This CB Round Score was the
real crux of the matter.
A CB Round Score was calculated for every Challenge Binary as the simple
product of:
- Availability: this was a measure of the performance overhead and
functionality impact introduced by a patch. The performance and
functionality scores each ranged from 0 (broken) to 100 (perfect), and
the availability score was the minimum of these numbers.
- Security: this number was 2 if the Challenge Binary was secure (i.e.,
it had not been exploited by any competitor that round), and 1
otherwise (if it was exploited by at least one competitor).
- Evaluation: this number ranged from 1.0 to 2.0, based on how many
opponents the team was hitting with their exploits (if any) for this
challenge binary. For example, if an exploit was succeeding against 2 of
the 6 opponents, this number would be 1.33.
The round after a patch was submitted or an IDS rule was updated for a
challenge binary, that challenge binary's availability score would be set
to 0 and no exploits would be scheduled (either by the patching team or
against the patching team) for that round.
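For concreteness, the scoring rule above can be sketched in a few lines of
Python (the function name and signature are ours, not DARPA's):

```python
def cb_round_score(perf, func, exploited, hits, num_opponents):
    """Sketch of the per-round, per-challenge-binary score described
    above (function name and signature are ours, not DARPA's)."""
    availability = min(perf, func)            # worst of the two impacts
    security = 1 if exploited else 2          # halved when exploited
    evaluation = 1.0 + hits / num_opponents   # ranges from 1.0 to 2.0
    return availability * security * evaluation

# A perfect, unexploited patch whose exploit hits 2 of 6 opponents gets
# the 1.33 evaluation factor from the example above:
score = cb_round_score(100, 100, False, 2, 6)   # 100 * 2 * 1.333...
```

Note how multiplicative the formula is: a single round of being exploited
halves the whole CB score, and a broken patch (availability 0) zeroes it.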
On the surface, this seems like a simple formula. However, there were
several complications:
- DARPA did not disclose the formula used to calculate the performance or
functionality scores prior to the CFE. While this was presumably done to
avoid letting teams "game the system", it led to an astonishing amount of
uncertainty regarding patching. Teams knew that they had some "free"
overhead (20% for the file size, and 5% each for runtime overhead and
memory overhead), but did not know how additional overhead would be
penalized.
- The runtime overhead included overhead introduced by the network IDS as
it matched submitted network rules. This overhead was fundamentally
unmeasurable before the game. During the CFE, it turned out that the
overhead introduced by network rules from one service could actually
influence overhead of *other* services. All this uncertainty kept all but
two teams from using the network IDS during the CFE.
- The memory and runtime overhead was astonishingly hard to measure. As we
discuss later, local measurements were so unreliable that we actually
decided not to test our own patches for performance during the CFE, but
optimized the techniques as much as possible beforehand and relied on the
TI feedback for already-submitted patches. As we'll also discuss, DARPA
themselves had some trouble keeping the measurements constant, with
measurements between teams (for identical, unpatched binaries) varying by
significant amounts.
Overall, the performance scoring was both a strength and a weakness of the
game. Because of the ease with which exploits could be "broken" by simple
patches, it was critical to have some disincentives against patching.
However, the measurement of such patches is a very hard problem, and led
to some chaos.
--[ 001.003 - Visibility
With the CGC, DARPA pioneered something called the "Consensus Evaluation".
This had many implications for many parts of the game, but the biggest one
was a slight change to the "visibility" of certain aspects of the game to
the different teams. In this section, we'll detail what information was
available to competitors during the game.
All interaction with the game was done via the CGC API (what DARPA called
the Team Interface, or TI). The TI provided the following information:
- The Challenges: all teams could retrieve the challenges that were "in
play" at any given moment.
- Patched Binaries: all teams could retrieve the patched binaries of *all*
opponents. This prevented opponents from deploying "chump" patches that
relied on simple security through obscurity.
- IDS rules: all teams could retrieve the IDS rules that were being fielded
by their opponents.
- Availability impacts: a team could see the runtime overhead, memory
overhead, and functionality impact of their submitted patches.
- Crashes: a team could see what signals their binaries crashed with as a
result of processing network traffic. This could give hints for whether
or not exploits were being thrown against a challenge binary.
- Network Traffic: a team could get most of the network traffic, including
all exploits and most of the functionality polls (DARPA reserved the
right to keep some of them secret).
- POV Feedback: a team could get the results of their own exploits against
other teams.
- Round scores: teams could get the overall round scores of all
competitors.
This level of visibility is different from that of your "average" CTF.
Specifically, the big difference is that the patches submitted by the Cyber
Reasoning Systems were made available to opponents. In theory, this would
allow opponents to analyze, and try to bypass, a CRS' patches. In practice,
as we discuss in the Patching section, patched binaries were too
"dangerous" to analyze. Teams were allowed to add anti-reversing code, and,
in our experience, it was easy to find ways to crash analysis tools, or
"clog" them, causing them to slow down or use extra resources. We did not
dare to run our heavier analysis on the patched binaries, and we are not
aware of any team that did. Instead, our heavy analysis was run strictly on
the original, unpatched binaries, and the patched binaries were run through
a quick analysis that would fix up memory offsets and so forth in the case
of "chump" patches.
--[ 001.004 - Practice
Full autonomy is difficult to achieve. To mitigate some of this difficulty,
DARPA provided CGC practice sessions through a "sparring partner". In the
months leading up to the final event, the sparring partner would connect to
our Cyber Reasoning System at random, unannounced times and begin a game.
The presence of the sparring partner was immensely helpful in debugging
interactions with the API. However, the unannounced nature of these events
meant that it was rarely possible to be ready for this debugging.
Additionally, since DARPA had only one sparring partner set up, and teams
had to take turns interacting with it, the sparring partner sessions were
very short (generally, about 30 minutes, or 5 rounds). The reduced length
of these sessions made it more difficult to truly stress-test the systems,
and several teams showed signs of failure due to overload during the final
event itself.
We mostly used the practice sessions to compare DARPA's calculated
overhead against our own estimates. This allowed us to get a vague idea of
how DARPA calculated overhead, and to tailor our patching techniques
accordingly.
--[ 001.005 - DECREE OS
The binaries that made up the Cyber Grand Challenge's challenges were not
standard Linux binaries. Instead, they were binaries for an OS, called
DECREE, that was created specifically for the Cyber Grand Challenge. This
OS was extremely simple: there were 7 system calls and no persistence
(i.e., no filesystem or other storage). The DECREE syscalls are:
1. terminate: the equivalent of exit()
2. transmit: the equivalent of send()
3. receive: the equivalent of recv()
4. fdwait: the equivalent of select()
5. allocate: the equivalent of mmap()
6. deallocate: the equivalent of munmap()
7. random: a system call that would generate random data
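As a quick reference, the mapping can be written down as a table (the
syscall numbers below simply follow the enumeration above; treat the exact
numbering as illustrative):

```python
# DECREE syscall number -> (DECREE name, closest Linux equivalent).
# Numbering follows the enumeration above; treat it as illustrative.
DECREE_SYSCALLS = {
    1: ("terminate",  "exit"),
    2: ("transmit",   "send"),
    3: ("receive",    "recv"),
    4: ("fdwait",     "select"),
    5: ("allocate",   "mmap"),
    6: ("deallocate", "munmap"),
    7: ("random",     None),  # no direct equivalent: fills a buffer
                              # with kernel-generated random data
}
```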
The hardware platform for the binaries was 32-bit x86. Challenge authors
were not allowed to include inline assembly code, only code which was
produced by clang or included in a library provided by DARPA. The provided
math library included support for operations such as floating-point
logarithms and trigonometric functions. All other code had to be produced
directly by the compiler (clang).
This simple environment model allowed competitors to focus on their program
analysis techniques, rather than the implementation details that go along
with supporting complex OS environments. The DECREE OS was, in the opinion of
many of us, the biggest thing that made the CGC possible with humanity's
current level of technology.
--[ 002 - Finding Bugs
-----------------------------------------
| Driller |
| |
| ++++++++++++++ ++++++++++++++ |
| + angr + =====> + AFL + |
| + + + + |
| + Symbolic + + Genetic + |
| + Tracing + <===== + Fuzzing + |
| ++++++++++++++ ++++++++++++++ |
| |
-----------------------------------------
There are a few things we considered when deciding how we wanted to find
bugs in the Cyber Grand Challenge. Firstly, we would need to craft exploits
using the bugs we found; as a result, we needed techniques that generate
inputs which actually trigger the bugs, not ones that merely point out that
there could be a bug. Secondly, the bugs might be guarded by specific
checks, such as matching a command argument, a password, or a checksum.
Lastly, the programs we needed to analyze might be large, so we needed
techniques that scale well.
The automated bug finding techniques can be divided into three groups:
static analysis, fuzzing, and symbolic execution. Static analysis is not
too useful as it doesn't generate inputs which actually trigger the bug.
Symbolic execution is great for generating inputs which pass difficult
checks, however, it scales poorly in large programs. Fuzzing can handle
fairly large programs, but struggles to get past difficult checks. The
solution we came up with was to combine fuzzing and symbolic execution
into a state-of-the-art guided fuzzer called Driller. Driller uses a
mutational fuzzer to exercise components within the binary, and then uses
symbolic execution to find inputs which can reach a different component.
Driller leverages a popular off-the-shelf fuzzer, American Fuzzy Lop. AFL
uses instrumentation to identify the transitions that a particular input
exercises when it is passed to the program. These transitions are tuples
of source and destination basic blocks in the control flow graph. New
transition tuples often represent functionality, or code paths, that have
not been exercised before; logically, inputs containing new transition
tuples are prioritized by the fuzzer.
To facilitate the instrumentation, we use a fork of QEMU, which enables the
execution of DECREE binaries. Some minor modifications were made to the
fuzzer and to the emulation of DECREE binaries to enable faster fuzzing as
well as finding deeper bugs:
- De-randomization
Randomization by the program interferes with the fuzzer's evaluation
of inputs - an input that hits an interesting transition with one
random seed, may not hit it with a different random seed. Removing
randomness allows the fuzzer to explore sections of the program which
may be guarded by randomness such as "challenge-response" exchanges.
During fuzzing, we ensure that the flag page is initialized with a
constant seed, and the random system call always returns constant
values so there is no randomness in the system. The Exploitation
component of our CRS is responsible for handling the removal of
randomness.
- Double Receive Failure
The system call receive fails after the input has been completely read
in by the program and the file descriptor is now closed. Any binary
which does not check the error codes for this failure may enter an
infinite loop, which slows down the fuzzer dramatically. To prevent
this behavior, if the receive system call fails twice because the
end-of-file has been reached, the program is terminated immediately.
- At-Receive Fork Server
AFL employs a fork server, which forks the program for each execution
of an input to speed up fuzzing by avoiding costly system calls and
initialization. Given that the binary has been de-randomized as
described above, all executions of it must be identical up until the
first call to receive, which is the first point in which non-constant
data may enter the system. This allows the fork-server to be moved
from the entry of the program, to right before the first call to
receive. If there is any costly initialization of globals and data
structures, this modification speeds up the fuzzing process greatly.
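The first two modifications can be illustrated with a toy harness (all
names and structure here are ours; the real changes were patches to AFL
and to the QEMU-based DECREE emulation):

```python
import random

class DecreeHarness:
    """Toy model of the de-randomization and double-receive tweaks
    described above (names are ours, not the actual QEMU patches)."""

    CONST_SEED = 0x4347  # any fixed value: every run must be identical

    def __init__(self, stdin_bytes):
        self.stdin = bytearray(stdin_bytes)
        self.rng = random.Random(self.CONST_SEED)  # de-randomized RNG
        self.eof_failures = 0
        self.terminated = False

    def sys_random(self, count):
        # De-randomization: the `random` syscall (and the flag page)
        # are seeded constantly, so an input's coverage is deterministic.
        return bytes(self.rng.randrange(256) for _ in range(count))

    def sys_receive(self, count):
        # Double-receive failure: once stdin is exhausted, a second
        # failing receive terminates the target instead of letting a
        # read loop that ignores error codes spin forever.
        if not self.stdin:
            self.eof_failures += 1
            if self.eof_failures >= 2:
                self.terminated = True
            return b""
        data = bytes(self.stdin[:count])
        del self.stdin[:count]
        return data

# Two runs on the same input observe identical "random" data:
a, b = DecreeHarness(b"hi"), DecreeHarness(b"hi")
identical = a.sys_random(16) == b.sys_random(16)

# A target that keeps reading past EOF gets cut off on the second
# failed receive:
h = DecreeHarness(b"X")
first = h.sys_receive(4)    # b"X"
h.sys_receive(4)            # first EOF failure
h.sys_receive(4)            # second EOF failure: target terminated
```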
Network traffic can contain valuable seeds which can be given to the
fuzzer to greatly increase the fuzzing effectiveness. Functionality tests
exercise deep functionality within the program and network traffic from
exploits may exercise the particular functionality which is buggy. To
generate seeds from the traffic, each input to the program is run with the
instrumentation from QEMU to identify if it hits any transitions which
have not been found before. If this condition is met, the input is
considered interesting and is added as a seed for the fuzzer.
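The triage of network traffic into seeds can be sketched as follows
(hypothetical helper names; the real implementation reads transitions out
of the QEMU instrumentation rather than a Python list):

```python
def transition_tuples(block_trace):
    """Turn a trace of basic-block addresses into AFL-style
    (source, destination) transition tuples."""
    return set(zip(block_trace, block_trace[1:]))

def triage_seed(block_trace, seen_transitions):
    """Keep a replayed network input as a fuzzer seed only if it covers
    a transition no earlier input has exercised."""
    new = transition_tuples(block_trace) - seen_transitions
    if new:
        seen_transitions |= new
        return True   # interesting: queue it as a seed
    return False

# Toy demo with made-up basic-block addresses:
seen = set()
novel = triage_seed([0x1000, 0x1010, 0x1020], seen)   # new coverage
replay = triage_seed([0x1000, 0x1010], seen)          # nothing new
```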
- Adding Symbolic Execution
Although symbolic execution is slow and costly, it is extremely powerful.
Symbolic execution uses a constraint solver to generate specific inputs
which will exercise a given path in the binary. As such, it can produce
the inputs which pass a difficult check such as a password, a magic
number, or even a checksum. However, an approach based entirely on
symbolic execution will quickly succumb to path explosion, as the number
of paths through the binary exponentially increases with each branch.
Driller mitigates the path-explosion problem by only tracing the paths
which the fuzzer, AFL, finds interesting. This set of paths is often small
enough that tracing the inputs in it is feasible within the time
constraints of the competition. During each symbolic trace, Driller
attempts to identify transitions that have not yet been exercised by the
fuzzer, and, if possible, it generates an input which will deviate from
the trace and take a new transition instead. These new inputs are fed back
into the fuzzer, where they will be further mutated to continue exercising
deeper paths, following the new transitions.
To ensure that the symbolic trace using angr's concolic execution engine
is identical to the native execution trace, we use pre-constraining. In
pre-constrained execution, each byte of input is constrained to match the
original byte of input that was used by the fuzzer. When a branch is
reached that would lead to a new transition, the pre-constraints are
removed and then the solver is queried for an input which reaches the new
transition. Pre-constraining also has the benefit of greatly improving
execution time: because all variables have only one possible value, one
does not need to perform expensive solves to determine the locations of
reads and writes to memory.
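The idea can be demonstrated with a toy "solver" (brute force here,
standing in for the SMT solver that angr actually uses; the branch and
path constraints are made-up examples):

```python
from itertools import product

def solve(constraints, nbytes):
    """Tiny brute-force 'solver' standing in for the SMT solver behind
    angr's concolic engine (illustration only)."""
    for candidate in product(range(256), repeat=nbytes):
        if all(c(candidate) for c in constraints):
            return bytes(candidate)
    return None

# Concrete input found by the fuzzer; it reaches a branch
# `if buf[0] == 0x42` but takes the uninteresting side.
trace_input = b"\x00\x07"

# Pre-constraints pin every symbolic byte to the traced input, so the
# symbolic trace follows the native execution exactly (and cheaply).
preconstraints = [lambda b, i=i, v=v: b[i] == v
                  for i, v in enumerate(trace_input)]

# Path constraint accumulated before the branch (made-up example).
path = [lambda b: b[1] < 0x10]

# The unexplored side of the branch, i.e., the new transition.
new_branch = [lambda b: b[0] == 0x42]

# Under pre-constraints the new transition is unsatisfiable, so the
# pre-constraints are dropped and the solver is asked for an input
# that deviates from the trace:
constrained = solve(path + preconstraints + new_branch, 2)   # None
deviating = solve(path + new_branch, 2)
```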
--[ 003 - The Exploiting Component
Crashes Inputs
|| ||
|| ||
======================== ||
|| || ||
|| || ||
\/ \/ \/
-------------------- -------------------- --------------------
| REX | | PovFuzzer | | Colorguard |
| | | | | |
| ++++++++++++++ | | ++++++++++++++ | | ++++++++++++++ |
| + Advanced + | | + Fuzzing + | | + Finding + |
| + AEG + | | + For + | | + Leaks + |
| + + | | + Exploits + | | + + |
| ++++++++++++++ | | ++++++++++++++ | | ++++++++++++++ |
| | | | | |
| ... | | | | |
-------------------- -------------------- --------------------
|| || ||
================================================
||
\/
POVs
In the Cyber Grand Challenge, stealing flags is a little different from
ordinary Capture the Flag games. Instead of reading a secret
"flag" file and submitting the contents to the organizers, an exploit is
demonstrated by submitting a Proof of Vulnerability (POV), which the
organizers will run.
A POV is a binary program, which will communicate with the opponent's
binary and "exploit" it. There are two ways in which a POV can be
considered successful:
- Type 1: Cause a segmentation fault where the instruction pointer and one
additional register have values which were previously negotiated
with the competition infrastructure. Must control at least
20 bits of EIP as well as 20 bits of another register.
- Type 2: Read 4 contiguous bytes from the secret flag data. The flag data
is located at 0x4347c000-0x4347cfff and is randomly initialized
by the kernel.
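The Type 1 requirement can be sketched as a qualification check (the
register masks are a stand-in for the "which bits does the input control"
information that a CRS derives from its symbolic formulas):

```python
def controlled_bits(mask):
    """Count how many bits of a 32-bit register the input controls,
    given a mask of controllable bits (the masks would come from the
    symbolic formulas at crash time; these are made-up examples)."""
    return bin(mask & 0xFFFFFFFF).count("1")

def qualifies_as_type1(eip_mask, reg_masks):
    """A crash can back a Type 1 POV only if we control at least 20
    bits of EIP and 20 bits of some other register."""
    return (controlled_bits(eip_mask) >= 20 and
            any(controlled_bits(m) >= 20 for m in reg_masks.values()))

# Full EIP control plus 20 bits of EAX qualifies; 16 bits of EIP does not.
ok = qualifies_as_type1(0xFFFFFFFF, {"eax": 0xFFFFF})
short = qualifies_as_type1(0x0000FFFF, {"eax": 0xFFFFFFFF})
```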
Different types of vulnerabilities might lend themselves to a specific
type of POV. For example, a vulnerability where the user can control the
address passed to puts() might only be usable as a Type 2 POV. On the
other hand, a stack-based buffer overflow can clearly be used to create a
Type 1 exploit simply by setting a register and EIP, but it can also be
used to create a Type 2 exploit by using Return Oriented Programming or by
jumping to shellcode which prints data from the flag page.
The basic design for Mechanical Phish's automatic exploitation is to take
crashes, triage them, and modify them to create exploits. Mechanical Phish
does not need to understand the root cause of the bug; instead, it only
needs to identify what registers and memory it controls at crash time, and
how those values can be set to produce a POV. We created two systems which
are designed to go from crashes to exploits.
- PovFuzzer
Executes the binary repeatedly, slightly modifying the input,
tracking the relationship between input bytes and registers at the
crash point. This method is fast, but cannot handle complex cases.
- Rex
Symbolically executes the input, tracking formulas for all registers
and memory values. Applies "techniques" on this state, such as
jumping to shellcode, and return oriented programming to create a POV.
This design was still missing one important thing: for some challenges,
the buggy functionality does not result in a crash! Consider a buffer
over-read, where the program reads past the end of the buffer. This may
not cause a crash, but if any copy of flag data lies there, the program
will print the secret data. To handle this, we added a third component.
- Colorguard
Traces the execution of a binary with a particular input and
checks for flag data being leaked out. If flag data is leaked, then
it uses the symbolic formulas to determine if it can produce a valid
POV.
--[ 003.001 - PovFuzzer
The PovFuzzer takes a crash and repeatedly changes a single byte at a time
until it can determine which bytes control the EIP as well as another
register. Then a Type 1 POV can be constructed which simply chooses the
input bytes that correspond to the negotiated EIP and register values,
inserts the bytes in the payload, and sends it to the target program.
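The byte-to-register search can be sketched like this (`run_to_crash` is
a hypothetical stand-in for concretely executing the target and reporting
crash-time registers; the toy target below is made up):

```python
def find_eip_bytes(crash_input, run_to_crash):
    """Locate input bytes that flow into EIP at crash time, in the
    spirit of the PovFuzzer (sketch; `run_to_crash(input) -> regs` is
    a hypothetical stand-in for concrete execution)."""
    base_eip = run_to_crash(crash_input)["eip"]
    controlling = {}
    for i in range(len(crash_input)):
        mutated = bytearray(crash_input)
        mutated[i] ^= 0xFF                 # flip one byte at a time
        regs = run_to_crash(bytes(mutated))
        if regs is None:                   # mutation killed the crash
            continue
        changed = regs["eip"] ^ base_eip
        if changed:
            controlling[i] = changed       # which EIP bits this byte moves
    return controlling

def mock_run(data):
    # Toy target: bytes 4..7 of the input land in EIP (little-endian).
    return {"eip": int.from_bytes(data[4:8], "little")}

ctrl = find_eip_bytes(b"AAAABBBB", mock_run)   # bytes 4-7 control EIP
```

Once the mapping is known, building the POV is just a matter of writing
the negotiated EIP value into the corresponding input bytes.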
For crashes which occur on a dereference of controlled data, the PovFuzzer
chooses bytes that cause the dereference to point to the flag page, in the
hope that flag data will be printed out. After pointing the dereference at
the flag page, it executes the program to check if flag data is printed
out. If so, it constructs a Type 2 POV using that input.
The PovFuzzer has many limitations. It cannot handle cases where register
values are computed in a non-trivial manner, such as through
multiplication. Furthermore, it cannot handle the construction of more
complex exploits, such as jumping to shellcode, or cases where it needs to
replay a random value printed by the target program. Even so, it is useful
for a couple of reasons. Firstly, it is much faster than Rex, because it
only needs to execute the program concretely. Secondly, although we don't
like to admit it, angr might still have occasional bugs, and the PovFuzzer
is a good fallback in those cases as it doesn't rely on angr.
--[ 003.002 - Rex
The general design of Rex is to take a crashing payload and use angr to
symbolically trace the program with it, collecting symbolic formulas for
all memory and input along the way. Once we hit the
point where the program crashes, we stop tracing, but use the constraint
solver to pick values that either make the crash a valid POV, or avoid the
crash to explore further. There are many ways we can choose to constrain
the values at this point, each of which tries to exploit the program in a
different way. These methods of exploiting the program are called
"techniques".
One quick thing to note here is that we include a constraint solver in our
POVs. By including a constraint solver we can simply add all of the
constraints collected during tracing and exploitation into the POV and
then ask the constraint solver at runtime for a solution that matches the
negotiated values. The constraint solver, as well as angr, operates on
bit-vectors, enabling the techniques to be bit-precise.
Here we will describe the various "techniques" which Rex employs.
- Circumstantial Exploit
This technique is applicable for ip-overwrite crashes and is the
simplest technique. It determines if at least 20 bits of the
instruction pointer and one register are controlled by user input. If
so, an exploit is constructed that will negotiate the register and ip
values and then solve the constraints to determine the user input
that sets them correctly.
- Shellcode Exploit
Also only applicable to ip-overwrites, this technique will search for
regions of executable memory that are controlled by user input. The
largest region of controlled executable memory is chosen and the
memory there is constrained to be a nop-sled followed by shellcode to
either prove a type-1 or type-2 vulnerability. A shellcode exploit
can function even if the user input only controls the ip value, and
not an additional register.
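As a toy illustration of the layout this technique constrains into the
controlled region (hypothetical sizes; int3 bytes stand in for real
shellcode), the payload can be sketched as:

```python
# Hypothetical sketch: lay out a nop-sled that falls through into shellcode
# placed at the end of a user-controlled executable region, as the Shellcode
# Exploit technique constrains memory to contain. Sizes are illustrative.
NOP = b"\x90"  # x86 no-op

def build_sled_payload(region_size, shellcode):
    """Fill a controlled region with a nop-sled followed by shellcode."""
    if len(shellcode) > region_size:
        raise ValueError("region too small for shellcode")
    return NOP * (region_size - len(shellcode)) + shellcode

# Jumping anywhere inside the sled eventually executes the shellcode, which
# relaxes how precisely the instruction pointer must be controlled.
payload = build_sled_payload(64, b"\xcc" * 8)  # int3 stub as stand-in
assert len(payload) == 64 and payload.endswith(b"\xcc" * 8)
```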
- ROP Exploit
Although the stack is executable by default, opponents might employ
additional protections that prevent jumping to shellcode, such as
remapping the stack, primitive Address Randomization or even some
form of Control Flow Integrity. Return Oriented Programming (ROP) can
bypass incomplete defenses and still prove vulnerabilities for
opponents that employ them. It is applicable for ip-overwrite crashes
as long as there is user data near the stack pointer, or the binary
contains a gadget to pivot the stack pointer to the user data.
- Arbitrary Read - Point to Flag
A crash that occurs when the program tries to dereference user-
controlled data is considered an "arbitrary read". In some cases, by
simply constraining the address that will be dereferenced to point at
flag data, the flag data will be leaked to stdout, enabling the
creation of a type-2 exploit. Point to Flag constrains the input to
point at the flag page, or at any copy of flag data in memory, then
uses Colorguard to determine if the new input causes an exploitable
leak.
- Arbitrary Read/Write - Exploration
In some cases, the dereference of user-controlled data can lead to a
more powerful exploit later. For example, a vtable overwrite will
first appear as an arbitrary read, but if that memory address points
to user data, then the read will result in a controlled ip. To
explore arbitrary reads/writes for a better crash, the address of the
read or write is constrained to point to user data, then the input is
re-traced as a new crash.
- Write-What-Where
If the input from the user controls both the data being written and
the address to which it is written, we want to identify valuable
targets to overwrite. This is done by symbolically exploring the
crash, to identify values in memory that influence the instruction
pointer, such as return addresses or function pointers. Other
valuable targets are pointers that are used to print data;
overwriting these can lead to type-2 exploits.
--[ 003.003 - Colorguard
As explained above, there are some challenges which include
vulnerabilities that can leak flag data, but do not cause a crash. One
challenge here is that it is difficult to detect when a leak occurs. You
can check if any 4 bytes of output data are contained in the flag page,
but this will have false positives while fuzzing, and it will miss any
case where the data is not leaked directly. The challenge authors seem to
prefer to xor the data or otherwise obfuscate the leak, maybe to prevent
such a method.
To accurately detect these leaks we chose to trace the inputs
symbolically, using angr. However, symbolic tracing is far too slow to run
on every input that the fuzzer generates. Instead, we only perform the
symbolic tracing on the inputs which the fuzzer considers "interesting".
The hope here is that a leak-causing input exercises a new transition, or
a unique number of loop iterations, and that the fuzzer will therefore
consider it interesting. There are definitely cases where this doesn't
work, but it's a fairly good heuristic for reducing the number of traces.
In an effort to further combat the slowness of symbolic execution,
Colorguard takes advantage of angr's concrete emulation mode. Since no
modification of the input has to be made if a flag leak is discovered, our
input is made entirely concrete, with the only symbolic data being that
from the flag page. This allows us to only execute symbolically
those basic blocks that touch the secret flag page contents.
Colorguard traces the entire input concretely and collects the symbolic
expression for the data that is printed to stdout. The expression is
parsed to identify any four consecutive bytes of the flag page that are
contained in the output. For each set of bytes, the solver is queried to
check if we can compute the values of the bytes only from the output data.
If so, then an exploit is crafted which solves for these four bytes after
receiving the output from the program execution.
One caveat here is that challenges may use many bytes of the flag page as
a random seed. In these cases we might see every byte of the flag page as
part of the expression for stdout. Querying the constraint solver for
every one of these consecutive four byte sequences is prohibitively slow,
so it is necessary to pre-filter such expressions. Colorguard does this
pre-filter during the trace by replacing any expression containing more
than 13 flag bytes with a new symbolic variable. The new symbolic variable
is not considered a potential leak. The number 13 was arbitrarily chosen
as it was high enough to still detect all of the leaks we had examples for,
but low enough that checking for leaks was still fast.
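The pre-filter logic can be sketched in pure Python, modeling a symbolic
expression only by the set of flag-page byte offsets it depends on (a toy
stand-in for angr's actual expression objects):

```python
# Illustrative model of Colorguard's leak pre-filter: an expression touching
# more than 13 flag bytes (the threshold discussed above) is replaced by an
# opaque fresh variable, which is never considered a potential leak.
MAX_FLAG_BYTES = 13

def prefilter(flag_byte_deps):
    """Return the dependency set to keep tracking, or an empty set if the
    expression is replaced by a fresh (opaque) variable."""
    if len(flag_byte_deps) > MAX_FLAG_BYTES:
        return frozenset()        # replaced: drop flag-byte provenance
    return frozenset(flag_byte_deps)

# A PRNG output seeded from 40 flag bytes is filtered out; a candidate
# 4-byte leak survives and can later be handed to the solver.
assert prefilter(range(40)) == frozenset()
assert prefilter({0, 1, 2, 3}) == frozenset({0, 1, 2, 3})
```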
--[ 003.004 - Challenge Response
A common pattern that still needs to be considered is where the binary
randomly chooses a value, outputs it, and then requires the user to input
that value or something computed from the random value. For example, the
binary prints "Solve the equation: 6329*4291" and then the user must input
"27157739". To handle these patterns in the exploit, we identify any
constraints that involve both user input and random data. Once identified,
we check whether the output printed up to that point contains the random
data. If so, then we have identified a challenge-response. We
will include the output as a variable in the constraints that are passed
to the exploit, and then read from stdout, adding constraints that the
output bytes match what is received during the exploit. Then the solver
can be queried to generate the input necessary for the correct "response".
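Using the article's own example, the response computation an exploit must
perform can be sketched as follows (the regex and framing are illustrative;
the real system derives the relation from the collected constraints rather
than parsing text):

```python
import re

# Hedged sketch of the challenge-response pattern described above: the
# binary prints an equation derived from random data, and the exploit must
# echo the computed answer. Here the "constraint solving" is simply
# evaluating the product.
def solve_challenge(output):
    m = re.search(rb"Solve the equation: (\d+)\*(\d+)", output)
    if m is None:
        return None
    return b"%d" % (int(m.group(1)) * int(m.group(2)))

assert solve_challenge(b"Solve the equation: 6329*4291") == b"27157739"
```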
--[ 004 - The Patching Component: Patcherex
Original
Binary (CB)
||
||
||===============================================
|| ||
|| ||
\/ \/
-------------------- -------------------- --------------------
| TECHNIQUES | | PATCHES | | BACKENDS |
| | | | | |
| ++++++++++++++ | | | | |
| + Return + | | - AddROData() | | |
| + Pointer + --------> - AddCode() | | +++++++++++++++ |
| + Encryption + | | - ... | | + Detour + |
| ++++++++++++++ | | | | + Backend + |
| | | | | + + |
| ++++++++++++++ | | | | +++++++++++++++ |
| + Transmit + | | - InsertCode() | | |
| + Protection + --------> - AddRWData() |==>| OR |
| + + | | - ... | | |
| ++++++++++++++ | | | | +++++++++++++++ |
| | | | | + Reassembler + |
| ++++++++++++++ | | | | + Backend + |
| + + | | - AddCode() | | + + |
| + Backdoor + --------> - AddRWData() | | +++++++++++++++ |
| + + | | - ... | | |
| ++++++++++++++ | | | | |
| | | | | |
| ... | | | | |
-------------------- -------------------- --------------------
||
||
\/
Replacement
Binary (RB)
Patcherex, which is built on top of angr, is the central patching system of
Mechanical Phish. As illustrated in the overview image, Patcherex is
composed of three major components: techniques, patches, and patching
backends.
A technique is the implementation of a high-level patching strategy.
Applying a technique to a binary generates a set of patches (described
below). Currently Patcherex implements three different types of
techniques:
- Generic binary hardening techniques, including Return Pointer Encryption,
Transmit Protection, Simple Control-Flow Integrity, Indirect Control-Flow
Integrity, Generic Pointer Encryption;
- Techniques aiming at preventing rivals from analyzing or stealing our
patches, including backdoors, anti-analysis techniques, etc.;
- Optimization techniques that make binaries more performant, including
constant propagation, dead assignment elimination, and redundant stack
variable removal.
A Patch is a low-level description of how a fix or an improvement should be
made on the target binary. Patcherex defines a variety of patch types to
perform tasks ranging from code/data insertion/removal to segment altering.
A backend takes patches generated from one or more techniques, and applies
them on the target binary. Two backends are available in Patcherex:
- ReassemblerBackend: this backend takes a binary, completely disassembles
it, symbolizes all code and data references among code and data regions,
and generates an assembly file. It then applies patches to the assembly
file and calls an external assembler (clang, in the CGC) to reassemble the
patched assembly into the final binary.
- DetourBackend: this backend acts as a fallback to the
ReassemblerBackend. It performs in-line hooking and detouring to apply
patches.
--[ 004.001 - Patching Techniques
In this section, we describe all techniques we implemented in Patcherex.
Obviously we tried to implement techniques that will prevent exploitation
of the given CBs, by making their bugs not exploitable. In some cases,
however, our techniques do not render the bugs completely unexploitable,
but they still force an attacker to adapt its exploits to our RBs. For
instance, some techniques introduce differences in the memory layout
between our generated RB and the original CB an attacker may have used to
develop its exploit. In addition, we try to prevent attackers from adapting
exploits to our RB by adding anti-analysis techniques inside our generated
binaries. Furthermore, although we put significant effort into minimizing
the speed and memory impact of our patches, it is often impossible to keep
the performance impact below 5% (a CB's score starts to drop once it has
more than 5% speed or memory overhead). For this reason we decided
to "optimize" the produced RB, as we will explain later.
--[ 004.001.001 - Binary Hardening Techniques
We implemented some techniques for generic binary hardening. Those general
hardening techniques, although not extremely complex, turned out to be very
useful in the CFE.
Vulnerability-targeted hardening (targeted patching) was also planned
initially. However, due to lack of manpower and fear of deploying
replacement CBs too many times for the same challenge, we did not fully
implement or test our targeted patching strategies.
- Return Pointer Encryption
This technique was designed to protect from classical stack buffer
overflows, which typically give an attacker control over an overwritten
saved return pointer. Our defense mechanism "encrypts" every return pointer
saved on the stack when a function is called, by modifying the function's
prologue. The encrypted pointer is then "decrypted" before every ret
instruction terminating the same function. For encryption and decryption we
simply xor'ed the saved return pointer with a nonce (randomly generated
during the program's startup). Since the added code is executed every time
a function is called, we take special care to minimize the performance
impact of this technique.
First of all, this technique is not applied to functions determined to be
safe. We classify a function as safe if one of these three conditions is
true:
- The function does not access any stack buffer.
- The function is called by more than 5 different functions. In this case,
we assume that the function is some standard "utility" function, and it
is unlikely that it contains bugs. Even if it contains bugs, the
performance cost of patching such a function is usually too high.
- The function is a known library function, such as printf or free. Again,
we assume that library functions are unlikely to contain bugs. These
common library functions are identified by running the functions with test
input and output pairs.
This function identification functionality is offered by a separate
component (the "Function identifier" mentioned in the "Warez" Section).
To further improve the performance, the code snippet that encrypts and
decrypts the saved return pointer uses, when possible, a "free" register to
perform its computations. This avoids saving and restoring the value of a
register every time the injected code is executed. We identify free
registers (at a specific code location) by looking for registers in which a
write operation always happens before any read operation. This analysis is
performed by exploring the binary's CFG in a depth-first manner, starting
from the analyzed code location (i.e., the location where the code
encrypting/decrypting the return pointer is injected).
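The free-register analysis can be sketched over a toy CFG (assumed
block/event encoding, not angr's actual representation): a register is free
at a location if, along every path from there, it is written before it is
ever read.

```python
# Simplified model of the free-register analysis: depth-first exploration
# from the patch site. Blocks are lists of ("read"|"write", reg) events;
# the CFG maps each block to its successors. Loops are cut by tracking
# the blocks already visited on the current path.
def is_free(reg, block, cfg, seen=None):
    if seen is None:
        seen = set()
    if block in seen:            # looped back without seeing a read: OK
        return True
    seen = seen | {block}
    for op, r in cfg["code"][block]:
        if r == reg:
            return op == "write"     # write first -> free; read first -> not
    return all(is_free(reg, succ, cfg, seen) for succ in cfg["edges"][block])

cfg = {
    "code": {
        "A": [("read", "eax")],
        "B": [("write", "ebx"), ("read", "ebx")],
        "C": [("read", "ebx")],
    },
    "edges": {"A": ["B", "C"], "B": [], "C": []},
}
assert not is_free("eax", "A", cfg)  # read in A before any write
assert not is_free("ebx", "A", cfg)  # path A->C reads ebx first
assert is_free("ebx", "B", cfg)      # at B's entry, ebx is written first
```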
Finally, to avoid negatively impacting the functionality of the binary, we
did not patch functions in which the CFG reconstruction algorithm has
problems in identifying the prologue and the epilogues. In fact, in those
cases, it could happen that if an epilogue of the analyzed function is not
identified, the encrypted return address will not be decrypted when that
epilogue is reached, and consequently the program will use the
still-encrypted return pointer on the stack as the return target. This
scenario typically happens when the compiler applies tail-call
optimizations by inserting jmp instructions at the end of a function.
- Transmit Protection
As a defense mechanism against Type 2 exploits, we inject code around the
transmit syscall so that a binary is forbidden from transmitting any 4
contiguous bytes of the flag page. The injected code uses an array to
keep track of the last transmitted bytes so that it can identify cases
in which bytes of the flag page are leaked one at a time.
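A minimal Python model of this bookkeeping (the real patch implements it
in injected assembly around the transmit syscall; the flag page here is a
stand-in):

```python
# Sliding-window check: keep the last 3 transmitted bytes so that any
# transmit completing 4 contiguous flag-page bytes -- even one byte at a
# time across calls -- is blocked.
FLAG_PAGE = bytes(range(256)) * 16   # stand-in for the 4096-byte flag page

class TransmitGuard:
    def __init__(self, flag):
        self.flag = flag
        self.window = b""
    def transmit(self, data):
        stream = self.window + data
        for i in range(len(stream) - 3):
            if stream[i:i+4] in self.flag:
                raise RuntimeError("blocked: 4 contiguous flag bytes")
        self.window = stream[-3:]    # carry the last 3 bytes forward
        return data

g = TransmitGuard(FLAG_PAGE)
g.transmit(b"hello")                 # harmless data passes
g.transmit(FLAG_PAGE[10:12])         # only 2 flag bytes so far: passes
try:
    g.transmit(FLAG_PAGE[12:14])     # completes 4 contiguous bytes
    blocked = False
except RuntimeError:
    blocked = True
assert blocked
```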
- Simple Control-Flow Integrity
To protect indirect control flow instructions (e.g., call eax, jmp ebx), we
inject, before any instruction of this kind, code that checks specific
properties of the target address (i.e., the address that the instruction
pointer will assume after the call or jump). The specific checks are:
- The target address must be an allocated address (to prevent an attacker
from using the indirect control-flow instruction to directly perform a
Type 1 attack). To do so, we try to read from the target address, so that
if the address does not point to an allocated region the program will crash
before the instruction pointer is modified. This prevents simple Type 1
exploits, because the attacker must control at least 20 bits of the
instruction pointer, and it is unlikely (but not impossible) that the
negotiated value will end up inside an allocated memory region.
- The target address must be inside the memory range where the binary's
code is typically loaded (for simplicity, we consider a "potential code"
address to be any address below 0x4347c000). To avoid breaking programs
that use dynamically allocated code, we do not perform this check if we
statically detect that the analyzed program calls the allocate syscall in
a way that
will create additional executable memory.
- The target address is not a pop instruction. As a partial mitigation
against ROP attacks, we dynamically check whether the target of indirect
calls is a pop instruction, and terminate the program if it is.
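The address-range portion of these checks can be sketched as a simple
predicate (toy Python; the real check is injected assembly, and the
readability probe and pop-gadget detection require runtime memory access,
so they are omitted here):

```python
# Toy version of the second Simple CFI check: an indirect branch target
# must lie below 0x4347c000 (the "potential code" boundary named above)
# unless the binary dynamically allocates executable memory.
CODE_CEILING = 0x4347c000

def cfi_target_ok(target, allocates_exec_mem=False):
    if allocates_exec_mem:
        return True              # dynamic code: skip the range check
    return 0 < target < CODE_CEILING

assert cfi_target_ok(0x08048000)           # typical CB code address
assert not cfi_target_ok(0x43480000)       # above the boundary: rejected
assert cfi_target_ok(0x50000000, allocates_exec_mem=True)
```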
- Uninitialized Data Cleaning
For each function, we identify all of the instructions that read and write
to stack variables, and the stack offset that is accessed. If there is any
path through the CFG such that a stack variable is read before it is
written, then we consider it possible that there is an uninitialized data
usage. For each variable that is detected in an uninitialized data usage,
we zero that variable by adding stack cleaning code at the beginning of the
function.
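The path-sensitive read-before-write analysis can be sketched over a toy
CFG (assumed block/offset encoding, not angr's actual IR):

```python
# Sketch of the Uninitialized Data Cleaning analysis: walk every CFG path
# and flag stack offsets that can be read before they are written. Blocks
# are lists of ("read"|"write", offset); loops are cut by tracking visited
# blocks per path.
def uninit_reads(entry, cfg, written=frozenset(), seen=frozenset()):
    if entry in seen:
        return set()
    found = set()
    for op, off in cfg["code"][entry]:
        if op == "write":
            written = written | {off}
        elif off not in written:
            found.add(off)
    for succ in cfg["edges"][entry]:
        found |= uninit_reads(succ, cfg, written, seen | {entry})
    return found

cfg = {
    "code": {
        "entry": [("write", -4)],
        "left":  [("read", -4)],     # initialized on this path
        "right": [("read", -8)],     # -8 never written: uninitialized
    },
    "edges": {"entry": ["left", "right"], "left": [], "right": []},
}
assert uninit_reads("entry", cfg) == {-8}
```

Every offset reported this way would then be zeroed by stack-cleaning code
added at the function's entry.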
- Stack Base Address Randomization
On program startup we add a random value (any 16-byte-aligned value
between 16 and 1024) to the stack pointer. This adds nondeterminism to the
position of the stack, hindering any exploit that makes assumptions about
the program's stack layout.
- Malloc Hardening
To interfere with exploitation of heap overflows, if we are able to
identify a malloc-like function inside the analyzed CB, we slightly modify
its behavior. In particular, we change the number of bytes allocated by a
small, pseudo-random value.
- Format String Protection
For every printf-like function identified, such as printf, snprintf, etc.,
we ensure that the function is not used to perform a "format string"
attack. Specifically, if the format string parameter is neither in the
binary's read-only memory nor a string already present in the binary, we
stop the execution of the binary if:
- The format string parameter contains a meta character (e.g., "%").
- The format string parameter points to the flag page.
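A sketch of the resulting predicate (the flag-page range is the CGC one at
0x4347c000; the trusted read-only/in-binary string handling is collapsed
into a single flag for illustration):

```python
# Hedged sketch of the printf-protection check: an untrusted format string
# is rejected if it contains a meta character or points into the flag page.
FLAG_PAGE_START, FLAG_PAGE_END = 0x4347c000, 0x4347d000

def format_string_ok(fmt_addr, fmt_bytes, in_binary_ro=False):
    if in_binary_ro:
        return True                  # trusted: shipped with the binary
    if FLAG_PAGE_START <= fmt_addr < FLAG_PAGE_END:
        return False                 # would print flag data
    return b"%" not in fmt_bytes     # attacker-supplied meta char: reject

assert format_string_ok(0x08049000, b"hello\n")
assert not format_string_ok(0x08049000, b"%x.%x.%n")
assert not format_string_ok(0x4347c100, b"whatever")
```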
--[ 004.001.002 - Adversarial Techniques
Certain techniques are introduced to prevent rivals from analyzing, or
even running, our RBs in a controlled environment, while leaving those RBs still
able to run in the real game environment. These techniques are presented in
this section.
We add some code, executed before the original entry point of the binary,
to interfere with analyses that other teams could perform on our patched
binary. Specifically, we add code to:
- Detect if the binary is executed using QEMU or PIN. To do so, we probe
the implementation of different aspects that are difficult to emulate
correctly, such as segment registers, transmission of partially allocated
memory regions, and syscall error code values in case of "double failure".
In addition, we add some code triggering a previously unknown
implementation bug in QEMU, making it stall. Specifically, during the
development of our CRS, we found that QEMU, when using user-mode
emulation, does not correctly handle taking the square root of an
"un-normal" floating point number, that is, a nonzero 80-bit float whose
explicit integer bit (the highest bit of the mantissa) is zero. When this
happens, QEMU will hang forever. Because of this anti-QEMU patch, some of
our RBs broke the live visualization during the CFE public event.
- Interfere with symbolic execution engines. Our adversarial code contains
self-modifying code designed to be extremely hard to simulate correctly
and efficiently by a symbolic execution engine. In addition, some added
code is specifically designed to trigger "path explosion" conditions.
- Interfere with automatic exploitation systems. We add code to transmit
the flag page to file descriptor 2 (stderr). Although data transmitted to
this file descriptor is not sent to a CRS interacting with a binary, an
opponent could mistakenly assume that any 4 contiguous bytes transmitted
from the flag page constitute a Type 2 exploit, and thus field a POV
trying to leverage this "fake" leak. In addition, we inject "fake
backdoor" code. This code is triggered if a CRS sends a specific 4-byte
sequence. When triggered, it reads from the CRS 8 bytes used to set the
program instruction pointer and the value of a register. For this reason,
this code looks easily exploitable to create a Type 1 POV; however, the
read values are xor'ed with a random value before being used to set the
instruction pointer and the register, making this code non-exploitable.
In addition, to counteract an adversary fielding one of our own patched
binaries as its own patched binary, we inject a backdoor in every fielded
patched binary. This backdoor can be used by our CRS to exploit the
patched binaries we generate, but it is designed to be extremely hard for
other teams' CRSs to exploit. The backdoor is triggered when a specific 4-byte
sequence is received. To detect this, the function wrapping the receive
syscall is modified to keep track of the first 4 bytes a program receives.
Once triggered, the backdoor sends to the CRS a "challenge" C (which is a
19-bit value), and the CRS responds with a response R (a 64-bit value).
Then, the backdoor code checks if the following condition is true:
first_32_bits_of(SHA1(pad(R,160))) == pad(C,32), where pad(A,N) is a
function padding the input value A up to N bits by adding zeros.
The challenge can easily be solved by pre-computing responses for all 2^19
possible challenges, but it is infeasible for an opponent's POV to compute
a solution for the challenge within the game-imposed 10-second timeout.
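The check can be sketched in Python. Note that the encoding details are
assumptions: the byte order, and whether pad() appends or prepends zero
bits, are not specified above; here pad(A, N) appends zeros, so pad(C, 32)
places C in the top 19 bits of a 32-bit word and the check requires the
low 13 bits of the hash prefix to be zero.

```python
import hashlib

def pad_bits(value, from_bits, to_bits):
    # pad(A, N): extend A to N bits by appending zeros (assumption)
    return value << (to_bits - from_bits)

def check_response(challenge_19, response_64):
    padded_r = pad_bits(response_64, 64, 160).to_bytes(20, "big")
    first32 = int.from_bytes(hashlib.sha1(padded_r).digest()[:4], "big")
    return first32 == pad_bits(challenge_19, 19, 32)

# Our CRS can pre-compute a response table offline: scan responses and keep
# any whose SHA1 prefix has its low 13 bits clear (expected 1 in 8192).
def find_pair(limit=200000):
    for r in range(limit):
        digest = hashlib.sha1(pad_bits(r, 64, 160).to_bytes(20, "big"))
        h = int.from_bytes(digest.digest()[:4], "big")
        if h & 0x1FFF == 0:
            return h >> 13, r    # (challenge this r answers, response)
    return None

c, r = find_pair()
assert check_response(c, r)
```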
--[ 004.001.003 - Optimization Techniques
Performance is a vital concern of our patching strategy. While we stress
the necessity of optimizing all our patching techniques, some overhead
cannot be avoided. From analyzing binaries collected from CQE and CFE
samples, we noticed that most of them are compiled with O0, i.e., without
optimization enabled. We do not know why the organizers decided not to
optimize most of the provided challenges, but we speculate that this may
have been done to leave room for optimizations and patching.
It is well-known that O0 and O1 binaries can have a huge difference in
execution time. Fortunately, some of the optimization methods used in O1
are not that difficult to perform directly on binaries. Further, angr
provides all necessary data-flow analysis techniques, which makes the whole
optimization development easier. Finally, with the help of the
ReassemblerBackend, we can easily fully remove instructions that we want to
get rid of, without having to replace them with nops. Therefore, we
implemented some basic in-line binary optimization techniques in order to
optimize O0 binaries in CFE, which are described below.
- Constant Propagation. We propagate constants used as immediates in each
instruction, and eliminate unnecessary mov instructions in assembly code.
- Dead Assignment Elimination. Many unnecessary assignments occur in
unoptimized code. For example, in unoptimized code, when a function reads
arguments passed from the stack, it will always make a copy of the
argument into the local stack frame, without checking if the argument is
modified or not in the local function. We perform a conservative check
for cases where a parameter is not modified at all in a function and the
copy-to-local-frame is unnecessary. In this case, the copy-to-local
instruction is eliminated, and all references to the corresponding
variable on the local stack frame are altered to reference the original
parameter on the previous stack frame. In theory this may hurt memory
locality, but we noticed an overall improvement in performance in our
off-line tests.
- Redundant Stack Variable Removal. In unoptimized code, registers are not
allocated optimally, and usually many registers end up not being used. We
perform a data-flow analysis on individual functions, and try to replace
stack variables with registers. This technique works well with variables
accessed within tight loops. Empirically speaking, this technique
contributes the most to the overall performance gain we have seen during
testing.
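To illustrate the flavor of these passes, a toy constant-propagation pass
over a three-address pseudo-assembly might look like this (an assumed IR
for illustration, not Patcherex's actual representation, which operates on
real x86):

```python
# Toy constant propagation: "mov reg, imm" facts are recorded, and movs
# that only shuttle a known constant between registers are folded into
# direct immediate loads. A real pass also needs to handle memory, flags,
# and partial-register writes.
def propagate_constants(insns):
    consts, out = {}, []
    for op, dst, src in insns:
        if op == "mov" and isinstance(src, int):
            consts[dst] = src                 # record constant definition
            out.append((op, dst, src))
        elif op == "mov" and src in consts:
            consts[dst] = consts[src]         # fold: load constant directly
            out.append(("mov", dst, consts[src]))
        else:
            consts.pop(dst, None)             # dst clobbered: forget it
            out.append((op, dst, src))
    return out

code = [("mov", "eax", 5), ("mov", "ebx", "eax"), ("add", "ecx", "ebx")]
assert propagate_constants(code)[1] == ("mov", "ebx", 5)
```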
Thanks to these optimizations, our patches often had *zero* overall
performance overhead.
--[ 004.002 - Patches
The techniques presented in the previous section return, as an output,
lists of patches. In Patcherex, a patch is a single modification to a
binary.
The most important types of patches are:
- InsertCodePatch: add some code that is going to be executed before an
instruction at a specific address.
- AddEntryPointPatch: add some code that is going to be executed before the
original entry point of the binary.
- AddCodePatch: add some code that other patches can use.
- AddRWData: add some readable and writable data that other patches can
use.
- AddROData: add some read-only data that other patches can use.
Patches can refer to each other using a simple symbol system. For instance,
code injected by an InsertCodePatch can contain an instruction like call
check_function. In this example, this call instruction will call the code
contained in an InsertCodePatch named check_function.
--[ 004.003 - Backends
We implemented two different backends to inject different patches. The
DetourBackend adds patches by inserting jumps inside the original code,
whereas the ReassemblerBackend adds code by disassembling and then
reassembling the original binary. The DetourBackend generates bigger (thus
more memory-hungry) and slower binaries (and in some rare cases it cannot
insert certain patches); however, it is slightly more reliable than the
ReassemblerBackend (i.e., it breaks functionality in slightly fewer
binaries).
The DetourBackend adds patches by inserting jumps inside the original
code. To avoid breaking the original binary, information from its CFG is
used to avoid having the added jmp instruction overwrite bytes that span
two basic blocks. The added jmp instruction points to an added code
segment in which
first the code overwritten by the added jmp and then the injected code is
executed. At the end of the injected code, an additional jmp instruction
brings the instruction pointer back to its normal flow.
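At the byte level, the detour described above can be sketched as follows
(addresses and instruction bytes are illustrative; the displaced
instructions must cover the 5-byte jmp that overwrites them):

```python
# Byte-level sketch of a detour: overwrite whole instructions at the patch
# site with a rel32 jmp to a trampoline that runs the displaced bytes, then
# the injected code, then jumps back to the first non-displaced instruction.
def rel32_jmp(src, dst):
    # E9 <rel32>: rel32 is measured from the end of the 5-byte instruction
    return b"\xe9" + ((dst - (src + 5)) & 0xFFFFFFFF).to_bytes(4, "little")

def build_detour(site, displaced, payload, trampoline_addr):
    assert len(displaced) >= 5        # must cover the 5-byte jmp we write
    jmp_out = rel32_jmp(site, trampoline_addr)
    jmp_out += b"\x90" * (len(displaced) - 5)      # nop-pad the remainder
    jmp_back = rel32_jmp(trampoline_addr + len(displaced) + len(payload),
                         site + len(displaced))
    return jmp_out, displaced + payload + jmp_back

jmp_out, trampoline = build_detour(
    site=0x08048100,
    displaced=b"\x55\x89\xe5\x83\xec\x10",  # push ebp; mov ebp,esp; sub esp,0x10
    payload=b"\x90\x90",                    # stand-in for injected code
    trampoline_addr=0x08060000)
assert len(jmp_out) == 6 and trampoline.startswith(b"\x55\x89\xe5")
```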
In some cases, when the basic block that needs to be modified is too small,
this backend may fail to apply an InsertCodePatch. This requires special
handling, since patches are not, in general, independent (a patch may
require the presence of another patch in order not to break the
functionality of the binary). For this reason, when this backend fails to
insert a patch, the patches "depending" on the failed one are not applied
to the binary.
The ReassemblerBackend fully disassembles the target binary, applies all
patches on the generated assembly, and then assembles the assembly back
into a new binary. This is the primary patching backend we used in the CFE.
Being able to fully reassemble binaries greatly reduces the performance hit
introduced by our patches, and enables binary optimization, which improves
the performance of our RBs even further. Also, reassembling usually changes
base addresses and function offsets, which achieves a certain level of
"security by obscurity" -- rivals will have to analyze our RBs if they want
to properly adapt their code-reusing and data-reusing attacks.
We provide an empirical solution for binary reassembling that works on
almost every binary from CQE and CFE samples. The technique is open-sourced
as a component in angr, and, after the CGC competition, it has been
extended to work with generic x86 and x86-64 Linux binaries.
A detailed explanation as well as evaluation of this technique is published
as an academic paper [Ramblr17].
--[ 004.004 - Patching Strategy
We used Patcherex to generate, for every CB, three different Replacement
Binaries (RBs):
- Detouring RB: this RB was generated by applying, using the DetourBackend,
all the patches generated by the hardening and adversarial techniques
presented previously.
- Reassembled RB: this RB was generated by applying the same patches used
by the Detouring RB, but using the ReassemblerBackend instead of the
DetourBackend.
- Optimized Reassembled RB: this RB was generated like the Reassembled
one, but, in addition, all the patches generated by the optimization
techniques were applied.
These three patched RBs have been listed in order of decreasing performance
overhead and decreasing reliability. In other words, the Detouring RB is
the most reliable (i.e., it has the smallest probability of having broken
functionality), but it has the highest performance overhead with respect to
the original unpatched binary. On the contrary, the Optimized Reassembled
RB is the most likely to have broken functionality, but it has the lowest
performance impact.
--[ 004.005 - Replacement Binary Evaluation
During initial stages of Patcherex development, testing of the RBs was done
by an in-house developed component called Tester. Internally, Tester uses
cb-test, a utility provided by DARPA for testing a binary with
pre-generated input and output pairs, called polls. We made modifications
to cb-test, which enabled the testing of a binary and its associated IDS
rules on a single machine, whereas the default cb-test needs three
machines to test a binary with IDS rules.
Tester can perform both performance and functionality testing of the
provided binary, using the polls pre-generated from the corresponding
binary's Makefile. For functionality testing, given a binary, we randomly
pick 10 polls and check that the binary passes all the polls. For
performance testing, we compute the relative overhead of the provided
binary against the unpatched one using all the polls. However, there was
a huge discrepancy (~10%) between the performance overhead we computed and
the overhead reported during sparring partner sessions for the same binaries.
Moreover, during the sparring partner sessions, we also noticed that
performance numbers were different across different rounds for the same
binaries. Because of these discrepancies and to be conservative, for every
patching strategy, we computed the performance overhead as the maximum
overhead across all rounds of sparring partner sessions during which RBs
with the corresponding patching strategy were fielded.
During internal testing we used all the available binaries publicly
released on GitHub (some of which were designed for the CQE event, whereas
others were sample CFE challenges). To further extend our test cases, we
recompiled all the binaries using different compilation flags influencing
the optimization level used by the compiler. In fact, we noticed that
heavily optimized binaries (e.g., -O3), were significantly harder to
analyze and to patch without breaking functionality. Specifically, we used
the following compilation flags: -O0, -Os, -Oz, -O1, -O2, -O3, -Ofast.
Interestingly, we noticed that some of the binaries, when recompiled with
specific compilation flags, failed to work even when not patched.
During the final stages of Patcherex development, we noticed that almost
all generated RBs passed the functionality tests and that the performance
overhead was reasonable for all but a few binaries. This, combined with
the discrepancy inherent in performance testing, led us not to perform any
in-depth testing of our replacement binaries during the CFE.
For every successfully generated RB, Patcherex first performs a quick
functionality test, designed to spot RBs that are broken by the patching
strategy. In particular, Patcherex only checks that every
generated RB does not crash when provided with a small test set of
hardcoded input strings ("B", "\n", "\n\n\n\n\n\n", etc.)
We decided to perform only minimal functionality tests because of
performance and reliability considerations.
--[ 004.006 - Qualification Round Approaches
It is worth mentioning that for the CGC qualification round, the rules were
very different from the final round. Between this and the fact that our
analysis tools were not yet mature, it was necessary for us to approach
patching very differently from the final-round approaches previously
described.
In the qualification round, the only criterion for "exploitation" was a
crash. If you could crash a binary, it meant that you could exploit it, and
if your binary could crash, it meant that you were vulnerable. Furthermore,
the qualification scoring formula was such that your "defense" score, i.e.
how many vulnerabilities your patch protected against, was a global
multiplier for your score between zero and one. This meant that if you
didn't submit a patch for a binary, or if your patch failed to protect
against any of the vulnerabilities, you received zero points for that
challenge, regardless of how well you were able to exploit it.
This is such an unconventional scoring system that when we analyzed the
(publicly available) patches produced by other teams for the qualification
round, we found that at least one qualifying team had a patching strategy
such that whenever they discovered a crash, they patched the crashing
instruction to simply call the exit syscall. This is technically not a
crash, so the teams that did this did in fact receive defense points for
the challenges, and accordingly did in fact receive a non-negligible score
for effectively stubbing out any vaguely problematic part of the binary.
Our approaches were slightly more nuanced! There were two techniques we
developed for the qualification round, one "general" technique, meaning
that it could be applied to a program without any knowledge of the
vulnerabilities in a binary, and one "targeted" technique, meaning that it
was applied based on our CRS' knowledge of a vulnerability. Each of the
techniques could produce several candidate patched binaries, so we had to
choose which one to submit in the end - our choice was based on some
rudimentary testing to try to ascertain if the binary could still crash,
and if not, to assess the performance impact of the patch.
It is important to note that for CQE, the patched binaries were tested by
the organizers against a fixed set of pre-generated exploits. For this
reason, our patched binaries just had to withstand the exploits developed
for the original, unpatched version of the program. In other words, the
attackers had no way to adapt their exploits to our patches, and so
"security through obscurity" techniques were extremely effective during
the qualification event.
--[ 004.006.001 - Fidget
Our "general" patching technique for CQE was a tool called Fidget. Fidget
was developed the summer prior to the announcement of the CGC for use in
attack-defense CTFs. Its basic intuition is that developing an attack
requires making a large number of very strong assumptions about the
internal memory layout of a program, so in many cases simply tweaking the
layout of stack variables is a reliable security-through-obscurity
technique.
At the time of the CQE, Fidget was a tool capable of expanding function
stack frames, putting unused space in between local variables stored on the
stack. This is clearly not sufficient to prevent crashes, as is necessary
for the strange qualification scoring formula. However, the tool had an
additional mode that could control the amount of padding that was inserted;
the mode that we used attempted to insert thousands of bytes of padding
into a single stack frame! The idea here is, of course, that no overflow
attack would ever include hundreds of bytes more than strictly necessary to
cause a crash.
The primary issue with Fidget is that it's pretty hard to tell that
accesses to different members or indexes of a variable are actually
accesses to the same variable! It's pretty common for Fidget to patch a
binary that uses local array and struct variables liberally, and the
resulting patched binary is hilariously broken, crashing if you so much as
blow on it. There are a huge number of heuristics we apply to try not to
separate different accesses to the same variable, but in the end, variable
detection and binary type inference are still open problems. As a result,
Fidget also has a "safe mode" that does not try to pad the space in between
variables, instead only padding the space between local variables and the
saved base pointer and return address.
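As a toy illustration of the frame-padding idea (not Fidget's actual
implementation), the Python sketch below grows a function's stack frame by
rewriting the immediate of its "sub esp, imm32" allocation. A real tool
must also fix up every esp- and ebp-relative access to locals and
arguments, which is exactly where the hard heuristics described above come
in. All byte patterns and names are illustrative:

```python
# Toy frame padding: enlarge the first "sub esp, imm32" (opcode 81 EC)
# found in a code blob so overflow payloads land in dead padding space.
import struct

SUB_ESP_IMM32 = b"\x81\xec"

def pad_stack_frame(code: bytes, pad: int = 0x1000) -> bytes:
    """Add `pad` bytes to the first 'sub esp, imm32' found in `code`."""
    idx = code.find(SUB_ESP_IMM32)
    if idx < 0:
        return code  # no frame allocation found; leave untouched
    old_size = struct.unpack_from("<I", code, idx + 2)[0]
    patched = bytearray(code)
    struct.pack_into("<I", patched, idx + 2, old_size + pad)
    return bytes(patched)
```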
We originally planned to use Fidget in the final round, since it had the
potential to disrupt exploits that overflow only from one local variable
into an adjacent one, something that none of our final-round techniques can
address. However, it was cut from our arsenal at the last minute upon the
discovery of a bug in Fidget that was more fundamental than our ability to
fix it in the limited time available! Unfortunate...
--[ 004.006.002 - CGrex
If we know that it's possible for a binary to crash at a given instruction,
why don't we just add some code to check if that specific instruction would
try to access unmapped or otherwise unusable memory, and if so exit cleanly
instead of crashing? This is exactly what CGrex does.
Our reassembler was not developed until several weeks before the CGC final
event, so for the qualification round CGrex was implemented with a
primitive version of what then became the DetourBackend. Once our CRS found
a crash, the last good instruction pointer address was sent to CGrex, which
produced a patched binary that replaced all crashing instructions with
jumps to a special inserted section that used some quirks in some syscalls
to determine if the given memory location was readable/writable/executable,
and exit cleanly if the instruction would produce a crash.
More precisely, CGrex takes, as input, a list of POVs and a CB and it
outputs a patched CB "immune" to the provided POVs. CGrex works in
five steps:
1) Run the CGC binary against a given POV using a modified version of QEMU
with improved instruction trace logging and support for CGC binaries.
2) Detect the instruction pointer where the POV generates a crash (the
"culprit instruction").
3) Extract the symbolic expression of the memory accesses performed by the
"culprit instruction" (by using Miasm). For instance, if the crashing
instruction is mov eax, [ebx*4+2] the symbolic expression would be
ebx*4+2.
4) Generate "checking" code that dynamically:
- Computes the memory accesses that the "culprit instruction" is going
to perform.
- Verifies that these memory accesses fall within allocated memory
regions (and so the "culprit instruction" is not going to crash). To
determine whether some memory is allocated, CGrex "abuses" the return
values of the random and fdwait syscalls.
In particular, these syscalls were invoked with the value to be
checked passed as one of the parameters. The kernel code handling
these functions verifies that, for instance, the pointer where the
number of random bytes returned by random is written actually points
to writable memory, and it returns a specific error code if not.
CGrex checks this error code to determine whether the tested memory
region is allocated. Special care is taken so that, whether or not
the tested memory location is allocated, the injected syscall does
not modify the state of the program.
- If a memory access outside allocated memory is detected, the
injected code simply calls exit.
5) Inject the "checking" code.
Steps 1 to 5 are repeated until the binary no longer crashes on any of the
provided POVs.
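The injected check can be modeled in a few lines of Python. In the real
patched binary, the "is this address mapped" probe is performed by abusing
the random/fdwait syscall error codes as described above; the sketch below
substitutes an explicit list of allocated ranges for that probe, and all
names are illustrative:

```python
# Model of the check CGrex injects before a "culprit instruction".
def is_mapped(addr, allocated_ranges):
    """Stand-in for the syscall-based probe used by the real injected code."""
    return any(lo <= addr < hi for lo, hi in allocated_ranges)

def guard(regs, addr_expr, allocated_ranges):
    """Return 'exit' if the access would fault (the injected code would
    invoke the exit syscall here), or 'run' to fall through to the
    original instruction."""
    addr = addr_expr(regs)  # evaluate the extracted symbolic expression
    return 'run' if is_mapped(addr, allocated_ranges) else 'exit'
```

For the mov eax, [ebx*4+2] example above, addr_expr would be
lambda r: r['ebx'] * 4 + 2.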
--[ 005 - Orchestration
+--------+ +---------+
CGC endpoints | TI API | | IDS tap |
+--------+ +---------+
. .
/ \ / \
| |
-----------------------------|--------------|------------------------------
| |
Mechanical Phish \ / \ /
' '
+------------+ +------------+
| Ambassador | |Network Dude|
+------------+ +-+----------+
| |
+------------+ +------------+ | |
| Meister | | Scriba | | |
+-----+------+ +-------+----+ | |
| | | | +--------------------------------+
+--------+--------+ | | | Worker |
| | | | |
_----------_ | | | Poll creator AFL Driller |
( )<---------+ | | ============ === ======= |
|`----------`|<-------------+ | Tester POV Tester POV Fuzzer |
| |<---------------+ ====== ========== ========== |
| Farnsworth | | Patcherex Colorguard Rex |
( ) | ========= ========== === |
`----------` +--------------------------------+
Designing a fully autonomous system is a challenging feat from an
engineering perspective too. In the scope of the CFE, the CRS was required
to run without fault for at least 10 hours. Although it was proposed to
allow debugging during the CFE, in the end no human intervention was
permitted.
To that end, we designed our CRS using a microservice-based approach. Each
logical part was split following the KISS principle ("Keep it simple,
stupid") and the Unix philosophy ("Do one thing and do it well").
Specifically, the separation of logical units allowed us to test and work
on every component in complete isolation. We leveraged Docker to run
components independently, and Kubernetes to schedule, deploy, and control
component instances across all 64 nodes provided to us.
--[ 005.001 - Components
Several different components of Mechanical Phish interacted closely
together during the CRS (see diagram above). In the following, we will
briefly talk about the role of each component.
Farnsworth is a Python-based wrapper around the PostgreSQL database, and
stores all data shared between components: CBs, POVs, crashing inputs,
synchronization structures, etc. In our design, we prohibited any direct
communication between components and required them to talk "through"
Farnsworth. Therefore, Farnsworth was a potential single point of
failure. To reduce the associated risk for the CFE, we paid particular
attention to possible database problems and mitigated them accordingly.
Ambassador was the component that talked to the CGC Team Interface (TI)
to retrieve CBs, obtain feedback, and submit RBs and POVs. In the spirit
of KISS, this component was the only part of Mechanical Phish to
communicate externally and the only source of ground truth with
respect to the game state.
Meister coordinated Mechanical Phish. For each component, a component-
specific creator decided which jobs should be run at any point in the
game, based on information obtained through Farnsworth and written by
Ambassador. Consequently, Meister decided which jobs to run based on the
priority information of each job (as specified by the creator) and usage
of the nodes in terms of CPU and memory. Note that, specifically, Meister
and its creators were entirely stateless. At any point, it could crash,
yet it would not kill existing jobs upon automatic restart if they were
still considered important by the creators.
An important task of the CRS was to select and submit the POVs and RBs,
respectively. Scriba looked at performance results of exploits and
patches and decided what and when to submit (for more details on the
selection strategy see Section 6 - Strategy). As mentioned previously,
after Scriba decided what and when to submit, Ambassador actually
submitted to the TI (as Ambassador is the only component allowed to
communicate externally).
The Network Dude component received UDP traffic coming from the IDS tap
and stored it in the database via Farnsworth. Since it was required to
receive packets at line speed, Network Dude performed neither parsing
nor analysis of the network data; instead, we relied on different
components to process the network traffic.
Worker was the executor for analysis tasks of the CRS. It wrapped tools
such as angr, Driller, and Patcherex in a generic interface so that they
could be managed easily. In fact, every Worker instance referred to an
entry in a jobs
queue specifying task arguments and type. Since some of the workers had
to execute CGC DECREE binaries for functionality and performance
evaluation, we included a DECREE virtual machine running on QEMU within a
worker.
--[ 005.002 - Dynamic Resource Allocation
Another advantage of our architecture design, alongside dependency
isolation and ease of deployment, was the possibility to dynamically scale
components to meet our needs. Except for the PostgreSQL database and some
internal Kubernetes services, all our components could run on any node
without limitation.
Furthermore, when creating a job, Meister assigned it a priority based on
the component and the current game status. For example, crash-generation
jobs (Rex) were given a lower priority if an input crash was not considered
reliable, and the analysis of a de-fielded CB was considered of no
importance at all. Intuitively, all created jobs were sorted by descending
values of priority, and scheduled through the Kubernetes API until all
nodes' resources (CPU and memory) were saturated. Once all node resources
were taken by running jobs, Meister killed jobs with lower priority to
accommodate new higher priority jobs, but it did not over-provision.
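Under the stated assumptions (jobs carry a priority plus CPU and memory
demands, and nodes have fixed capacity), the scheduling policy can be
sketched as a greedy loop in Python. Rebuilding the running set each tick
from the sorted job list is what implicitly "kills" lower-priority jobs
when resources run out, without over-provisioning. All names here are
illustrative, not Meister's actual code:

```python
# Greedy priority scheduling sketch: admit jobs by descending priority
# until CPU or memory capacity would be exceeded; everything not admitted
# is (re)killed for this tick.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int  # higher runs first
    cpu: int       # CPU units requested
    mem: int       # memory units requested

def schedule(jobs, cpu_cap, mem_cap):
    """Return the names of the jobs admitted this scheduling tick."""
    running, cpu_used, mem_used = [], 0, 0
    for job in sorted(jobs, key=lambda j: -j.priority):
        if cpu_used + job.cpu <= cpu_cap and mem_used + job.mem <= mem_cap:
            running.append(job.name)
            cpu_used += job.cpu
            mem_used += job.mem
        # else: the job stays off the nodes -- in effect, a lower-priority
        # job that was running gets killed to make room for higher ones
    return running
```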
--[ 005.003 - Fail-over
Mechanical Phish was required to run without failure for the duration of
the CFE, an estimated 10 hours. Furthermore, no debugging sessions were
permitted, and the CRS was expected to be resilient to minor hardware
failures (or risk "crashing and burning").
To improve the reliability and resiliency of Mechanical Phish, we took
various steps. First, every component, including Ambassador, Meister,
Scriba, Network Dude, and Workers, was deployed as a Docker container. All
components were designed to be entirely stateless, allowing us to restart
them and move them across nodes if necessary.
Although components could be terminated abruptly without any significant
consequence, some components were critical and were required to be running
for Mechanical Phish to function correctly: These were Ambassador, Network
Dude, Scriba, and Meister (crashing and recovering is acceptable for these
components). Fortunately, Kubernetes provided a way to define
always-running instances through DaemonSet and ReplicationController
resources. If such an instance terminated or timed out, it was
automatically relaunched on another node (to prevent Kubernetes itself
from becoming a single point of failure, Mechanical Phish used a
highly-available Kubernetes setup with multiple masters and virtual IP
addresses for access).
Naturally, the entire system could not be completely stateless, and a
single stateful component was required: Farnsworth. To prevent any
failure of the node running the PostgreSQL Docker containers or the
containers themselves, we leveraged PostgreSQL's built-in master-slave
streaming replication for a resilient system. Specifically, for the CFE,
we ran 5 instances on 5 different physical nodes evenly spread across the
rack, and an additional health-checking monitoring service. The monitor
service itself was run using a ReplicationController resource. If the
monitor considered the master database container dead, a slave instance
would be elected as the new master and a replacement slave would be
created on a healthy node. To prevent components from failing during
database disaster recovery, they accessed the database in a retry loop
with exponential back-off. In turn, this ensured that no data was lost
during the transition from slave to master.
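The retry loop itself is a standard pattern; a minimal Python sketch
follows (the delays, attempt count, and exception handling are
illustrative, not Mechanical Phish's actual code):

```python
# Retry an operation with exponential back-off so a database failover
# window is ridden out instead of crashing the component.
import time

def with_retries(op, attempts=6, base_delay=0.5):
    """Call op(); on failure sleep 0.5s, 1s, 2s, ... and retry."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** i))
```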
The CGC CFE Trials Schedule defined that specific IP addresses were
required to communicate with the CGC API. Given the distributed nature of
our CRS and the recommendation to survive failure, the IP addresses
remained the last single point of failure as specific components needed
to be run on specific physical hosts. Consequently, we used Pacemaker and
Corosync to monitor our components (Ambassador and Network Dude), and
assign the specific IP addresses as virtual IP addresses to a healthy
instance: if a node failed, the address would move to a healthy node.
--[ 006 - Strategy
---;;;;;;;-----'''''''''``' --- `' .,,ccc$hcccccc,. `' ,;;!!!'``,;;!!'
;;;;,,.,;-------''''''' ,;;!!- .zJ$$$$$$$$$$c,. `' ,;;!!!!' ,;
```' -;;;!'''''- `.,.. .zJ$$$$$$$$$$$$$c, `!!'' ,;!!'
!!- ' `,;;;;;;;;;;'''''```' ,c$$$$$$$$$$$$$$$$c, ;!!'' ,;
,;;;!!!!!!!!''``.,;;;;!'`' z$$$$???"""""'.,,.`"?$$$$$$ ``,;;!!!
;;.. --''```_..,;;! J$$$??,zcd$$$$$$$$$$$$h ``'``'
!!!!;;;;, --`!''''''' $$$$$$$$$$$$$$$$$h.`"$$h .
`'''``.,;;;!;;;--;; zF,$$$$$?????$$$$$$$?????$r ;?$$ $.
!;.,..,.````.,;;;; ,$P'J"$$$P" .,c,,.J$$$$$"',cc,_`?h.`$$ $L
'``````' .,.. ,$". $ $$P",c$$$$$$$$',$$$$$ $$ $c,
!!!!!!!!!!!!!''' J