                              ==Phrack Inc.==

               Volume 0x0d, Issue 0x42, Phile #0x0F of 0x11

|=-----------------------------------------------------------------------=|
|=--------------=[ Linux Kernel Heap Tampering Detection ]=--------------=|
|=-----------------------------------------------------------------------=|
|=------------------=[ Larry H. <larry@subreption.com> ]=----------------=|
|=-----------------------------------------------------------------------=|

------[ Index

    1 - History and background of the Linux kernel heap allocators

        1.1 - SLAB
        1.2 - SLOB
        1.3 - SLUB
        1.4 - SLQB
        1.5 - The future

    2 - Introduction: What is KERNHEAP?

    3 - Integrity assurance for kernel heap allocators

        3.1 - Meta-data protection against full and partial overwrites
        3.2 - Detection of arbitrary free pointers and freelist corruption
        3.3 - Overview of NetBSD and OpenBSD kernel heap safety checks
        3.4 - Microsoft Windows 7 kernel pool allocator safe unlinking

    4 - Sanitizing memory of the look-aside caches

    5 - Deterrence of IPC based kmalloc() overflow exploitation

    6 - Prevention of copy_to_user() and copy_from_user() abuse

    7 - Prevention of vsyscall overwrites on x86_64

    8 - Developing the right regression testsuite for KERNHEAP

    9 - The Inevitability of Failure

        9.1 - Subverting SELinux and the audit subsystem
        9.2 - Subverting AppArmor

    10 - References

    11 - Thanks and final statements

    12 - Source code

------[ 1. History and background of the Linux kernel heap allocators

Before discussing what KERNHEAP is, along with its internals and design, we will take a brief look at the background and history of the Linux kernel heap allocators.

In 1994, Jeff Bonwick from Sun Microsystems presented the SunOS 5.4 kernel heap allocator at USENIX Summer [1]. This allocator delivered higher performance by using caches to retain the invariant state of objects, and it reduced fragmentation significantly by grouping similar objects together in caches.
When memory was under stress, the allocator could check the caches for unused objects and let the system reclaim the memory (that is, shrink the caches on demand).

We will refer to the units composing the caches as "slabs". A slab comprises contiguous pages of memory. Each page in the slab holds chunks (objects or buffers) of the same size. This minimizes internal fragmentation, since a slab only contains same-sized chunks, and only the 'trailing' free space in the page is wasted until it is required for a new allocation. The following diagram shows the layout of Bonwick's slab allocator:

    +-------+
    | CACHE |
    +-------+    +---------+
    | CACHE |----|  EMPTY  |
    +-------+    +---------+    +------+      +------+
                 | PARTIAL |----| SLAB |------| PAGE | (objects)
                 +---------+    +------+      +------+   +-------+
                 |  FULL   |      ...            |-------| CHUNK |
                 +---------+                             +-------+
                                                         | CHUNK |
                                                         +-------+
                                                         | CHUNK |
                                                         +-------+
                                                            ...

These caches operated in a LIFO manner: when an allocation of a given size was requested, the allocator would look for the first available free object in the appropriate slab. This saved the cost of allocating pages and constructing the object altogether.

    "A slab consists of one or more pages of virtually contiguous memory carved up into equal-size chunks, with a reference count indicating how many of those chunks have been allocated."
    Page 5, 3.2 Slabs. [1]

Each slab was managed with a kmem_slab structure, which contained its reference count, the freelist of chunks and the linkage to the associated kmem_cache. Each chunk had a header defined as the kmem_bufctl (chunks are commonly referred to as buffers in the paper and implementation), which contained the freelist linkage, the address of the buffer and a pointer to the slab it belongs to. The following diagram shows the layout of a slab:

               .-------------------.
               | SLAB (kmem_slab)  |
               `-------+--+--------'
                      /    \
                +----+---+--+-----+
                | bufctl | bufctl |
                +-.-'----+.-'-----+
              _.-'     .-'
        +-.-'------.-'-----------------+
        |         |         | ':>=jJ6XKNM|
        | buffer  | buffer  | Unused XQNM|
        |         |         | ':>=jJ6XKNM|
        +------------------------------+
                  [ Page (s) ]

For chunk sizes smaller than 1/8 of a page (e.g. 512 bytes for x86), the meta-data of the slab is contained within the page, at the very end. The rest of the space is then divided into equally sized chunks. Because all buffers have the same size, only linkage information is required, allowing the remaining values to be computed at runtime, which saves space. The freelist pointer is stored at the end of the chunk. Bonwick states that this is because the end of a data structure is typically less active than the beginning, and because it lets debugging work even after a use-after-free has overwritten data in the buffer, on the assumption that the freelist pointer remains intact. In deliberate attack scenarios this is obviously a flawed assumption. An additional word was also reserved to hold a pointer to state information used by objects initialized through a constructor.

For larger allocations, the meta-data resides outside of the page.

The freelist management was simple: each cache maintained a circular doubly-linked list sorted to put the empty slabs (all buffers allocated) first, then the partial slabs (free and allocated buffers) and finally the full slabs (reference counter set to zero, all buffers free). The cache freelist pointer points to the first non-empty slab, and each slab then contains its own freelist. Bonwick chose this approach to simplify the memory reclaiming process.

The process of reclaiming memory started in the original kmem_cache_free() function, which checked the reference counter. If its value was zero (all buffers free), it moved the full slab to the tail of the freelist, together with the rest of the full slabs.

Section 4 of Bonwick's paper explains in detail the side effects of hardware caches and the related optimizations.
It is an interesting read because of the hardware in use at the time the paper was written. In order to optimize cache utilization and bus balance, Bonwick devised 'slab coloring'. Slab coloring is simple: when a slab is created, the first buffer starts at a different offset (referred to as the color) from the slab base (since a slab is an allocated page or set of pages, the base is always aligned to the page size).

It is interesting to note that Bonwick had already studied different approaches to detect kernel heap corruption and implemented them in the SunOS 5.4 kernel, possibly predating every other kernel in terms of heap corruption detection. Furthermore, Bonwick noted that the performance impact of these features was minimal.

    "Programming errors that corrupt the kernel heap - such as modifying freed memory, freeing a buffer twice, freeing an uninitialized pointer, or writing beyond the end of a buffer - are often difficult to debug. Fortunately, a thoroughly instrumented kernel memory allocator can detect many of these problems."
    Page 10, 6. Debugging features. [1]

The audit mode recorded the user of every allocation (an equivalent of the Linux feature that will be briefly described in the allocator subsections) and provided these traces when corruption was detected.

Invalid free pointers were detected using a hash lookup in the kmem_cache_free() function. Once an object was freed, and after its destructor was called, the allocator filled the space with 0xdeadbeef. When the object was allocated again, the pattern was verified to confirm that no modifications had occurred (that is, detection of use-after-free conditions, or more specifically write-after-free). Allocated objects were filled with 0xbaddcafe, which marked them as uninitialized.

Redzone checking was also implemented to detect overwrites past the end of an object, by adding a guard value at that position. This was verified upon free.
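
The poisoning and redzone checks described above can be sketched in a few lines of userland C. This is only an illustration of the concept, not Bonwick's code: the structure layout, the function names and the REDZONE_GUARD value are assumptions made for the example, while the two fill patterns are the ones quoted in the text.

```c
#include <stddef.h>
#include <stdint.h>

#define OBJ_WORDS      8            /* hypothetical object size, in 32-bit words */
#define FREE_PATTERN   0xdeadbeefU  /* fill for freed objects (from the text) */
#define UNINIT_PATTERN 0xbaddcafeU  /* fill for fresh allocations (from the text) */
#define REDZONE_GUARD  0xfeedfaceU  /* assumed guard value, chosen for this sketch */

struct dbg_buf {
    uint32_t words[OBJ_WORDS];  /* the object itself */
    uint32_t redzone;           /* guard word right past the end of the object */
};

static void dbg_fill(struct dbg_buf *b, uint32_t pattern)
{
    for (size_t i = 0; i < OBJ_WORDS; i++)
        b->words[i] = pattern;
}

/* Put a buffer into the "free" state: poisoned, with a valid redzone. */
void dbg_init(struct dbg_buf *b)
{
    dbg_fill(b, FREE_PATTERN);
    b->redzone = REDZONE_GUARD;
}

/* Allocation path: the free pattern must be intact (no write-after-free).
 * Returns 0 on success, -1 if the freed object was modified. */
int dbg_alloc(struct dbg_buf *b)
{
    for (size_t i = 0; i < OBJ_WORDS; i++)
        if (b->words[i] != FREE_PATTERN)
            return -1;
    dbg_fill(b, UNINIT_PATTERN);
    b->redzone = REDZONE_GUARD;
    return 0;
}

/* Free path: the redzone must be intact (no overflow past the object).
 * Returns 0 on success, -1 if the guard word was overwritten. */
int dbg_free(struct dbg_buf *b)
{
    if (b->redzone != REDZONE_GUARD)
        return -1;
    dbg_fill(b, FREE_PATTERN);
    return 0;
}
```

Note that both checks compare against fixed, publicly known values, which is adequate for catching programming errors but offers nothing against an attacker who simply rewrites the expected pattern.
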

Finally, a simple but possibly effective approach to detect memory leaks used the timestamps from the audit log to find allocations which had been alive for a suspiciously long time. In modern times, this could be implemented using a kernel thread. SunOS did it from userland via /dev/kmem, which would be unacceptable in security terms.

For more information about the concepts of slab allocation, refer to Bonwick's paper [1], which provides an in-depth overview of the theory and implementation.

---[ 1.1 SLAB

The SLAB allocator in Linux (mm/slab.c) was written by Mark Hemment in 1996-1997, and further improved through the years by Manfred Spraul and others. The design closely follows the one presented by Bonwick for his Solaris allocator. It was first integrated in the 2.2 series. This subsection avoids describing more theory than strictly necessary, but those interested in a more in-depth overview of SLAB can refer to "Understanding the Linux Virtual Memory Manager" by Mel Gorman, and its eighth chapter "Slab Allocator" [X].

The caches are defined as a kmem_cache structure, comprised of (most commonly) page sized slabs, containing initialized objects. Each cache holds its own GFP flags, the order of pages per slab (2^n), the number of objects (chunks) per slab, coloring offsets and range, a pointer to a constructor function, a printable name and linkage to other caches. Optionally, if enabled, it can define a set of fields to hold statistics and debugging related information.

Each kmem_cache has an array of kmem_list3 structures, which contain the information about the partial, full and free slab lists:

    struct kmem_list3 {
        struct list_head slabs_partial;
        struct list_head slabs_full;
        struct list_head slabs_free;
        unsigned long free_objects;
        unsigned int free_limit;
        unsigned int colour_next;
        ...
        unsigned long next_reap;
        int free_touched;
    };

These structures are initialized with kmem_list3_init(), which sets all the reference counters to zero and prepares the list3 to be linked into the nodelists of its respective cache for the proper NUMA node. This can be found in cpuup_prepare() and kmem_cache_init().

The "reaping" or draining of the cache free lists is initiated via cache_reap() and done with the drain_freelist() function, which returns the total number of slabs released. A slab is released using slab_destroy(), and allocated with the cache_grow() function for a given NUMA node, flags and cache.

The cache contains the doubly-linked lists for the partial, full and free lists, and a free object count in free_objects.

A slab is defined with the following structure:

    struct slab {
        struct list_head list;     /* linkage/pointer to freelist */
        unsigned long colouroff;   /* color / offset */
        void *s_mem;               /* start address of first object */
        unsigned int inuse;        /* num of objs active in slab */
        kmem_bufctl_t free;        /* first free chunk (or none) */
        unsigned short nodeid;     /* NUMA node id for nodelists */
    };

The list member points to the freelist the slab belongs to: partial, full or empty. The s_mem member, which already accounts for the color offset, is used to calculate the address of a specific object. The free member identifies the first free object in the slab.

The cache a slab belongs to is tracked in the page structure. The function used to retrieve the cache a potential object belongs to is virt_to_cache(), which itself relies on page_get_cache() applied to a page structure pointer. It checks that the Slab page flag is set, and takes the lru.next pointer of the head page (for compatibility with compound pages; this is no different for normal pages). The cache is set with page_set_cache(). The logic that assigns pages to a slab and cache can be seen in slab_map_pages().

The internal function used for cache shrinking is __cache_shrink(), called from kmem_cache_shrink() and during cache destruction.
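
To make the role of s_mem and free more concrete, the following is a minimal userland sketch of an index-based per-slab freelist. The toy_slab structure, the function names and the fixed OBJS_PER_SLAB are hypothetical simplifications; the real structures in mm/slab.c carry more state and are not reproduced here.

```c
#include <stddef.h>

#define OBJS_PER_SLAB 4               /* hypothetical, fixed for the sketch */
#define BUFCTL_END    (~0U)           /* marks the end of the free chain */

struct toy_slab {
    char *s_mem;                         /* start address of first object */
    size_t obj_size;                     /* size of every object in the slab */
    unsigned int inuse;                  /* number of allocated objects */
    unsigned int free;                   /* index of first free object */
    unsigned int bufctl[OBJS_PER_SLAB];  /* bufctl[i]: next free index after i */
};

void toy_slab_init(struct toy_slab *s, char *mem, size_t obj_size)
{
    s->s_mem = mem;
    s->obj_size = obj_size;
    s->inuse = 0;
    s->free = 0;
    for (unsigned int i = 0; i < OBJS_PER_SLAB; i++)
        s->bufctl[i] = (i + 1 < OBJS_PER_SLAB) ? i + 1 : BUFCTL_END;
}

/* Pop the first free index and turn it into an address via s_mem. */
void *toy_slab_alloc(struct toy_slab *s)
{
    if (s->free == BUFCTL_END)
        return NULL;                     /* slab has no free objects left */
    unsigned int idx = s->free;
    s->free = s->bufctl[idx];
    s->inuse++;
    return s->s_mem + (size_t)idx * s->obj_size;
}

/* Push the object's index back at the head of the free chain (LIFO).
 * Note that obj is trusted blindly here, as in an unhardened allocator. */
void toy_slab_free(struct toy_slab *s, void *obj)
{
    size_t off = (size_t)((char *)obj - s->s_mem);
    unsigned int idx = (unsigned int)(off / s->obj_size);
    s->bufctl[idx] = s->free;
    s->free = idx;
    s->inuse--;
}
```

Since the free chain is plain data, anything able to overwrite the bufctl entries or the free index controls which address the next allocation returns.
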

SLAB clearly scales poorly: on NUMA systems with a large number of nodes, substantial time is spent walking the nodelists, draining each freelist, and so forth. Meanwhile, it is quite likely that some of those nodes are not under memory pressure at all.

Slab management data is stored inside the slab itself when the object size is under 1/8 of PAGE_SIZE (512 bytes for x86, same as Bonwick's allocator). This is done by alloc_slabmgmt(), which either stores the management structure within the slab, or allocates space for it from the kmalloc caches (slabp_cache within the kmem_cache structure, assigned with kmem_find_general_cachep() given the slab size). Again, this is reflected in slab_destroy(), which takes care of freeing the off-slab management structure when applicable.

The interesting security impact of this logic for managing control structures is that slabs with their meta-data stored off-slab, in one of the general kmalloc caches, are exposed to potential abuse (e.g. in a slab overflow scenario in some adjacent object, the freelist pointer could be overwritten to obtain a write4 primitive during unlinking). This is one of the loopholes which KERNHEAP, as described in this paper, will close, or at the very least do everything feasible to deter reliable exploitation of.

Since the basic technical aspects of the SLAB allocator are now covered, the reader can refer to mm/slab.c in any current kernel release for further information.

---[ 1.2 SLOB

SLOB was released in November 2005, having been developed since 2003 by Matt Mackall for use in embedded systems thanks to its smaller memory footprint. It lacks the complexity of all the other allocators.

The SLOB allocator supports objects as small as 2 bytes, though this granularity is subject to architecture-dependent restrictions (alignment, etc). The author notes that it will normally be 4 bytes on 32-bit architectures, and 8 bytes on 64-bit.
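
The effect of the allocation granularity can be shown with a small round-up calculation. This is a sketch under the assumption that requests are simply rounded up to a whole number of units; the function names are ours and the per-object header discussed below is ignored.

```c
#include <stddef.h>

/* Number of allocation units needed to hold `size` bytes, rounding up. */
size_t slob_units(size_t size, size_t unit)
{
    return (size + unit - 1) / unit;
}

/* Bytes actually consumed once the request is rounded to whole units. */
size_t slob_footprint(size_t size, size_t unit)
{
    return slob_units(size, unit) * unit;
}
```

For instance, a 5 byte request consumes 8 bytes with either a 4 byte or an 8 byte unit, while a 9 byte request consumes 12 and 16 bytes respectively.
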

The chunks (referred to as blocks in his comments in mm/slob.c) are referenced from a singly-linked list within each page. His approach to reducing fragmentation is to place all objects within three distinct lists: under 256 bytes, under 1024 bytes, and any other objects of size greater than 1024 bytes.

The allocation algorithm is a classic next-fit, returning the first slab containing enough chunks to hold the object. Released objects are re-introduced into the freelist in address order.

The kmalloc and kfree layer (that is, the public API exposed by SLOB) places a 4 byte header in objects within page size, or, for larger objects, uses the lower level page allocator directly to allocate compound pages. In such cases, it stores the size in the page structure (in page->private). This poses a problem when detecting the size of an allocated object, since the slob_page and page structures are essentially the same: they form a union and the values of the structure members overlap. The size is enforced to match, but using the wrong place to store a custom value means a corrupted page state.

Before put_page() or free_pages(), SLOB clears the Slob bit, resets the mapcount atomically and sets the mapping to NULL; then the page is released back to the low-level page allocator. This prevents the overlapping fields from leading to the aforementioned corrupted state. This hack allows both the SLOB and the page allocator meta-data to coexist, resulting in a lower memory footprint and overhead.

---[ 1.3 SLUB aka The Unqueued Allocator

SLUB is currently the default allocator in several GNU/Linux distributions, including Ubuntu and Fedora. It was developed by Christopher Lameter and merged into the -mm tree in early 2007.

    "SLUB is a slab allocator that minimizes cache line usage instead of managing queues of cached objects (SLAB approach). Per cpu caching is realized using slabs of objects instead of queues of objects. SLUB can use memory efficiently and has enhanced diagnostics."
    CONFIG_SLUB documentation, Linux kernel.

The SLUB allocator was the first to introduce merging, the concept of grouping slabs of similar properties together, which reduces the number of caches present in the system as well as internal fragmentation. This, however, has detrimental security side effects which are explained in section 3.1. Fortunately, even without a patched kernel, merging can be disabled at runtime.

The debugging facilities are far more flexible than those in SLAB. They can be enabled at runtime using a boot command line option, and per cache.

DMA caches are created on demand, or not created at all if support isn't required.

Another important change is the lack of SLAB's per-node partial lists. SLUB has a single partial list, which prevents partially allocated slabs from being scattered around. This reduces internal fragmentation in such cases, since otherwise those node-local lists would only be filled when allocations happen on that particular node.

Its cache reaping performs better than SLAB's, especially on SMP systems, where it scales better. It does not require walking the lists every time a slab is to be pushed into the partial list. On non-SMP systems it doesn't use reaping at all.

Meta-data is stored using the page structure, instead of within the beginning of each slab. This allows better data alignment and, again, reduces internal fragmentation, since objects can be packed tightly together without leaving unused trailing space in the page(s). The memory required to hold control structures is much lower than with SLAB, as Lameter explains:

    "SLAB Object queues exist per node, per CPU. The alien cache queue even has a queue array that contain a queue for each processor on each node. For very large systems the number of queues and the number of objects that may be caught in those queues grows exponentially. On our systems with 1k nodes / processors we have several gigabytes just tied up for storing references to objects for those queues. This does not include the objects that could be on those queues."

To sum it up in a single paragraph: SLUB is a clever allocator designed for modern systems, to scale well, work reliably in SMP environments and reduce both the memory footprint of control and meta-data structures and internal/external fragmentation. This makes SLUB the best current target for KERNHEAP development.

---[ 1.4 SLQB

The SLQB allocator was developed by Nick Piggin to provide better scalability and avoid fragmentation as much as possible. It makes a great effort to avoid allocating compound pages, which is optimal when memory starts running low. Overall, it is a per-CPU allocator.

The structures used to define the caches are slightly different, and it shows that the allocator has been designed from the ground up to scale on high-end systems. It tries to optimize remote freeing situations (when an object is freed on a different node/CPU than the one it was allocated on). This is mostly relevant to NUMA environments. The objects most likely to be subject to this situation are long-lived ones, on systems with large numbers of processors.

It defines a slqb_page structure which "overloads" the lower level page structure, in the same fashion as SLOB does. Instead of unused padding, it introduces kmem_cache_list and freelist pointers.

For each lookaside cache, each CPU has a LIFO list of the objects local to that node (used for local allocation and freeing), free and partial page lists, a queue for objects being freed remotely and a queue of already free objects that come from the remote free queues of other CPUs. Locking is minimal, but sufficient to control cross-CPU access to these queues.

Some of the debugging facilities include tracking the user of the allocated object (storing the caller address, cpu, pid and the timestamp).
This track structure is stored within the allocated object space, which makes it subject to partial or full overwrites. As with the similar facilities in other allocators (SLAB and SLUB, since SLOB is impaired for debugging), this makes it unsuitable for security purposes.

Returning to the SLQB-specific changes, the use of a kmem_cache_cpu structure per CPU can be observed. An article at LWN.net by Jonathan Corbet from December 2008 provides a summary of the significance of this structure:

    "Within that per-CPU structure one will find a number of lists of objects. One of those (freelist) contains a list of available objects; when a request is made to allocate an object, the free list will be consulted first. When objects are freed, they are returned to this list. Since this list is part of a per-CPU data structure, objects normally remain on the same processor, minimizing cache line bouncing. More importantly, the allocation decisions are all done per-CPU, with no bad cache behavior and no locking required beyond the disabling of interrupts. The free list is managed as a stack, so allocation requests will return the most recently freed objects; again, this approach is taken in an attempt to optimize memory cache behavior." [5]

In order to cope with memory stress situations, the freelists can be flushed to return unused partial objects back to the page allocator when necessary. This works by moving the object from the CPU-local freelist to the remote freelist (rlist), and keeping a reference in the remote_free list.

The SLQB allocator is described in depth in the aforementioned article and in the source code comments. Feel free to refer to these sources for more information about its design and implementation. The original RFC and patch can be found at http://lkml.org/lkml/2008/12/11/417

---[ 1.5 The future

As architectures and computing platforms evolve, so will the allocators in the Linux kernel.
The current development process doesn't contribute to a more stable, smaller set of options, and it is inevitable that new allocators will be introduced into the mainline kernel, possibly specialized for certain environments.

In the short term, SLUB will remain the default, and there seems to be an intention to remove SLOB. It is unclear whether SLQB will see widespread deployment.

Newly developed allocators will require careful assessment, since KERNHEAP is tied to certain assumptions about their internals. For instance, we depend on the ability to track object sizes properly, and this remains untested on some obscure architectures, NUMA systems and so forth. Even a simple allocator like SLOB posed a challenge when implementing safety checks, since its internals are greatly convoluted. Thus, it is uncertain whether future allocators will require a redesign of the concepts composing KERNHEAP.

------[ 2. Introduction: What is KERNHEAP?

As of April 2009, no operating system has implemented any form of hardening in its kernel heap management interfaces. Attacks against the SLAB allocator in Linux have been documented and made available to the public as early as 2005, and used to develop highly reliable exploits abusing different kernel vulnerabilities that involve heap allocated buffers. The first public exploit making use of kmalloc() exploitation techniques was the MCAST_MSFILTER exploit by twiz [10].

In January 2009, an obscure, non-advertised advisory surfaced about a buffer overflow in the SCTP implementation of the Linux kernel, which could be abused remotely, provided that an SCTP based service was listening on the target host. More specifically, the issue was located in the code which processes the stream numbers contained in FORWARD-TSN chunks.

During an SCTP association, a client sends an INIT chunk specifying a number of inbound and outbound streams, which causes the kernel on the server to allocate space for them via kmalloc().
After the association is made effective (involving the exchange of INIT-ACK, COOKIE and COOKIE-ECHO chunks), the attacker can send a FORWARD-TSN chunk with more streams than those specified initially in the INIT chunk, leading to an overflow condition which can be used to overwrite adjacent heap objects with attacker controlled data. The vulnerability itself had certain quirks and requirements which made it a good candidate for a complex exploit, unlikely to be available to the general public, and thus restricted to circles more technically adept at kernel exploitation.

Nonetheless, reliable exploits for this issue were developed and successfully used in different scenarios (including all major distributions, such as Red Hat with SELinux enabled, and Ubuntu with AppArmor).

At some point, Brad Spengler expressed interest in a potential protection against this vulnerability class, and asked the author what kind of measures could be taken to prevent new kernel-land heap related bugs from being exploited. Shortly afterwards, KERNHEAP was born.

After development started, a fully remote exploit against the SCTP flaw surfaced, developed by sgrakkyu [15]. In private discussions with a few individuals, a technique for executing a successful attack remotely had been proposed: overwrite a syscall pointer with an attacker controlled location (like a hook) to safely execute the payload outside of the interrupt context. This is exactly what sgrakkyu implemented for x86_64, using the vsyscall table, which bypasses CONFIG_DEBUG_RODATA (read-only .rodata) restrictions altogether. His exploit exposed not only the flawed nature of the vulnerability classification process of several organizations and the hypocritical and unethical handling of security flaws by the Linux kernel developers, but also the futility of SELinux and other security models against kernel vulnerabilities.

In order to prevent and detect exploitation of this class of security flaws in the kernel, a new set of protections had to be designed and implemented: KERNHEAP.

KERNHEAP encompasses different concepts to prevent and detect heap overflows in the Linux kernel, as well as other well known heap related vulnerabilities, namely double frees, partial overwrites, etc.

These concepts have been implemented by modifying the different allocators, as well as common interfaces. They not only prevent generic forms of memory corruption but also harden specific areas of the kernel which have been used, or could potentially be used, to leverage attacks that corrupt the heap: for instance, the IPC subsystem, the copy_to_user() and copy_from_user() APIs, and others.

This is still ongoing research, and the Linux kernel is an ever evolving project which poses significant challenges. The inclusion of new allocators will always carry the risk of new issues surfacing, requiring these protections to be adapted, or new ones to be developed for them.

------[ 3. Integrity assurance for kernel heap allocators

---[ 3.1 Meta-data protection against full and partial overwrites

Given the current (yet ever changing) upstream design of the kernel allocators (SLUB, SLAB, SLOB, future SLQB, etc.), we assume:

    1. A set of caches exists which hold dynamically allocated slabs, composed of one or more physically contiguous pages, containing same size chunks.

    2. These are initialized by default or created explicitly, always with a known size. For example, multiple default caches exist to hold slabs of common sizes which are a power of two (32, 64, 128, 256 and so forth).

    3. These caches grow or shrink in size as required by the allocator.

    4. At the end of a kmem cache's life, it must be destroyed and its slabs released. The linked list of slabs is implicitly trusted in this context.

    5. The caches can be allocated contiguously, or adjacent to an actual chain of slabs from another cache. Because the current kmem_cache structure holds potentially harmful information (including a pointer to the constructor of the cache), this could be leveraged in an attack to subvert the execution flow.

    6. The debugging facilities of these allocators provide merely informational value with their error detection mechanisms, which are also inherently insecure. They are not enabled by default and have an extremely high performance impact (accounting for a slowdown of up to 50 to 70%). In addition, they leak information which could be invaluable for a local attacker (e.g. fixed known values).

We are facing multiple issues in this scenario. First, the kernel developers expect third-party code to handle situations like a cache being destroyed while an object is being allocated. Albeit highly unusual, such circumstances (like {6}) can arise provided the right conditions are present.

In order to prevent {5} from being abused, we are left with two realistic possibilities to deter a potential attack: randomize the allocator routines (see ASLR in the PaX documentation [7] for the concept) or introduce a guard (known in modern times as a 'cookie') which contains information to validate the integrity of the kmem_cache structure.

Thus, a decision was made to introduce a guard which works in 'cascade':

        +--------------+
        | global guard |------------------+
        +--------------| kmem_cache guard |------------+
                       +------------------| slab guard | ...
                                          +------------+

The idea is simple: break down every potential path of abuse and add integrity information to each lower level structure. By deploying a check which relies on all the upper level guards, we can detect corruption of the data at any stage.

In addition, this makes the safety checks more resilient against information leaks, since an attacker is forced to access and read a wider range of values than one single cookie. Such data could be out of reach from the context of the execution path being abused.
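
The cascade can be illustrated with a small userland sketch. Everything in it is an assumption made for the example: the g_cache and g_slab structures, the fields covered by each guard, and in particular the mix() step, which is a placeholder and not the hash function KERNHEAP uses.

```c
#include <stdint.h>

/* Placeholder mixing step for the sketch, NOT the hash KERNHEAP uses. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9e3779b1U;
    h ^= h >> 15;
    return h;
}

/* Fold a pointer-sized value into 32 bits (safe on 32 and 64-bit hosts). */
static uint32_t fold_ptr(uintptr_t p)
{
    uint64_t x = (uint64_t)p;
    return (uint32_t)(x ^ (x >> 32));
}

struct g_cache {
    uint32_t  guard;     /* derived from the global guard */
    uint32_t  obj_size;
    uintptr_t ctor;      /* constructor pointer, an attractive target */
};

struct g_slab {
    uint32_t  guard;     /* derived from the owning cache's guard */
    uintptr_t s_mem;
    uint32_t  free_idx;
};

uint32_t cache_guard(uint32_t global, const struct g_cache *c)
{
    return mix(mix(global, c->obj_size), fold_ptr(c->ctor));
}

uint32_t slab_guard(const struct g_cache *c, const struct g_slab *s)
{
    return mix(mix(c->guard, fold_ptr(s->s_mem)), s->free_idx);
}

/* Compute both guards after a legitimate update of the structures. */
void seal(uint32_t global, struct g_cache *c, struct g_slab *s)
{
    c->guard = cache_guard(global, c);
    s->guard = slab_guard(c, s);
}

/* Walk the chain top-down; returns 1 if every level still matches. */
int chain_intact(uint32_t global, const struct g_cache *c,
                 const struct g_slab *s)
{
    if (c->guard != cache_guard(global, c))
        return 0;
    return s->guard == slab_guard(c, s);
}
```

Because the slab guard depends on the cache guard, which in turn depends on the global guard, forging a valid slab-level value requires knowledge of every level above it.
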

The global guard is initialized in the kernheap_init() function, called from init/main.c during kernel start. In order to gather entropy for its value, we need to initialize the random32 PRNG earlier than in a default, upstream kernel. On x86, this is done with the rdtsc value xor'd with the jiffies value, and the PRNG is then reseeded multiple times during different stages of kernel initialization, ensuring we have a decent amount of entropy to avoid an easily predictable result.

Unfortunately, an architecture-independent method to seed the PRNG hasn't been devised yet. Right now this is specific to platforms with a working get_cycles() implementation (otherwise it falls back to a more insecure seeding using different counters), though it is intended to support all architectures where PaX is currently supported.

The slab and kmem_cache structures are defined in mm/slab.c and mm/slub.c for the SLAB and SLUB allocators, respectively. The kernel developers have chosen to make their type information static to those files, and not available in the mm/slab.h header file. Since the available allocators generally have different internals, they only export a common API (even though a few functions remain no-ops, for example in SLOB).

A guard field has been added at the start of the kmem_cache structure, and other structures might be modified to include a similar field (depending on the allocator). The approach is to add a guard anywhere it can provide a balance between performance (including memory footprint) and security.

In order to calculate the final checksum used in each kmem_cache and its slabs, a high performance, yet collision resistant hash function was required. This instantly left options such as the CRC family, FNV, etc. out, since they are inefficient for our purposes. Therefore, Murmur2 was chosen [9]. It's an exceptionally fast, yet simple algorithm created by Austin Appleby, currently used by libmemcached and other software.
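
For reference, a word-oriented variant of the 32-bit Murmur2 algorithm might look like the sketch below. The murmur2_words() name and interface are ours and this is not the KERNHEAP implementation; the constants and mixing steps follow the public Murmur2 algorithm, specialized to inputs that are whole 32-bit words. Keep in mind that Murmur2 is a non-cryptographic hash, so the guard's strength rests on the secrecy of the seed.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Murmur2-style hash over an array of 32-bit words (illustrative helper).
 * It follows the byte-oriented algorithm for inputs whose length is a
 * multiple of four, reading each word in native byte order.
 */
uint32_t murmur2_words(const uint32_t *words, size_t nwords, uint32_t seed)
{
    const uint32_t m = 0x5bd1e995U;              /* Murmur2 multiplier */
    const int r = 24;                            /* Murmur2 shift */
    uint32_t h = seed ^ (uint32_t)(nwords * 4);  /* seed xor byte length */

    for (size_t i = 0; i < nwords; i++) {
        uint32_t k = words[i];

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;
    }

    /* Final avalanche of the accumulated state. */
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return h;
}
```

A guard could then be derived by hashing the handful of structure fields to protect, using a boot-time random value as the seed.
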
Custom optimized versions were developed to calculate hashes for the slab and cache structures, taking advantage of the fact that only a relatively small set of word values needs to be hashed.

The coverage of the guard checks is obviously limited to the meta-data, but yields reliable protection for all objects of 1/8 page size and any adjacent ones, during allocation and release operations.

The copy_from_user() and copy_to_user() functions have been modified to include a slab and cache integrity check as well, which is orthogonal to the boundary enforcement modifications explained in another section of this paper.

The redzone approach used by the SLAB/SLUB/SLQB allocators relies on fixed, known values to detect certain scenarios (explained in the next subsection). The values are 64 bits long:

    #define RED_INACTIVE    0x09F911029D74E35BULL
    #define RED_ACTIVE      0xD84156C5635688C0ULL

This is clearly suitable for debugging purposes, but largely ineffective for security. An immediate improvement would be to generate these values at runtime, but it would still be possible to avoid writing over them while modifying the meta-data. This is exactly what is prevented by using a checksum guard, which depends on a cookie generated at boot time.

The example below shows an overwrite of an object in the 64-byte kmalloc cache (named size-64 by SLAB):

    slab error in verify_redzone_free(): cache `size-64': memory outside object was overwritten
    Pid: 6643, comm: insmod Not tainted 2.6.29.2-grsec #1
    Call Trace:
     [<c0889a81>] __slab_error+0x1a/0x1c
     [<c088aee9>] cache_free_debugcheck+0x137/0x1f5
     [<c088ba14>] kfree+0x9d/0xd2
     [<c0802f22>] syscall_call+0x7/0xb
    df271338: redzone 1:0xd84156c5635688c0, redzone 2:0x4141414141414141.

    Slab corruption: size-64 start=df271398, len=64
    Redzone: 0x4141414141414141/0x9f911029d74e35b.
    Last user: [<c08d1da5>](free_rb_tree_fname+0x38/0x6f)
    000: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41
    010: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41
    020: 41 41 41 41 41 41 41 41 6b 6b 6b 6b 6b 6b 6b 6b
    Prev obj: start=df271340, len=64
    Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
    Last user: [<c08d1e55>](ext3_htree_store_dirent+0x34/0x124)
    000: 48 8e 78 08 3b 49 86 3d a8 1f 27 df e0 10 27 df
    010: a8 14 27 df 00 00 00 00 62 d3 03 00 0c 01 75 64
    Next obj: start=df2713f0, len=64
    Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
    Last user: [<c08d1da5>](free_rb_tree_fname+0x38/0x6f)
    000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
    010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

The trail of 0x6B bytes can be observed in the output above. This is the SLAB_POISON feature. Poisoning is the approach that will be described in the next subsection. It basically consists of overwriting the object contents with a known value, to detect modifications after release or use of uninitialized memory. The values are defined (like the redzone ones) in include/linux/poison.h:

    #define POISON_INUSE    0x5a
    #define POISON_FREE     0x6b
    #define POISON_END      0xa5

KERNHEAP validates the cache guards in the allocation and release related functions. This allows detection of corruption in the chain of guards, and results in a system halt and a stack dump.

The safety checks are triggered from kfree(), kmem_cache_free(), kmem_cache_destroy() and other places. Additional checkpoints are being considered, since taking the wrong approach could lead to TOCTOU issues, again depending on the allocator.

In SLUB, merging is disabled to avoid the potentially detrimental effects (to security) of this feature. This might kill one of the most attractive points of SLUB, but merging comes at the cost of letting objects be neighbors to other objects which would otherwise have been placed elsewhere, out of reach, making overflows more likely to be exploitable.
Even with guard checks in place, this is still a scenario to be avoided.

One additional change, first introduced by PaX, is to change the address of the ZERO_SIZE_PTR. In the mainline kernel, this address points to 0x00000010. An address reachable from userland is clearly a bad idea in security terms, and PaX wisely solves this by setting it to 0xfffffc00 and modifying the ZERO_OR_NULL_PTR macro. This protects against a situation in which kmalloc is called with a zero size (for example, due to an integer overflow in a length parameter) and the returned pointer is used to read or write information from or to userland.

---[ 3.2 Detection of arbitrary free pointers and freelist corruption

In the history of heap-related memory corruption vulnerabilities, a more obscure class of flaws has long been known, albeit less publicized: arbitrary pointer and double free issues.

The idea is simple: a programming mistake leads to an exploitable condition in which the state of the heap allocator can be made inconsistent when an already freed object is released again, or an arbitrary pointer is passed to the free function. This is a strictly allocator internals-dependent scenario, but generally the goal is to control a function pointer (for example, a constructor/destructor function used for object initialization, which is later called) or to obtain a write-n primitive (a single byte, four bytes and so forth).

In practice, these vulnerabilities can pose a true challenge for exploitation, since thorough knowledge of the allocator and the state of the heap is required. Manipulating the free list (known as the freelist in the kernel) might leave the heap in an unstable state post-exploitation and thwart cleanup efforts or graceful returns. In addition, another thread might try to access it or perform operations (such as an allocation) which yield a page fault.
In an environment with 2.6.29.2 (grsecurity patch applied, full PaX feature set enabled except for KERNEXEC, RANDKSTACK and UDEREF) and the SLAB allocator, the following scenarios could be observed:

 1. An object is allocated and, shortly afterwards, released via kfree(). Another allocation follows, and a pointer referencing the previous allocation is passed to kfree(). The newly allocated object is therefore released instead, due to the LIFO nature of the allocator.

    void *a = kmalloc(64, GFP_KERNEL);
    foo_t *b = (foo_t *) a;

    /* ... */
    kfree(a);
    a = kmalloc(64, GFP_KERNEL);
    /* ... */
    kfree(b);

 2. An object is allocated, and two successive calls to kfree() take place with no allocation in-between.

    void *a = kmalloc(64, GFP_KERNEL);
    foo_t *b = (foo_t *) a;

    kfree(a);
    kfree(b);

In both cases we are releasing an object twice, but the state of the allocator changes slightly. Also, there could be more than just a single allocation in-between (for example, if this condition existed within filesystem or network stack code), leading to less predictable results. The more obvious result of the first scenario is corruption of the freelist, and a potential information leak or arbitrary access to memory in the second (for instance, if an attacker could force a new allocation before the incorrectly released object is used, he could control the information stored there).

The following output can be observed in a system using the SLAB allocator with its debugging facilities enabled:

 slab error in verify_redzone_free(): cache `size-64': double free detected
 Pid: 4078, comm: insmod Not tainted 2.6.29.2-grsec #1
 Call Trace:
  [<c0889a81>] __slab_error+0x1a/0x1c
  [<c088aee9>] cache_free_debugcheck+0x137/0x1f5
  [<c088ba14>] kfree+0x9d/0xd2
  [<c0802f22>] syscall_call+0x7/0xb
 df2e42e0: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
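The two scenarios can be replayed against a toy LIFO freelist in user space. This is only a rough model under simplifying assumptions: the ToyCache class and its methods are hypothetical, objects are plain integers, and no real allocator metadata is involved.

```python
class ToyCache:
    """User-space toy model of a LIFO per-cache freelist (not kernel code)."""

    def __init__(self, nobjs: int):
        # Free object indexes; the last element is handed out first.
        self.freelist = list(range(nobjs))

    def kmalloc(self) -> int:
        return self.freelist.pop()

    def kfree(self, obj: int) -> None:
        # No validation at all, like an allocator without safety checks.
        self.freelist.append(obj)

# Scenario 1: a stale pointer is freed after the slot has been reused.
c1 = ToyCache(4)
a = c1.kmalloc()
b = a                   # stale alias of the first allocation
c1.kfree(a)
a = c1.kmalloc()        # LIFO: the very same object comes back
c1.kfree(b)             # releases the object that 'a' still uses
victim = c1.kmalloc()   # an unrelated user now shares a's object
print(a == b, victim == a)   # True True

# Scenario 2: two frees in a row queue the object on the freelist twice.
c2 = ToyCache(4)
x = c2.kmalloc()
c2.kfree(x)
c2.kfree(x)
first, second = c2.kmalloc(), c2.kmalloc()
print(first == second)       # True
```

In this model the first scenario ends with two users sharing one live object, and the second leaves the same object queued twice on the freelist, so two later allocations alias each other.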
The debugging facilities of SLAB and SLUB provide a redzone-based approach to detect the first scenario, but introduce a performance impact while being useless security-wise, since the system won't halt and the state of the allocator will be left unstable. Therefore, their value is only informational and useful for debugging purposes, not as a security measure. The redzone values are also static.

The other approach taken by the debugging facilities is poisoning, as mentioned in the previous subsection. An object is 'poisoned' with a value, which can be checked at different places to detect if the object is being used uninitialized or post-release. This rudimentary but effective method is implemented upstream in a manner which makes it inefficient for security purposes.

Currently, upstream poisoning is clearly oriented to debugging. It writes a single-byte pattern over the whole object space, marking the end with a known value. This incurs a significant performance impact.

KERNHEAP performs the following safety checks at the time of this writing:

 1. During cache destruction:

    a) The guard value is verified.

    b) The entire cache is walked, verifying the freelists for potential corruption. Reference counters, guards, validity of pointers and other structures are checked. If any mismatch is found, a system halt ensues.

    c) The pointer to the cache itself is changed to ZERO_SIZE_PTR. This should not affect any well-behaved (that is, not broken) kernel code.

 2. After a successful kfree, a word value is written to the memory and the pointer location is changed to ZERO_SIZE_PTR. This will trigger a distinctive page fault if the pointer is accessed again somewhere. Currently this operation could be invasive for drivers or code with dubious coding practices.

 3. During allocation, if the word value at the start of the to-be-returned object doesn't match our post-free value, a system halt ensues.

The object-level guard values (equivalent to the redzoning) are calculated at runtime.
This deters bypassing of the checks via fake objects, resulting from a slab overflow scenario. It does introduce a low performance impact on setup and verification, minimized by the use of inline functions, instead of external definitions like those used for some of the more general cache checks. The effectiveness of the reference counter checks is orthogonal to the deployment of PaX's REFCOUNT, which protects many object reference counters against overflows (including SLAB/SLUB). Safe unlinking is enforced in all LIST_HEAD based linked lists, which obviously includes the partial/empty/full lists for SLAB and several other structures (including the freelists) in other allocators. If a corrupted entry is being unlinked, a system halt is forced. The values used for list pointer poisoning have been changed to point non-userland-reachable addresses (this change has been taken from PaX). The use-after-free and double-free detection mechanisms in KERNHEAP are still under development, and it's very likely that substantial design changes will occur after the release of this paper. ---[ 3.3 Overview of NetBSD and OpenBSD kernel heap safety checks At the moment KERNHEAP exclusively covers the Linux kernel, but it is interesting to observe the approaches taken by other projects to detect kernel heap integrity issues. In this section we will briefly analyze the NetBSD and OpenBSD kernels, which are largely the same code base in regards of kernel malloc implementation and diagnostic checks. Both currently implement rudimentary but effective measures to detect use-after-free and double-free scenarios, albeit these are only enabled as part of the DIAGNOSTIC and DEBUG configurations. The following source code is taken from NetBSD 4.0 and should be almost identical to OpenBSD. 
Their approach to detect use-after-free relies on copying a known 32-bit value (WEIRD_ADDR, from kern/kern_malloc.c): /* * The WEIRD_ADDR is used as known text to copy into free objects so * that modifications after frees can be detected. */ #define WEIRD_ADDR ((uint32_t) 0xdeadbeef) ... void *malloc(unsigned long size, struct malloc_type *ksp, int flags) ... { ... #ifdef DIAGNOSTIC /* * Copy in known text to detect modification * after freeing. */ end = (uint32_t *)&cp[copysize]; for (lp = (uint32_t *)cp; lp < end; lp++) *lp = WEIRD_ADDR; freep->type = M_FREE; #endif /* DIAGNOSTIC */ The following checks are the counterparts in free(), which call panic() when the checks fail, causing a system halt (this obviously has a better security benefit than just the information approach taken by Linux's SLAB diagnostics): #ifdef DIAGNOSTIC ... if (__predict_false(freep->spare0 == WEIRD_ADDR)) { for (cp = kbp->kb_next; cp; cp = ((struct freelist *)cp)->next) { if (addr != cp) continue; printf("multiply freed item %p\n", addr); panic("free: duplicated free"); } } ... copysize = size < MAX_COPY ? size : MAX_COPY; end = (int32_t *)&((caddr_t)addr)[copysize]; for (lp = (int32_t *)addr; lp < end; lp++) *lp = WEIRD_ADDR; freep->type = ksp; #endif /* DIAGNOSTIC */ Once the object is released, the 32-bit value is copied, along the type information to detect the potential origin of the problem. This should be enough to catch basic forms of freelist corruption. It's worth noting that the freelist_sanitycheck() function provides integrity checking for the freelist, but is enclosed in an ifdef 0 block. The problem affecting these diagnostic checks is the use of known values, as much as Linux's own SLAB redzoning and poisoning might be easily bypassed in a deliberate attack scenario. It still remains slightly more effective due to the system halt enforcing upon detection, which isn't present in Linux. 
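The poisoning logic above can be modelled in user space as follows. This is a simplified sketch, not a port of kern_malloc.c: the ToyObject class, the helper names, the value chosen for MAX_COPY and the message raised on allocation are assumptions, and a Python exception stands in for panic().

```python
import struct

WEIRD_ADDR = 0xDEADBEEF
MAX_COPY = 32   # assumed cap on the number of poisoned bytes in this model

class ToyHeapError(Exception):
    """Stands in for panic() in this user-space model."""

class ToyObject:
    def __init__(self, size: int):
        self.data = bytearray(size)

def _nwords(obj: ToyObject) -> int:
    return min(len(obj.data), MAX_COPY) // 4

def poison(obj: ToyObject) -> None:
    """Fill the start of a freed object with the known 32-bit pattern."""
    n = _nwords(obj)
    obj.data[:n * 4] = struct.pack("<%dI" % n, *([WEIRD_ADDR] * n))

def toy_free(obj: ToyObject, freelist: list) -> None:
    # A poisoned first word is only a hint; confirm by walking the freelist.
    first_word = struct.unpack_from("<I", obj.data)[0]
    if first_word == WEIRD_ADDR and any(o is obj for o in freelist):
        raise ToyHeapError("free: duplicated free")
    poison(obj)
    freelist.append(obj)

def toy_malloc(freelist: list) -> ToyObject:
    obj = freelist.pop()
    words = struct.unpack_from("<%dI" % _nwords(obj), obj.data)
    if any(w != WEIRD_ADDR for w in words):
        raise ToyHeapError("data modified while on freelist")
    return obj

freelist = []
obj = ToyObject(64)
toy_free(obj, freelist)
try:
    toy_free(obj, freelist)      # double free
except ToyHeapError as err:
    print(err)
obj.data[0:4] = b"AAAA"          # write after free
try:
    toy_malloc(freelist)
except ToyHeapError as err:
    print(err)
```

Because WEIRD_ADDR is a fixed, public value, an attacker who controls the overwritten bytes can keep the pattern intact, which is the weakness of known-value checks discussed above.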
Other sanity checks are done with the reference counters in free():

    if (ksp->ks_inuse == 0)
        panic("free 1: inuse 0, probable double free");

And validating (with a simple address range test) if the pointer being freed looks sane:

    if (__predict_false((vaddr_t)addr < vm_map_min(kmem_map) ||
        (vaddr_t)addr >= vm_map_max(kmem_map)))
        panic("free: addr %p not within kmem_map", addr);

Ultimately, users of either NetBSD or OpenBSD might want to enable KMEMSTATS or DIAGNOSTIC configurations to provide basic protection against heap corruption in those systems.

---[ 3.4 Microsoft Windows 7 kernel pool allocator safe unlinking

On 26 May 2009, a suspiciously timed article was published by Peter Beck from the Microsoft Security Engineering Center (MSEC) Security Science team, about the inclusion of safe unlinking into the Windows 7 kernel pool (the equivalent to the slab allocators in Linux).

This has received a great deal of publicity for a change which amounts to two lines of effective code, and surprisingly enough, was already present in non-retail versions of Vista. In addition, safe unlinking has been present in other heap allocators for a long time: in the GNU libc since at least 2.3.5 (proposed by Stefan Esser originally to Solar Designer for the Owl libc) and the Linux kernel since 2006 (CONFIG_DEBUG_LIST).

While it is out of scope for this paper to explain the internals of the Windows kernel pool allocator, this section will provide a short overview of it. For true insight, the slides by Kostya Kortchinsky, "Exploiting Kernel Pool Overflows" [14], provide a thorough look at it from a sound security perspective.

The allocator is very similar to SLAB and the API to obtain allocations and release them is straightforward (nt!ExAllocatePool(WithTag), nt!ExFreePool(WithTag) and so forth). The default pools (sort of a kmem_cache equivalent) are the (two) paged, non-paged and session paged ones. The non-paged pool serves allocations that must stay in physical memory, and the paged pools serve pageable memory.
The structure defining a pool can be seen below:

 kd> dt nt!_POOL_DESCRIPTOR
   +0x000 PoolType         : _POOL_TYPE
   +0x004 PoolIndex        : Uint4B
   +0x008 RunningAllocs    : Uint4B
   +0x00c RunningDeAllocs  : Uint4B
   +0x010 TotalPages       : Uint4B
   +0x014 TotalBigPages    : Uint4B
   +0x018 Threshold        : Uint4B
   +0x01c LockAddress      : Ptr32 Void
   +0x020 PendingFrees     : Ptr32 Void
   +0x024 PendingFreeDepth : Int4B
   +0x028 ListHeads        : [512] _LIST_ENTRY

The most important member in the structure is ListHeads, which contains 512 linked lists holding the free chunks. The granularity of the allocator is 8 bytes for Windows XP and up, and 32 bytes for Windows 2000. The maximum allocation size possible is 4080 bytes. LIST_ENTRY is exactly the same as LIST_HEAD in Linux.

Each chunk contains an 8-byte header. The chunk header is defined as follows for Windows XP and up:

 kd> dt nt!_POOL_HEADER
   +0x000 PreviousSize     : Pos 0, 9 Bits
   +0x000 PoolIndex        : Pos 9, 7 Bits
   +0x002 BlockSize        : Pos 0, 9 Bits
   +0x002 PoolType         : Pos 9, 7 Bits
   +0x000 Ulong1           : Uint4B
   +0x004 ProcessBilled    : Ptr32 _EPROCESS
   +0x004 PoolTag          : Uint4B
   +0x004 AllocatorBackTraceIndex : Uint2B
   +0x006 PoolTagHash      : Uint2B

PreviousSize contains the value of the BlockSize of the previous chunk, or zero if it's the first. This value could be checked during unlinking for additional safety, but this isn't the case (their checks are limited to the validity of the prev/next pointers relative to the entry being deleted). PoolType is zero if the chunk is free, and PoolTag contains four printable characters to identify the user of the allocation. The tag is neither authenticated nor verified in any way, therefore it is possible to provide a bogus tag to one of the allocation or free APIs.

For small allocations, the pool allocator uses lookaside caches, with a maximum BlockSize of 256 bytes.

Kostya's approach to abuse pool allocator overflows involves the classic write-4 primitive through unlinking of a fake chunk under his control.
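To see why unlinking a corrupted entry gives a write primitive, and why checking the neighbours' back-pointers stops it, consider the following toy model. It is not Windows code: memory is a Python dictionary mapping made-up addresses to flink/blink pairs, and all names and addresses are assumptions for illustration.

```python
class CorruptList(Exception):
    """Stands in for a bug check or system halt in this toy model."""

def make_list(mem: dict, addrs: list) -> None:
    """Link the given addresses into a circular doubly linked list."""
    n = len(addrs)
    for i, a in enumerate(addrs):
        mem[a] = {"flink": addrs[(i + 1) % n], "blink": addrs[i - 1]}

def unlink(mem: dict, entry: int) -> None:
    """Classic unlink: two blind writes through the entry's own pointers."""
    flink, blink = mem[entry]["flink"], mem[entry]["blink"]
    mem[blink]["flink"] = flink
    mem[flink]["blink"] = blink

def safe_unlink(mem: dict, entry: int) -> None:
    """Unlink only if both neighbours still point back at the entry."""
    flink, blink = mem[entry]["flink"], mem[entry]["blink"]
    if mem[flink]["blink"] != entry or mem[blink]["flink"] != entry:
        raise CorruptList("corrupted list entry at %#x" % entry)
    unlink(mem, entry)

def corrupted_heap() -> dict:
    mem = {}
    make_list(mem, [0x100, 0x200, 0x300])
    # Two unrelated cells the list code should never touch.
    mem[0x800] = {"flink": 0, "blink": 0}
    mem[0x900] = {"flink": 0, "blink": 0}
    # An overflow replaces the pointers of the entry at 0x200.
    mem[0x200] = {"flink": 0x800, "blink": 0x900}
    return mem

mem = corrupted_heap()
unlink(mem, 0x200)
print(hex(mem[0x900]["flink"]))   # 0x800: a chosen value written to a chosen cell

mem = corrupted_heap()
try:
    safe_unlink(mem, 0x200)
except CorruptList as err:
    print(err)                    # corrupted list entry at 0x200
```

The check only proves that the two neighbours point back at the entry, not that they are genuine list members, which is the limitation of this approach.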
For further information about the allocator internals, please refer to his excellent slides [14].

The minimal change introduced by Microsoft to enable safe unlinking in Windows 7 was already present in Vista non-retail builds, thus it is likely that the announcement was merely a marketing exercise. Furthermore, Beck states that this allows detection of "memory corruption at the earliest opportunity", which isn't necessarily correct, since a more complete solution could have been pursued (for example, verifying that pointers belong to actual freelist chunks). Such checks might incur a higher performance overhead, but provide far more consistent protection.

The affected API is RemoveEntryList(), and the result of unlinking an entry with incorrect prev/next pointers will be a BugCheck:

    Flink = Entry->Flink;
    Blink = Entry->Blink;
    if (Flink->Blink != Entry) KeBugCheckEx(...);
    if (Blink->Flink != Entry) KeBugCheckEx(...);

It's unlikely that there will be further changes to the pool allocator for Windows 7, but there's still time for this to change before the release date.

------[ 4. Sanitizing memory of the look-aside caches

The objects and data contained in slabs allocated within the kmem caches could be of a sensitive nature, including but not limited to: cryptographic secrets, PRNG state information, network information, userland credentials and potentially useful internal kernel state information to leverage an attack (including our guards or cookie values).

In addition, neither kfree() nor kmalloc() zeroes memory, thus allowing the information to stay there for an indefinite time, unless it is overwritten after the space is claimed in an allocation procedure. This is a security risk by itself, since an attacker could essentially rely on this condition to "spray" the kernel heap with his own fake structures or machine instructions to further improve the reliability of his attack.

PaX already provides a feature to sanitize memory upon release, at a performance cost of roughly 3%.
This is an opt-all policy, thus it is not possible to choose in a fine-grained manner what memory is sanitized and what isn't. Also, it works at the lowest level possible, the page allocator. While this is a safe approach and ensures that all allocated memory is properly sanitized, it is desirable to be able to opt in voluntarily to have your newly allocated memory treated as sensitive.

Hence, a GFP_SENSITIVE flag has been introduced. While a security conscious developer could zero memory on his own, the availability of a flag to assure this behavior (as well as other enhancements and safety checks) is convenient. Also, the performance cost is negligible, if any, since the flag can be applied to specific allocations or to entire caches.

The low level page allocator uses a PG_sensitive flag internally, with the associated SetPageSensitive, ClearPageSensitive and PageSensitive macros. These changes have been introduced in the linux/page-flags.h header and mm/page_alloc.c.

 SLAB / kmalloc layer              Low-level page allocator
 include/linux/slab.h              include/linux/page-flags.h

 +----------------.                +--------------+
 | SLAB_SENSITIVE |              ->| PG_sensitive |
 +----------------.              | +--------------+
        |                        |   |-> SetPageSensitive
        |     +---------------+  |   |-> ClearPageSensitive
        \---> | GFP_SENSITIVE |-/    |-> PageSensitive
              +---------------+      ...

This will prevent the aforementioned leak of information post-release, and provide an easy to use mechanism for third-party developers to take advantage of the additional assurance provided by this feature.

In addition, another loophole that has been removed is related to situations in which successive allocations are done via kmalloc(), and the information is still accessible through the newly allocated object. This happens when the slab is never released back to the page allocator, since slabs can live for an indefinite amount of time (there's no assurance as to when the cache will go through shrinkage or reaping).
Upon release, the cache can be checked for the SLAB_SENSITIVE flag, the page can be checked for the PG_sensitive bit, and the allocation flags can be checked for GFP_SENSITIVE. Currently, the following interfaces have been modified to operate with this flag when appropriate: - IPC kmem cache - Cryptographic subsystem (CryptoAPI) - TTY buffer and auditing API - WEP encryption and decryption in mac80211 (key storage only) - AF_KEY sockets implementation - Audit subsystem The RBAC engine in grsecurity can be modified to add support for enabling the sensitive memory flag per-process. Also, a group id based check could be added, configurable via sysctl. This will allow fine-grained policy or group based deployment of the current and future benefits of this flag. SELinux and any other policy based security frameworks could benefit from this feature as well. This patchset has been proposed to the mainline kernel developers as of May 21st 2009 (see http://patchwork.kernel.org/patch/25062). It received feedback from Alan Cox and Rik van Riel and a different approach was used after some developers objected to the use of a page flag, since the functionality can be provided to SLAB/SLUB allocators and the VMA interfaces without the use of a page flag. Also, the naming changed to CONFIDENTIAL, to avoid confusion with the term 'sensitive'. Unfortunately, without a page bit, it's impossible to track down what pages shall be sanitized upon release, and provide fine-grained control over these operations, making the gfp flag almost useless, as well as other interesting features, like sanitizing pages locked via mlock(). The mainline kernel developers oppose the introduction of a new page flag, even though SLUB and SLOB introduced their own flags when they were merged, and this wasn't frowned upon in such cases. Hopefully this will change in the future, and allow a more complete approach to be merged in mainline at some point. 
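The release-time checks described in this section (cache flag, page bit, allocation flag) can be sketched with a small user-space model. Everything here is an assumption for illustration: the flag values, the ToySlabCache class and the per-object set that stands in for the PG_sensitive page bit.

```python
SLAB_SENSITIVE = 0x1   # hypothetical flag values for this model
GFP_SENSITIVE = 0x2

class ToySlabCache:
    """Toy cache whose objects live in one long-lived backing buffer."""

    def __init__(self, obj_size: int, nobjs: int, flags: int = 0):
        self.flags = flags
        self.obj_size = obj_size
        self.mem = bytearray(obj_size * nobjs)
        self.free = list(range(nobjs))
        self.sensitive = set()   # stands in for the PG_sensitive page bit

    def alloc(self, gfp: int = 0) -> int:
        idx = self.free.pop()
        if gfp & GFP_SENSITIVE:
            self.sensitive.add(idx)
        return idx

    def view(self, idx: int) -> memoryview:
        start = idx * self.obj_size
        return memoryview(self.mem)[start:start + self.obj_size]

    def release(self, idx: int) -> None:
        # Sanitize when either the cache or the allocation was marked.
        if (self.flags & SLAB_SENSITIVE) or idx in self.sensitive:
            self.view(idx)[:] = bytes(self.obj_size)
            self.sensitive.discard(idx)
        self.free.append(idx)

cache = ToySlabCache(obj_size=16, nobjs=4)

key = cache.alloc(GFP_SENSITIVE)
cache.view(key)[:6] = b"secret"
cache.release(key)
reused = cache.alloc()
print(bytes(cache.view(reused)) == bytes(16))     # True: contents were zeroed

cache.view(reused)[:6] = b"public"
cache.release(reused)                             # not marked, left as-is
again = cache.alloc()
print(bytes(cache.view(again)[:6]) == b"public")  # True: stale data survives
```

The second half of the example shows the loophole mentioned earlier: without sanitization, the next allocation from the same slab still sees the previous contents.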
Despite the fact that Ingo Molnar, Pekka Enberg and Peter Zijlstra completely missed the point about the initially proposed patches, new ones performing selective sanitization were sent following up their recommendations of a completely flawed approach. This case serves as a good example of how kernel developers without security knowledge nor experience take decisions that negatively impact conscious users of the Linux kernel as a whole. Hopefully, in order to provide a reliable protection, the upstream approach will finally be selective sanitization using kzfree(), allowing us to redefine it to kfree() in the appropriate header file, and use something that actually works. Fixing a broken implementation is an undesirable burden often found when dealing with the 2.6 branch of the kernel, as usual. ------[ 5. Deterrence of IPC based kmalloc() overflow exploitation In addition to the rest of the features which provide a generic protection against common scenarios of kernel heap corruption, a modification has been introduced to deter a specific local attack for abusing kmalloc() overflows successfully. This technique is currently the only public approach to kernel heap buffer overflow exploitation and relies on the following circumstances: 1. The attacker has local access to the system and can use the IPC subsystem, more specifically, create, destroy and perform operations on semaphores. 2. The attacker is able to abuse a allocate-overflow-free situation which can be leveraged to overwrite adjacent objects, also allocated via kmalloc() within the same kmem cache. 3. The attacker can trigger the overflow in the right timing to ensure that the adjacent object overwritten is under his control. In this case, the shmid_kernel structure (used internally within the IPC subsystem), leading to a userland pointer dereference, pointing at attacker controlled structures. 4. 
Ultimately, when these attacker controlled structures are used by the IPC subsystem, a function pointer is called. Since the attacker controls this information, this is essentially a game-over scenario. The kernel will execute arbitrary code of the attacker's choice and this will lead to elevation of privileges.

Currently, PaX UDEREF [8] on x86 provides solid protection against (3) and (4). The attacker will be unable to force the kernel into executing instructions located in the userland address space. A specific class of vulnerabilities, kernel NULL pointer dereferences (which were, for a long time, overlooked and not considered exploitable by most of the public players in the security community, with few exceptions), was mostly eradicated (thanks to both UDEREF and further restrictions imposed on mmap(), later implemented by Red Hat and accepted into mainline, albeit containing flaws which made the restriction effectively useless).

On systems where using UDEREF is unbearable for performance or functionality reasons (for example, virtualization), a workaround to harden the IPC subsystem was necessary. Hence, a set of simple safety checks was devised for the shmid_kernel structure, and the allocation helper functions have been modified to use their own private cache.

The function pointer verification checks whether the pointers located within the file structure are actually addresses within the kernel text range (including modules).

The internal allocation procedures of the IPC code make use of both vmalloc() and kmalloc(), for sizes greater than a page or lower than a page, respectively. Thus, the size for the cache objects is PAGE_SIZE, which might be suboptimal in terms of memory space, but does not impact performance.

These changes have been tested using the IBM ipc_stress test suite distributed in the Linux Test Project sources (which can be obtained from http://ltp.sourceforge.net), with successful results.

------[ 6.
Prevention of copy_to_user() and copy_from_user() abuse

A vast number of kernel vulnerabilities involving information leaks to userland, as well as buffer overflows when copying data from userland, are caused by signedness issues (meaning integer overflows, reference counter overflows, et cetera). The common scenario is an invalid integer passed to the copy_to_user() or copy_from_user() functions.

During the development of KERNHEAP, a question was raised about these functions: is there an existing, reliable API which allows retrieval of the target buffer information in both copy-to and copy-from scenarios?

Introducing size awareness in these functions would provide a simple, yet effective method to deter both information leaks and buffer overflows through them. Obviously, like in every security system, the effectiveness of this approach is orthogonal to the deployment of other measures, to prevent potential corner cases and rare situations useful for an attacker to bypass the safety checks.

The current kernel heap allocators (including SLOB) provide a function to retrieve the size of a slab object, as well as a function to test the validity of a pointer by checking whether it's within the known caches (excluding SLOB, which required this function to be written since it's essentially a no-op in upstream sources). These functions are ksize() and kmem_validate_ptr() respectively (in each pertinent allocator source: mm/slab.c, mm/slub.c and mm/slob.c).

In order to detect whether a buffer is stack or heap based in the kernel, the object_is_on_stack() function (from include/linux/sched.h) can be used. The drawback of these functions is the computational cost of looking up the page where this buffer is located, checking its validity wherever applicable (in the case of kmem_validate_ptr() this involves validating against a known cache) and performing other tasks to determine the validity and properties of the buffer.
Nonetheless, the performance impact might be negligible and reasonable for the additional assurance provided with these changes. Brad Spengler devised this idea, developed and introduced the checks into the latest test patches as of April 27th (test10 to test11 from PaX and the grsecurity counterparts for the current kernel stable release, 2.6.29.1). A reliable method to detect stack-based objects is still being considered for implementation, and might require access to meta-data used for debuggers or future GCC built-ins. ------[ 7. Prevention of vsyscall overwrites on x86_64 This technique is used in sgrakkyu's exploit for CVE-2009-0065. It involves overwriting a x86_64 specific location within a top memory allocated page, containing the vsyscall mapping. This mapping is used to implement a high performance entry point for the gettimeofday() system call, and other functionality. An attacker can target this mapping by means of an arbitrary write-N primitive and overwrite the machine instructions there to produce a reliable return vector, for both remote and local attacks. For remote attacks the attacker will likely use an offset-aware approach for reliability, but locally it can be used to execute an offset-less attack, and force the kernel into dereferencing userland memory. This is problematic since presently PaX does not support UDEREF on x86_64 and the performance cost of its implementation could be significant, making abuse a safe bet even against hardened environments. Therefore, contrary to past popular belief, x86_64 systems are more exposed than i386 in this regard. During conversations with the PaX Team, some difficulties came to attention regarding potential approaches to deter this technique: 1. Modifying the location of the vsyscall mapping will break compatibility. Thus, glibc and other userland software would require further changes. See arch/x86/kernel/vmlinux_64.lds.S and arch/x86/kernel/vsyscall_64.c 2. 
The vsyscall page is defined within the ld linker script for x86_64 (arch/x86/kernel/vmlinux_64.lds.S). It is defined by default (as of 2.6.29.3) within the boundaries of the .data section, thus writable for the kernel. The userland mapping is read-execute only.

 3. Removing vsyscall support might have a large performance impact on applications making extensive use of gettimeofday().

 4. Some data has to be written in this region, therefore it can't be permanently read-only.

PaX provides a write-protect mechanism used by KERNEXEC, together with its definition for an actual working read-only .rodata implementation. Moving the vsyscall page into the .rodata section provides reliable protection against this technique. In order to prevent sections from overlapping, some changes had to be introduced, since the section has to be aligned to page size. In non-PaX kernels, .rodata is only protected if the CONFIG_DEBUG_RODATA option is enabled.

The PaX Team solved (4) using pax_open_kernel() and pax_close_kernel() to allow writes temporarily. This has some performance impact, but it is most likely far lower than that of removing vsyscall support completely.

This deters abuse of the vsyscall page on x86_64, and prevents offset-based remote and offset-less local exploits from leveraging a reliable attack against a kernel vulnerability. Nonetheless, protection against this avenue of attack is still a work in progress.

------[ 8. Developing the right regression testsuite for KERNHEAP

Shortly after the initial development process started, it became evident that a decent set of regression tests was required to check if the implementation worked as expected. While using single loadable modules for each test was a straightforward solution, in the long term, having a real tool to perform thorough testing seemed the most logical approach.

Hence, KHTEST has been developed. It's composed of a kernel module which communicates with a userland Python program over Netlink sockets.
The ctypes API is used to handle the low level structures that define commands and replies. The kernel module exposes internal APIs to the userland process, such as:

    - kmalloc
    - kfree
    - memset and memcpy
    - copy_to_user and copy_from_user

Using this interface, allocation and release of kernel memory can be controlled with a simple Python script, allowing efficient development of testcases:

    e = KernHeapTester()
    addr = e.kmalloc(size)
    e.kfree(addr)
    e.kfree(addr)    # second kfree() on the same address: double free

When this test runs on an unprotected 2.6.29.2 system (SLAB as allocator, debugging capabilities enabled) the following output can be observed in the kernel message buffer, with a subsequent BUG on cache reaping:

    KERNHEAP test-suite loaded.
    run_cmd_kmalloc: kmalloc(64, 000000b0) returned 0xDF1BEC30
    run_cmd_kfree: kfree(0xDF1BEC30)
    run_cmd_kfree: kfree(0xDF1BEC30)
    slab error in verify_redzone_free(): cache `size-64': double free detected
    Pid: 3726, comm: python Not tainted 2.6.29.2-grsec #1
    Call Trace:
     [<c0889a81>] __slab_error+0x1a/0x1c
     [<c088aee9>] cache_free_debugcheck+0x137/0x1f5
     [<e082f25c>] ? run_cmd_kfree+0x1e/0x23 [kernheap_test]
     [<c088ba14>] kfree+0x9d/0xd2
     [<e082f25c>] run_cmd_kfree+0x1e/0x23
    kernel BUG at mm/slab.c:2720!
    invalid opcode: 0000 [#1] SMP
    last sysfs file: /sys/kernel/uevent_seqnum
    Pid: 10, comm: events/0 Not tainted (2.6.29.2-grsec #1) VMware Virtual Platform
    EIP: 0060:[<c088ac00>] EFLAGS: 00010092 CPU: 0
    EIP is at slab_put_obj+0x59/0x75
    EAX: 0000004f EBX: df1be000 ECX: c0828819 EDX: c197c000
    ESI: 00000021 EDI: df1bec28 EBP: dfb3deb8 ESP: dfb3de9c
     DS: 0068 ES: 0068 FS: 00d8 GS: 0000 SS: 0068
    Process events/0 (pid: 10, ti=dfb3c000 task=dfb3ae30 task.ti=dfb3c000)
    Stack:
     c0bc24ee c0bc1fd7 df1bec28 df800040 df1be000 df8065e8 df800040 dfb3dee0
     c088b42d 00000000 df1bec28 00000000 00000001 df809db4 df809db4 00000001
     df809d80 dfb3df00 c088be34 00000000 df8065e8 df800040 df8065e8 df800040
    Call Trace:
     [<c088b42d>] ? free_block+0x98/0x103
     [<c088be34>] ? drain_array+0x85/0xad
     [<c088beba>] ? cache_reap+0x5e/0xfe
     [<c083586a>] ? run_workqueue+0xc4/0x18c
     [<c088be5c>] ? cache_reap+0x0/0xfe
     [<c0838593>] ? kthread+0x0/0x59
     [<c0803717>] ? kernel_thread_helper+0x7/0x10

The following code presents a more complex test to evaluate a double-free situation which will put a random kmalloc cache into an unpredictable state:

    import os
    import random

    e = KernHeapTester()
    addrs = []
    kmalloc_sizes = [ 32, 64, 96, 128, 196, 256, 1024, 2048, 4096]

    # Populate several kmalloc caches with 1024 allocations.
    i = 0
    while i < 1024:
        addr = e.kmalloc(random.choice(kmalloc_sizes))
        addrs.append(addr)
        i += 1

    random.seed(os.urandom(32))
    random.shuffle(addrs)

    # Free one random object early; it stays in the list, so the loop
    # below releases it a second time.
    e.kfree(random.choice(addrs))
    random.shuffle(addrs)

    for addr in addrs:
        e.kfree(addr)

On a KERNHEAP protected host:

    Kernel panic - not syncing: KERNHEAP: Invalid kfree() in (objp df38e000) by python:3643, UID:0 EUID:0

The testsuite sources (including both the Python module and the LKM for the 2.6 series, tested with 2.6.29) are included along with this paper. Adding support for new kernel APIs should be a trivial task, requiring only modification of the packet handler and the appropriate addition of a new command structure. Potential improvements include the use of a shared memory page instead of Netlink responses, to avoid impacting the allocator state or conflicting with our tests.

------[ 9. The Inevitability of Failure

In 1998, members (Loscocco, Smalley et al.) of the Information Assurance Group at the NSA published a paper titled "The Inevitability of Failure: The Flawed Assumption of Security in Modern Computing Environments" [12].

The paper explains how modern computing systems lacked the necessary features and capabilities for providing true assurance, to prevent compromise of the information contained in them. As systems were becoming more and more connected to networks, which were growing exponentially, the exposure of these systems grew proportionally. Therefore, the state of the art in security had to progress at a similar pace.
From an academic standpoint, it is interesting to observe that more than 10 years later, the state of the art in security hasn't evolved dramatically, but threats have gone well beyond the initial expectations.

    "Although public awareness of the need for security in computing systems is growing rapidly, current efforts to provide security are unlikely to succeed. Current security efforts suffer from the flawed assumption that adequate security can be provided in applications with the existing security mechanisms of mainstream operating systems. In reality, the need for secure operating systems is growing in today's computing environment due to substantial increases in connectivity and data sharing."
    Page 1, [12]

Most of the authors of this paper were involved in the development of the Flux Advanced Security Kernel (FLASK), at the University of Utah. Flask itself has its roots in the Distributed Trusted Operating System (DTOS), a joint project carried out in 1992 and 1993 by the company then known as Secure Computing Corporation (SCC, acquired by McAfee in 2008) and the National Security Agency. DTOS inherited the development and design ideas of a previous project named DTMach (Distributed Trusted Mach) which aimed to introduce a flexible access control framework into the Mach microkernel. Type Enforcement was first introduced in DTMach, and superseded in Flask by a more flexible design which allowed far greater granularity (supporting mixing of different types of labels, beyond only types, such as sensitivity, roles and domains).

Type Enforcement is a simple concept: Mandatory Access Control (MAC) takes precedence over Discretionary Access Control (DAC) to restrain subjects (processes, users) from accessing or manipulating objects (files, sockets, directories), based on the decision made by the security system according to a policy and the security context attached to the subject.
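The precedence of MAC over DAC can be illustrated with a toy model. This is only a sketch under stated assumptions: the policy table, the type names and the helper functions are hypothetical and do not correspond to any real SELinux or Flask interface. It shows a default-deny lookup keyed by subject and object types, where a DAC grant can never override a MAC denial.

```python
# Hypothetical policy: (subject type, object type) -> permitted operations.
# Anything not listed here is denied.
POLICY = {
    ("httpd_t", "httpd_sys_content_t"): {"read"},
}


def mac_allows(policy, subject_type, object_type, perm):
    """Default deny: only explicitly listed permissions are granted."""
    return perm in policy.get((subject_type, object_type), set())


def access_allowed(policy, subject_type, object_type, perm, dac_ok):
    """Both layers must agree; a DAC grant cannot override a MAC denial."""
    return dac_ok and mac_allows(policy, subject_type, object_type, perm)


if __name__ == "__main__":
    # Allowed by both DAC and the policy.
    print(access_allowed(POLICY, "httpd_t", "httpd_sys_content_t", "read", True))
    # DAC would allow it, but no policy rule exists for this pair.
    print(access_allowed(POLICY, "httpd_t", "shadow_t", "read", True))
```

In this model the second call is refused even though the classic permission bits would have allowed it, which is the behavior the paragraph above describes.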
A subject can undergo a transition from one security context to another (for example, due to a role change) if it's explicitly allowed by the policy. This design allows fine-grained, albeit complex, decision making.

Essentially, MAC means that everything is forbidden unless explicitly allowed by a policy. Moreover, the MAC framework is fully integrated into the system internals in order to catch every possible data access situation and store state information.

The true benefits of these systems could be exercised mostly in military or government environments, where models such as Multi-Level Security (MLS) are far more applicable than for the general public.

Flask was implemented in the Fluke research operating system (using the OSKit framework) and ultimately led to the development of SELinux, a modification of the Linux kernel, initially standalone and ported afterwards to use the Linux Security Modules (LSM) framework when its inclusion into mainline was rejected by Linus Torvalds. Flask is also the basis for TrustedBSD and OpenSolaris FMAC. Apple, although its XNU kernel is largely based on FreeBSD (which includes TrustedBSD modifications since 6.0), decided to implement its own security mechanism (non-MAC) known as Seatbelt, with its own policy language.

While the development of these systems represents a significant step towards more secure operating systems, the real-world perspective is without doubt slightly more bleak. These systems have steep learning curves (their policy languages are powerful but complex, their nature is intrinsically complicated and there's little freely available support for them, plus the communities dedicated to them are fairly small and generally oriented towards development), impose strict restrictions on the system and applications, and in several cases, might be overkill for the average user or administrator.
A security system which requires (expensive, lengthy) specialized training is dramatically prone to being disabled by most of its potential users. This is the reality of SELinux in Fedora and other systems. The default policies aren't realistic and users will need to write their own modules if they want to use custom software. In addition, the solution to this problem was less than optimal: the targeted (now modular) policy was born.

The SELinux targeted policy (used by default in Fedora 10) is essentially a contradiction of the premises of MAC altogether. Most applications run under the unconfined_t domain, while a small set of daemons and other tools run confined under their own domains. While this allows basic, usable security to be deployed (on a related note, XNU Seatbelt follows a similar approach, although unsuccessfully), its effectiveness in stopping determined attackers is doubtful.

For instance, the Apache web server daemon (httpd) runs under the httpd_t domain, and is allowed to access only those files labeled with the httpd_sys_content_t type. In a PHP local file include scenario this will prevent an attacker from loading system configuration files, but won't prevent him from reading passwords from a PHP configuration file which could provide credentials to connect to the back-end database server, and further compromise the system by obtaining any access information stored there. In a relatively more complex scenario, a PHP code execution vulnerability could be leveraged to access the apache process file descriptors, and perhaps abuse a vulnerability to leak memory or inject code to intercept requests. Either way, if an attacker obtains unconfined_t access, it's a game over situation. This is acknowledged in [13], along with an interesting quotation about the managerial decisions that led to the targeted policy being developed:

    "SELinux can not cause the phones to ring"
    "SELinux can not cause our support costs to rise."
    Strict Policy Problems, slide 5. [13]

---[ 9.1 Subverting SELinux and the audit subsystem

Fedora comes with SELinux enabled by default, using the targeted policy. In remote and local kernel exploitation scenarios, disabling SELinux and the audit framework is desirable, or outright necessary if MLS or more restrictive policies are used.

In March 2007, Brad Spengler sent a message to a public mailing-list, announcing the availability of an exploit abusing a kernel NULL pointer dereference (more specifically, an offset from NULL) which disabled all LSM modules atomically, including SELinux. tee42-24tee.c exploited a vulnerability in the tee() system call, which was silently fixed by Jens Axboe from SUSE (as "[patch 25/45] splice: fix problems with sys_tee()").

Its approach to disabling SELinux locally was extremely reliable and simple at the same time. Once the kernel continues execution at the code in userland, using shellcode is unnecessary. This normally applies only to local exploits, and allows offset-less exploitation, resulting in greater reliability. All the LSM disabling logic in tee42-24tee.c is written in C, which can be easily integrated in other local exploits.

The disable_selinux() function has two different stages independent of each other. The first finds the selinux_enabled 32-bit integer, through a linear memory search that looks for a cmp opcode within the selinux_ctxid_to_string() function (defined in selinux/exports.c and present only in older kernels). In current kernels, a suitable replacement is the selinux_string_to_sid() function.

Once the address of selinux_enabled is found, its value is set to zero. This is the first step towards disabling SELinux. Currently, additional targets should be selinux_enforcing (to disable enforcement mode) and selinux_mls_enabled.

The next step is the atomic disabling of all LSM modules.
This stage also relies on finding an old function of the LSM framework, unregister_security(), which replaced the security_ops with dummy_security_ops (a set of default hooks that perform simple DAC without any further checks), provided that the current security_ops matched the ops parameter.

This function has disappeared in current kernels, but setting the security_ops to default_security_ops achieves the same effect, and it should be reasonably easy to find another function to use as reference in the memory search. This change was likely part of the facelift that LSM underwent to remove the possibility of using the framework in loadable kernel modules.

With proper fine-tuning and changes to perform additional opcode checks, it should be just as easy to write SELinux/LSM disabling functionality for recent kernels that works across different architectures.

For remote exploitation, a typical offset-based approach like that used in sgrakkyu's sctp_houdini.c exploit (against x86_64) should be reliable and painless. Simply write a zero value to selinux_enforcing, selinux_enabled and selinux_mls_enabled (although the first is enough). Furthermore, if we already know the address of security_ops and default_security_ops, we can disable LSMs altogether that way too.

If an attacker has enough permissions to control a SCTP listener or run his own, then remote exploitation on x86_64 platforms can be made completely reliable against unknown kernels through the use of the vsyscall exploitation technique, to return control to the attacker-controlled listener at a previously mapped, fixed address of his choice. In this scenario, offset-less SELinux/LSM disabling functionality can be used.
Fortunately, this isn't even necessary since most Linux distributions still ship with world-readable /boot mount points, and their package managers don't do anything to solve this when new kernel packages are installed:

    Ubuntu 8.04 (Hardy Heron)
    -rw-r--r-- 1 root 413K /boot/abi-2.6.24-24-generic
    -rw-r--r-- 1 root  79K /boot/config-2.6.24-24-generic
    -rw-r--r-- 1 root 8.0M /boot/initrd.img-2.6.24-24-generic
    -rw-r--r-- 1 root 885K /boot/System.map-2.6.24-24-generic
    -rw-r--r-- 1 root  62M /boot/vmlinux-debug-2.6.24-24-generic
    -rw-r--r-- 1 root 1.9M /boot/vmlinuz-2.6.24-24-generic

    Fedora release 10 (Cambridge)
    -rw-r--r-- 1 root  84K /boot/config-2.6.27.21-170.2.56.fc10.x86_64
    -rw------- 1 root 3.5M /boot/initrd-2.6.27.21-170.2.56.fc10.x86_64.img
    -rw-r--r-- 1 root 1.4M /boot/System.map-2.6.27.21-170.2.56.fc10.x86_64
    -rwxr-xr-x 1 root 2.6M /boot/vmlinuz-2.6.27.21-170.2.56.fc10.x86_64

Perhaps one easy step, before including complex MAC policy based security frameworks, would be to learn how to use DAC properly. Contact your nearest distribution security officer for more information.

---[ 9.2 Subverting AppArmor

Ubuntu and SUSE decided to bundle AppArmor (aka SubDomain) instead (Novell acquired Immunix in May 2005, only to lay off their developers in September 2007, leaving AppArmor development "open for the community"). AppArmor is completely different from SELinux in both design and implementation.

It uses pathname based security, instead of filesystem object labeling. This represents a significant security drawback in itself, since different policies can apply to the same object when it's accessed by different names, for example through a symlink. In other words, the security decision making logic can be forced into using a less secure policy by accessing the object through a pathname that matches an existing policy.
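The difference between the two designs can be illustrated with a toy model. This is only a sketch: the paths, labels, rule tables and helper names below are hypothetical, and the model ignores how a real implementation resolves names. It merely shows why a decision keyed on the name can differ for the same object, while a decision keyed on the object's label cannot.

```python
# Toy namespace: two names resolve to the same underlying object.
# All paths, labels and rules here are hypothetical.
NAME_TO_OBJECT = {
    "/srv/app/secret.conf": "inode-1",
    "/tmp/alias.conf": "inode-1",
}

# Pathname-based rules: the first matching prefix wins.
PATH_RULES = [
    ("/srv/app/", "deny"),
    ("/tmp/", "allow"),
]

# Label-based rules: the label is attached to the object itself.
OBJECT_LABELS = {"inode-1": "secret_t"}
LABEL_RULES = {"secret_t": "deny"}


def path_decision(path):
    """Decide using only the name through which the object is reached."""
    for prefix, verdict in PATH_RULES:
        if path.startswith(prefix):
            return verdict
    return "deny"


def label_decision(path):
    """Decide using the label of the object the name resolves to."""
    label = OBJECT_LABELS[NAME_TO_OBJECT[path]]
    return LABEL_RULES.get(label, "deny")


if __name__ == "__main__":
    for name in NAME_TO_OBJECT:
        print(name, path_decision(name), label_decision(name))
```

In the model, the second name reaches the same object under the more permissive pathname rule, whereas the label-based lookup gives the same verdict for both names.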
It's been argued that labeling-based approaches are due to requirements of secrecy and information containment, but in practice, security itself equals information containment.

Theory-related discussions aside, this section will provide a basic overview of how AppArmor policy enforcement works, and some techniques that might be suitable in local and remote exploitation scenarios to disable it.

The simplest method to disable AppArmor is to target the 32-bit integers used to determine if it's initialized or enabled. In case the system being targeted runs a stock kernel, the task of accessing these symbols is trivial, although an offset-dependent exploit is certainly suboptimal:

    c03fa7ac D apparmorfs_profiles_op
    c03fa7c0 D apparmor_path_max
        (Determines the maximum length of paths before access is rejected by default)
    c03fa7c4 D apparmor_enabled
        (Determines if AppArmor is currently enabled - used at runtime)
    c04eb918 B apparmor_initialized
        (Determines if AppArmor was enabled at boot time)
    c04eb91c B apparmor_complain
        (The equivalent to SELinux permissive mode, no enforcement)
    c04eb924 B apparmor_audit
        (Determines if the audit subsystem will be used to log messages)
    c04eb928 B apparmor_logsyscall
        (Determines if system call logging is enabled - used at runtime)

A NULL-write primitive suffices to overwrite the values of any of those integers. But for local or shellcode based exploitation, a function exists that can disable AppArmor at runtime, apparmor_disable().
This function is straightforward and reasonably easy to fingerprint:

    0xc0200e60 mov eax,0xc03fad54
    0xc0200e65 call 0xc031bcd0 <mutex_lock>
    0xc0200e6a call 0xc0200110 <aa_profile_ns_list_release>
    0xc0200e6f call 0xc01ff260 <free_default_namespace>
    0xc0200e74 call 0xc013e910 <synchronize_rcu>
    0xc0200e79 call 0xc0201c30 <destroy_apparmorfs>
    0xc0200e7e mov eax,0xc03fad54
    0xc0200e83 call 0xc031bc80 <mutex_unlock>
    0xc0200e88 mov eax,0xc03bba13
    0xc0200e8d mov DWORD PTR ds:0xc04eb918,0x0
    0xc0200e97 jmp 0xc0200df0 <info_message>

It takes a lock to prevent modifications to the profile list, and releases the list. Afterwards, it unloads the apparmorfs and releases the lock, resetting the apparmor_initialized variable. This method is not stealthy by any means. A message will be printed to the kernel message buffer notifying that AppArmor has been unloaded, and the lack of the apparmor directory within /sys/kernel (or the mount-point of the sysfs) can be easily observed.

The apparmor_audit variable should preferably be reset to turn off logging to the audit subsystem (which can be disabled itself as explained in the previous section).

Both AppArmor and SELinux should be disabled together with their logging facilities, since disabling enforcement alone will turn off their effective restrictions, but denied operations will still get recorded. Therefore, it's recommended to reset apparmor_logsyscall, apparmor_audit, apparmor_enabled and apparmor_complain altogether.

Another viable option, albeit slightly more complex, is to target the internals of AppArmor, more specifically, the profile list.
The main data structure related to profiles in AppArmor is 'aa_profile' (defined in apparmor.h):

    struct aa_profile {
        char *name;
        struct list_head list;
        struct aa_namespace *ns;

        int exec_table_size;
        char **exec_table;
        struct aa_dfa *file_rules;
        struct {
            int hat;
            int complain;
            int audit;
        } flags;
        int isstale;

        kernel_cap_t set_caps;
        kernel_cap_t capabilities;
        kernel_cap_t audit_caps;
        kernel_cap_t quiet_caps;

        struct aa_rlimit rlimits;
        unsigned int task_count;

        struct kref count;
        struct list_head task_contexts;
        spinlock_t lock;
        unsigned long int_flags;
        u16 network_families[AF_MAX];
        u16 audit_network[AF_MAX];
        u16 quiet_network[AF_MAX];
    };

The definition in the header file is well commented, thus we will look only at the interesting fields from an attacker's perspective. The flags structure contains relevant fields:

1. audit: checked by the PROFILE_AUDIT macro, used to determine if an event shall be passed to the audit subsystem.

2. hat: checked by the PROFILE_IS_HAT macro, used to determine if this profile is a subprofile ('hat').

3. complain: checked by the PROFILE_COMPLAIN macro, used to determine if this profile is in complain/non-enforcement mode (for example in aa_audit(), from main.c). Events are logged but no policy is enforced.

Of the flags, the immediately useful ones are audit and complain, but the hat flag is interesting nonetheless. AppArmor supports 'hats', subprofiles which are used for transitions from a different profile to enable different permissions for the same subject. A subprofile belongs to a profile and has its hat flag set. This is worth looking at if, for example, altering the hat flag leads to a subprofile being handled differently (e.g. it remains set even though the normal behavior would be to fall back to the original profile). Investigating this possibility in depth is out of the scope of this article.

The task_contexts field holds a list of the tasks confined by the profile (the number of tasks is stored in task_count).
This is an interesting target for overwrites, and a look at the aa_unconfine_tasks() function shows the logic to unconfine all tasks associated with a given profile. The change itself is done by aa_change_task_context() with NULL parameters. Each task has an associated context (struct aa_task_context) which contains references to the applied profile, the magic cookie, the previous profile, its task struct and other information. The task context is retrieved using an inlined function:

    static inline struct aa_task_context *aa_task_context(struct task_struct *task)
    {
        return (struct aa_task_context *) rcu_dereference(task->security);
    }

And after this dissertation on AppArmor internals, the long awaited method to unconfine tasks is revealed: set task->security to NULL. It's that simple, but it would have been unfair to provide the answer without a little analytical effort. It should be noted that this method likely works for most LSM based solutions, unless they specifically handle the case of a NULL security context with a denial response.

The serialized profiles passed to the kernel are unpacked by the aa_unpack_profile() function (defined in module_interface.c).

Finally, these structures are allocated within one of the standard kmem caches, via kmalloc. AppArmor does not use a private cache, therefore it is feasible to reach these structures in a slab overflow scenario.

The approach to abusing AppArmor isn't really different from that for any other kernel security framework, technical details aside.

------[ 10. References

[1] "The Slab Allocator: An Object-Caching Kernel Memory Allocator"
    Jeff Bonwick, Sun Microsystems. USENIX Summer, 1994.
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4759

[2] "Anatomy of the Linux slab allocator"
    M. Tim Jones, Consultant Engineer, Emulex Corp.
    15 May 2007, IBM developerWorks.
    http://www.ibm.com/developerworks/linux/library/l-linux-slab-allocator

[3] "Magazines and vmem: Extending the slab allocator to many CPUs and arbitrary resources"
    Jeff Bonwick, Sun Microsystems.
    In Proc. 2001 USENIX Technical Conference. USENIX Association.
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.97.708

[4] "The Linux Slab Allocator"
    Brad Fitzgibbons, 2000.
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4759

[5] "SLQB - and then there were four"
    Jonathan Corbet, 16 December 2008.
    http://lwn.net/Articles/311502/

[6] "Kmalloc Internals: Exploring Linux Kernel Memory Allocation"
    Sean.
    http://jikos.jikos.cz/Kmalloc_Internals.html

[7] "Address Space Layout Randomization"
    PaX Team, 2003.
    http://pax.grsecurity.net/docs/aslr.txt

[8] In-depth description of PaX UDEREF, the PaX Team.
    http://grsecurity.net/~spender/uderef.txt

[9] "MurmurHash2"
    Austin Appleby, 2007.
    http://murmurhash.googlepages.com

[10] "Attacking the Core : Kernel Exploiting Notes"
    sgrakkyu and twiz, Phrack #64 file 6.
    http://phrack.org/issues.html?issue=64&id=6&mode=txt

[11] "Sysenter and the vsyscall page"
    The Linux kernel. Andries Brouwer, 2003.
    http://www.win.tue.nl/~aeb/linux/lk/lk-4.html

[12] "The Inevitability of Failure: The Flawed Assumption of Security in Modern Computing Environments"
    Peter A. Loscocco, Stephen D. Smalley, Patrick A. Muckelbauer, Ruth C. Taylor, S. Jeff Turner, John F. Farrell.
    In Proceedings of the 21st National Information Systems Security Conference.
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.117.5890

[13] "Targeted vs Strict policy History and Strategy"
    Dan Walsh. 3 March 2005.
    In Proceedings of the 2005 SELinux Symposium.
    http://selinux-symposium.org/2005/presentations/session4/4-1-walsh.pdf

[14] "Exploiting Kernel Pool Overflows"
    Kostya Kortchinsky. 11 June 2008.
    In Proceedings of SyScan'08 Hong Kong.
    http://immunitysec.com/downloads/KernelPool.odp

[15] "When a "potential D.o.S." means a one-shot remote kernel exploit: the SCTP story"
    sgrakkyu. 27 April 2009.
    http://kernelbof.blogspot.com/2009/04/kernel-memory-corruptions-are-not-just.html

------[ 11. Thanks and final statements

    "For there is nothing hid, which shall not be manifested; neither was any thing kept secret, but that it should come abroad."
    Mark IV:XXII

The research and work for KERNHEAP has been conducted by Larry Highsmith of Subreption LLC. Thanks to Brad Spengler, for his contributions to the otherwise collapsing Linux security in the past decade, and to the PaX Team (for the same reason, and their behind-the-iron-curtain support, technical acumen and patience).

Thanks to the editorial staff, for letting me publish this work in a convenient technical channel away from the encumbrances and distractions present in other forums, where facts and truth can't be expressed non-distilled, for those morally obligated to do so. Thanks to sgrakkyu for his feedback, attitude and technical discussions on kernel exploitation.

The decision of SUSE and Canonical to choose AppArmor over more complete solutions like grsecurity will clearly take a toll on their security in the long term. This applies to Fedora and Red Hat Enterprise Linux as well, although SELinux is well suited for federal customers, which are a relevant part of their user base. The problem, though, is the inability of SELinux to contemplate kernel vulnerabilities in its threat model, and the lack of sound and well informed interest in developing such protections from the side of the Linux kernel developers.

Hopefully, as time passes and the current maintainers grow older, younger developers will come to replace them in their management roles. If they get over past mistakes and don't inherit old grudges and conflicts of interest, there's hope the Linux kernel will be more receptive to security patches which actually provide effective protections, for the benefit of the whole community.
Paraphrasing the last words of a character from an Alexandre Dumas novel: until the future deigns to reveal the fate of Linux security to us, all wisdom can be summed up in these two words: Wait and hope. Last but not least, It should be noted that currently no true mechanism exists to enforce kernel security protections, and thus, KERNHEAP and grsecurity could also fall prey to more or less realistic attacks. The requirements to do this go beyond the capabilities of currently available hardware, and Trusted Computing seems to be taking a more DRM-oriented direction, which serves some commercial interests well, but leaves security lagging behind for another ten years. We present the next kernel security technology from yesterday, to be found independently implemented by OpenBSD, Red Hat, Microsoft or all of them at once, tomorrow. "And ye shall know the truth, and the truth shall make you free." John VIII:XXXII ------[ 12. Source code begin 644 kernheap_phrack-66.tgz M'XL(`%U3/$H``^P]:U?;2++SU?H5'<]A(Q%C;&-,$H8YAX#)<,+K8+/#W"Q' M1\AMK$66-)),8'=S?_NMJN[6VV!GD^S,7;09+/6C7EW=5=6OO>6A-^%68`:3 MT+)OUWJ]]1^^]M."9VMS$W_;6YNM[*]Z?FBW.^VMUF:[NP'EVMU6>^,'MOG5 M*:EX9E%LA8S]$+G6`P_GEWLJ_T_ZW);;/^)VR&.76[=-^ZO@P`;N=;OSVK_3 M[6Y1^[>[W?96IPWMW]MH0?NWO@KV)Y[_\O9?7]78*MOS@X?0N9G$3-\S6*?5 M>L,&L^N0!['C>^SH:*_)=EV749&(A3SBX1T?-;'J.0=-B?B(S;P1#UD\X2SF MX31B_I@^,G!.`^ZQ@3\+;<Z.')M[$6?ZX'1P9+"[=K.%X-8U[4?'L]W9B+.? MHGCD^,W)S_DDU[DNIH6.=Y-/&]M>[&*2!LT;.S:S?2^*@<;(N?&`6-?W;L2? 
MP(J!7H_ML-9]M]79[6QVNKUWO79OL[MW<72TG4!P/-?Q.+OSG9&J9(X=U]43 MH/8$-&GU>C9NL,CY!S=CYG+/T/ZIU0I%`H!:<\9,AWRV0H7]L5XFSC"T6@WZ MXBSTL$8`1`+T;:WV:>*X(+N`_<1T2&&O")/!`%5M5:^`Q%8-%AA07U(.,`#< MJYU'<$.1S]IG37.\F$TMQ]/QQ0IO[(;D8A4^[H@]R>WT&E^@'I8<<>@PVR76 MB7[!/,("!C8$V>,`6C{body}gt;Z]#"/`P;K'X163?\+5N)V{body}lt;$"_)GQ^^NV{body}lt;"C%\P M3D7-J[]Y]0:2=?>Q=84TU_B]$^O]R\.A>;![>'1QWA><:#5!'LC`BGU'IRKM M*P-46&_#&`2_^(.%!89LN0Z"A@80)-9_A69GENOZMA5SMC)CUP\QCYB^,CLV MFD20P-7(HR(8V%P[(%"LK8MBB!,%0ED[[`343L@D`$GXH5X7I>N/LZ>(^Q\> M^M`?4%/&T"&MF*T$S6:3`560A*6G?!KQ6"=%;2E:,4.!V/<]CGSDF0Z=V,ET MF-@'P,"SZUX:*?R&RA=UL_V{body}lt;A=$=A!R_C@38RBAR_='00U<S@.$-?9#MC)" MM?&]4030J*6HL1%(A.5T]:G)?L=:V]@)_M/#]#=[*NR_2EH#G8ZCF1/S?],E M7,;_V^CVP/YO];JM9__O>SR+M;_M.MR+OU0-EFK_SM8/+7`%VYO/[?\]GJ7: M_W:"2<W@83D<C[=_N]>!8"]I?WAO=3I;6YUG__][/#^^6)]%X?JUXZUS[XX% M#_'$][1ZO:YE8@)[F9C@I#\\.CSYP"+?ON4Q.-XC\(/L^"gemini - kennedy.gemi.dev #VG$QX[G(("( M6=Z(^1`AA.S:`0`^`'(\RWU;1KRU!G]>L]VC(80-WNR^P<XXN!7LKTTVL.XX MZ.:=]C7#{body}amp;V?1W;H4.FWVG#B1`S^(131#\B3^-`_/_FEOWL&G-\`\Q%"QMZ! 
MH0@[1+ZGTYGGH(<8:>`K877L6=QE4W\T`_?=OT//AN7EU2"YA/SWF0-06>C[ M,0-/Y@[\_1L!*)QY33:<H$,<`S&(%ZGS,<2)0]]EM]*Y-!KLEKPD>.$QN.UW MC@7^_C1PN8;4`:*(C4-_RF;0=BY\(ER`A=4_1<0-`;[F"'LTLV.0+P>'S(X= M$/M#4_O-G[$I-16TJVI.I"L4DD#($8DK1L!"%\#5?8!ZGYQHTB1-TX@(J2)` MGQ_&;%4D2B62B;L'II16@PU.]SZ8Y[N_-I0`S8M!_QQ3,>_(Q+?^$-_-P<G^ MNXL#>CW?^ZM\A=K[[\]WCS4)VX_46T">(P2B;,P_L?<'9VSL6C>19IKP;OZZ M>SBLU6H8+K9;,NWP5*9T5,K!0*:\;FGXC=K2/Z(T/87#_L44@.3U8&``9H[2 M'&>0:TK?S),C\_WYZ<49PGKS)DT_WGU_N">QMB"4L:[M4:;:WO&^.>P/AE0@ MG_SA>/?HZ'0/<]J%G(/S?A_3._GTX_[QH$^@-DH9>V>_848WG[%W>O:;.3S% MG,V*G(/STV/,ZQ4I`WB[>[_TS;WS_NZP#R6VYI40/.RPU_,*$"\@LWGY^R"= M\]/?4`HM[>3H>/`>9'II'O5/DO9NM337N;9KE+"W?W2DU_&[&?G-7MW0--NU MHHC=3NSI:#(*]4{body}lt;0J>9A=QXJ]4@".+N*#(A^OL(<1Q&=3>.76_4;'-&`7># M4K$7E!(#:P1IA40(^6_B"945`;BAU:X4#9X[C6Z>IB$#8P9*O]$I40&I[9Y, M)54L)T?\=TE=%D3@C'*I1>*>H`QHQ^J*#PET9,66(`!#;HC9<^TD<&1;P7P: MD<(`8`O(1"M"NFK0)8C(TH`#Z>)$S*4B):/,IAKSOR86D4%P2<%$:TI5RRM$ M03/_P4-?I&85`FS"S(VSI4N2$O;J&S`Q&J&-?A2WG!7Y+L@+';@H5IPJ$\EB MTHQU*\FU@X=O0.X(M+5,:T3^TD(\E$BUP:$ST<?X!M3.T<"OR03Z(-B&3U`_ MMJ:.^R#&A6@"CD1I[,ZE%D9'D7H3^K,@'6'SH^8S'7DZ/H`W_0O$J4-H:]`M M__KOX)@B*>,1 'B>URK@1+!.RF3AA.[8_"R,`(Q33WB[AB<T<@<\>O9S<Z! MY4;$2`TSF@1#VG9T0?62[TG>8]G[!"^@5A,5>8QU_0`&E6J'M+>YN=$SGBZO MO%997M$882L`F4F#&+F<)L@4<OVH><-C>"]F"^%"B9)KF2\G&A+*I2)(2'!3 M.4$(.=*E[!KL^B'D8SV%8C344D,F#0F":""!]`*\+&P!`=MV_8@3T;70<B!( MZ]_;(H33ZX3,8&,+XJ(1A!+QA*V,ZFR%Z0J807("Z*J%4\#T"60/PQG'&75H M^'(FZ4/"9X!K+R(P:)Z%/(X?SD**<W0@!**P'1RAA8*1M<=*I$PP6