Qualys Security Advisory
System Down: A systemd-journald exploit
========================================================================
Contents
========================================================================
Summary
CVE-2018-16864
- Analysis
- Exploitation
CVE-2018-16865
- Analysis
- Exploitation
CVE-2018-16866
- Analysis
- Exploitation
Combined Exploitation of CVE-2018-16865 and CVE-2018-16866
- amd64 Exploitation
- i386 Exploitation
Acknowledgments
Timeline

Conversion, software version 7.0
-- System of a Down, "Toxicity"
========================================================================
Summary
========================================================================
We discovered three vulnerabilities in systemd-journald
(https://en.wikipedia.org/wiki/Systemd):

- CVE-2018-16864 and CVE-2018-16865, two memory corruptions
  (attacker-controlled alloca()s);

- CVE-2018-16866, an information leak (an out-of-bounds read).

CVE-2018-16864 was introduced in April 2013 (systemd v203) and became
exploitable in February 2016 (systemd v230). We developed a proof of
concept for CVE-2018-16864 that gains eip control on i386.

CVE-2018-16865 was introduced in December 2011 (systemd v38) and became
exploitable in April 2013 (systemd v201). CVE-2018-16866 was introduced
in June 2015 (systemd v221) and was inadvertently fixed in August 2018.
We developed an exploit for CVE-2018-16865 and CVE-2018-16866 that
obtains a local root shell in 10 minutes on i386 and 70 minutes on
amd64, on average. We will publish our exploit in the near future.

To the best of our knowledge, all systemd-based Linux distributions are
vulnerable, but SUSE Linux Enterprise 15, openSUSE Leap 15.0, and Fedora
28 and 29 are not exploitable because their user space is compiled with
GCC's -fstack-clash-protection.

This confirms https://grsecurity.net/an_ancient_kernel_hole_is_not_closed.php:
"It should be clear that kernel-only attempts to solve [the Stack Clash]
will necessarily always be incomplete, as the real issue lies in the
lack of stack probing."
========================================================================
CVE-2018-16864
========================================================================
------------------------------------------------------------------------
Analysis
------------------------------------------------------------------------
The waves all keep on crashing by
-- System of a Down, "Suggestions"
We accidentally discovered CVE-2018-16864 while working on the exploit
for Mutagen Astronomy (CVE-2018-14634); if we pass several megabytes of
command-line arguments to a program that calls syslog(), then journald
crashes:
systemd-journal[472]: segfault at 7ffe9a077420 ip 00007f45f6174877 sp 00007ffe9a0773f0 error 6 in systemd-journald[7f45f6169000+3f000]
(gdb) disassemble 0x7f45f6174877 - 0x7f45f6169000
Dump of assembler code for function dispatch_message_real.4064:
...
0x000000000000b82c <+988>: callq 0x2bd10 <get_process_cmdline.constprop.96>
0x000000000000b831 <+993>: test %eax,%eax
0x000000000000b833 <+995>: js 0xb8ea <dispatch_message_real.4064+1178>
0x000000000000b839 <+1001>: mov -0x218(%rbp),%rbx
0x000000000000b840 <+1008>: test %rbx,%rbx
0x000000000000b843 <+1011>: je 0xd31b <dispatch_message_real.4064+7883>
0x000000000000b849 <+1017>: mov %rbx,%rdi
0x000000000000b84c <+1020>: callq 0x5360 <strlen@plt>
0x000000000000b851 <+1025>: add $0xa,%eax
0x000000000000b854 <+1028>: cltq
0x000000000000b856 <+1030>: add $0x1e,%rax
0x000000000000b85a <+1034>: and $0xfffffffffffffff0,%rax
0x000000000000b85e <+1038>: sub %rax,%rsp
0x000000000000b861 <+1041>: movabs $0x454e494c444d435f,%rax
0x000000000000b86b <+1051>: lea 0x37(%rsp),%r15
0x000000000000b870 <+1056>: and $0xfffffffffffffff0,%r15
0x000000000000b874 <+1060>: test %rbx,%rbx
0x000000000000b877 <+1063>: mov %rax,(%r15)
0x000000000000b87a <+1066>: mov $0x3d,%eax
0x000000000000b87f <+1071>: mov %ax,0x8(%r15)
0x000000000000b884 <+1076>: lea 0x9(%r15),%rax
0x000000000000b888 <+1080>: je 0xb895 <dispatch_message_real.4064+1093>
0x000000000000b88a <+1082>: mov %rbx,%rsi
0x000000000000b88d <+1085>: mov %rax,%rdi
0x000000000000b890 <+1088>: callq 0x5370 <stpcpy@plt>
538 static void dispatch_message_real(
...
604 r = get_process_cmdline(ucred->pid, 0, false, &t);
605 if (r >= 0) {
606 x = strjoina("_CMDLINE=", t);
919 #define strjoina(a, ...) \
920 ({ \
921 const char *_appendees_[] = { a, __VA_ARGS__ }; \
922 char *_d_, *_p_; \
923 int _len_ = 0; \
924 unsigned _i_; \
925 for (_i_ = 0; _i_ < ELEMENTSOF(_appendees_) && _appendees_[_i_]; _i_++) \
926 _len_ += strlen(_appendees_[_i_]); \
927 _p_ = _d_ = alloca(_len_ + 1); \
928 for (_i_ = 0; _i_ < ELEMENTSOF(_appendees_) && _appendees_[_i_]; _i_++) \
929 _p_ = stpcpy(_p_, _appendees_[_i_]); \
930 *_p_ = 0; \
931 _d_; \
932 })
This vulnerability, an attacker-controlled alloca()
(https://wiki.sei.cmu.edu/confluence/display/c/MEM05-C.+Avoid+large+stack+allocations)
at instruction 0xb85e and line 927, was introduced in systemd v203:
commit ae018d9bc900d6355dea4af05119b49c67945184
Date: Mon Apr 22 23:10:13 2013 -0300
...
r = get_process_cmdline(ucred->pid, 0, false, &t);
if (r >= 0) {
- cmdline = strappend("_CMDLINE=", t);
+ cmdline = strappenda("_CMDLINE=", t);
(strappenda() was renamed strjoina() in systemd v219) and became
exploitable in systemd v230:
commit ac2e41f5103ce2c679089c4f8fb6be61d7caec07
Date: Fri Feb 12 04:59:57 2016 -0800
...
This adds a wait flag to journal_file_set_offline(), when false the offline is
performed asynchronously in a separate thread.
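To make the size of this alloca() concrete, the following minimal
sketch reproduces the arithmetic performed by the instructions at
offsets +1025 to +1038 in the disassembly above. The helper name
strjoina_stack_bytes() is hypothetical (it is not part of systemd), and
the sketch ignores the sign extension performed by cltq, which only
matters for lengths close to INT_MAX:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Length of the fixed "_CMDLINE=" prefix that is passed to strjoina(). */
#define CMDLINE_PREFIX_LEN (sizeof("_CMDLINE=") - 1)

/*
 * Hypothetical helper, for illustration only: number of bytes that are
 * subtracted from the stack pointer for a cmdline of the given length.
 */
static size_t strjoina_stack_bytes(size_t cmdline_len)
{
    /* _len_ is an int in strjoina(); stay well inside its range. */
    assert(cmdline_len <= (size_t)INT_MAX - 0x40);

    /* add $0xa,%eax: 9 bytes of prefix plus 1 byte for the '\0'. */
    size_t len = cmdline_len + CMDLINE_PREFIX_LEN + 1;

    /* add $0x1e,%rax ; and $0xfffffffffffffff0,%rax ; sub %rax,%rsp */
    return (len + 0x1e) & ~(size_t)0xf;
}
```

Under these assumptions an empty cmdline costs 32 bytes of stack, and
the cost grows linearly with the length of the sender's cmdline, with
no upper bound other than what the sender can pass as command-line
arguments.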
------------------------------------------------------------------------
Exploitation
------------------------------------------------------------------------
... it's the race
Can you break out?
-- System of a Down, "36"
CVE-2018-16864 is similar to a Stack Clash vulnerability
(https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt), but:
- Steps 1 (Clash the stack with another memory region) and 2 (Run the
stack pointer to the start of the stack) are not needed, because the
attacker-controlled alloca() can be very large (several megabytes of
command-line arguments); only Steps 3 (Jump over the stack guard page,
into another memory region) and 4 (Smash the stack, or another memory
region) are needed.
- In Step 4 (Smash), the alloca() is fully written to (the vulnerability
is essentially a stpcpy(alloca(strlen(cmdline) + 1), cmdline)), and
the stpcpy() (a "wild copy") will therefore always crash into a
read-only or unmapped memory region:
https://googleprojectzero.blogspot.com/2015/03/taming-wild-copy-parallel-thread.html
https://cansecwest.com/slides/2015/Taming%20wild%20copies%20-%20Chris%20evans.pdf
We tried to asynchronously interrupt this stpcpy() before it crashes,
with a signal or a timer, but we failed because journald uses signalfd()
and timerfd_create() to handle these events synchronously.
We eventually gained control of eip (i386's instruction pointer) by
jumping into and smashing the stack of a concurrent thread (a "Parallel
Thread Corruption"):
- First, we send a large, high-priority message (LOG_CRIT or higher) to
journald, from a process whose cmdline is small; this message forces a
large write() (between 1MB and 2MB) to /var/log/journal/ and forces
the creation of a short-lived thread that fsync()s the journal (the
stack of this thread is allocated in the mmap region).
- Next, we create several processes (between 32 and 64) that write() and
fsync() large files (between 1MB and 8MB) to /var/tmp/ (for example);
these processes stall journald's fsync() thread and will allow us to
win a tight race: exploit the "wild copy" before it crashes.
- Last, we send a small, low-priority message to journald, from a
process whose cmdline is very large (roughly 128MB, the distance
between the main stack and the mmap region); this message forces a
very large alloca() that jumps from journald's main stack into the
stack of the fsync() thread, and smashes a saved eip before fsync()
returns from kernel space.
On a Debian stable (9.5), our proof of concept wins this race and gains
eip control after a dozen tries (systemd automatically restarts journald
after each crash):
systemd-journal[2195]: segfault at 41414141 ip 41414141 sp b5f3d22c error 14
Despite this initial success, we abandoned the exploitation of
CVE-2018-16864: while working on our proof of concept, we discovered two
different vulnerabilities (CVE-2018-16865, another attacker-controlled
alloca(), and CVE-2018-16866, an information leak) that are reliably
exploitable on both i386 and amd64.
========================================================================
CVE-2018-16865
========================================================================
------------------------------------------------------------------------
Analysis
------------------------------------------------------------------------
Can you feel their haunting presence?
-- System of a Down, "Holy Mountains"
Surprised by the heavy usage of alloca() in journald, we searched for
another attacker-controlled alloca() and found CVE-2018-16865:
1963 int journal_file_append_entry(JournalFile *f, const dual_timestamp *ts, const struct iovec iovec[], unsigned n_iovec, uint64_t *seqnum, Object **ret, uint64_t *offset) {
....
1986 items = alloca(sizeof(EntryItem) * MAX(1u, n_iovec));
1987
1988 for (i = 0; i < n_iovec; i++) {
1989 uint64_t p;
1990 Object *o;
1991
1992 r = journal_file_append_data(f, iovec[i].iov_base, iovec[i].iov_len, &o, &p);
1993 if (r < 0)
1994 return r;
1995
1996 xor_hash ^= le64toh(o->data.hash);
1997 items[i].object_offset = htole64(p);
1998 items[i].hash = o->data.hash;
1999 }
This vulnerability was introduced in systemd v38:
commit cf244689e9d1ab50082c9ddd0f3c4d1eb982badc
Date: Thu Dec 29 15:00:57 2011 +0100
...
- items = new(EntryItem, n_iovec);
- if (!items)
- return -ENOMEM;
+ items = alloca(sizeof(EntryItem) * n_iovec);
and became exploitable in systemd v201:
commit c4aa09b06f835c91cea9e021df4c3605cff2318d
Date: Mon Apr 8 20:32:03 2013 +0200
...
-#define ENTRY_SIZE_MAX (1024*1024*64)
-#define DATA_SIZE_MAX (1024*1024*64)
...
+#define ENTRY_SIZE_MAX (1024*1024*768)
+#define DATA_SIZE_MAX (1024*1024*768)
We can reach this alloca() by sending a large "native" message to
/run/systemd/journal/socket: since the maximum size of a "native" entry
is 768MB, the minimum length of a "native" item is 3 ("A=\n"), and the
size of an EntryItem structure is 16 (a 64-bit offset and a 64-bit
hash), the maximum size of the attacker-controlled alloca() in
journal_file_append_entry() is 768MB / 3 * 16 = 4GB, large enough to
jump from journald's main stack into the mmap region, even on amd64.
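The 4GB bound can be checked with a few lines of C. This is only a
sketch of the arithmetic: EntryItemSketch is a simplified stand-in for
systemd's EntryItem (two 64-bit fields, as described above), and
MIN_ITEM_LEN and max_items_alloca_bytes() are names introduced here for
illustration:

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_SIZE_MAX (1024ULL * 1024 * 768) /* value from the v201 commit */
#define MIN_ITEM_LEN   3ULL                    /* "A=\n" */

/* Simplified stand-in for EntryItem: a 64-bit offset and a 64-bit hash. */
typedef struct {
    uint64_t object_offset;
    uint64_t hash;
} EntryItemSketch;

static_assert(sizeof(EntryItemSketch) == 16, "two 64-bit fields");

/* Largest alloca() reachable if every item has the minimum length. */
static uint64_t max_items_alloca_bytes(void)
{
    uint64_t max_items = ENTRY_SIZE_MAX / MIN_ITEM_LEN; /* 2^28 items */
    return max_items * sizeof(EntryItemSketch);         /* 2^32 bytes */
}
```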
On amd64, as described in the "64-bit exploitation" section of our Stack
Clash advisory, the randomized distance between the main stack and the
mmap region is shorter than 4GB with a probability of (approximately):
SUM(d = 0; d < 4GB; d++) d / (16GB * 1TB) ~= 1 / 2048
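The sum above has the closed form n * (n - 1) / 2 with n = 4GB, so the
estimate can be reproduced without a loop. The function below is a
sketch that takes the 16GB and 1TB constants from the formula as given;
the name jumpable_probability() is ours:

```c
#include <assert.h>

/*
 * SUM(d = 0; d < n; d++) d, divided by the 16GB * 1TB denominator of
 * the formula above. Every intermediate value is exactly representable
 * as a double for n = 4GB.
 */
static double jumpable_probability(double max_jump_bytes)
{
    const double GB = 1024.0 * 1024.0 * 1024.0;
    const double TB = 1024.0 * GB;

    assert(max_jump_bytes >= 1.0);

    double sum = max_jump_bytes * (max_jump_bytes - 1.0) / 2.0;
    return sum / (16.0 * GB * TB);
}
```

For a maximum jump of 4GB the result is just below 2^-11, that is,
roughly one chance in 2048.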
------------------------------------------------------------------------
Exploitation
------------------------------------------------------------------------
Jump (pogo, pogo, pogo, pogo, pogo, pogo, pogo)
-- System of a Down, "Bounce"
CVE-2018-16865 is basically a simplified Stack Clash vulnerability:
- Steps 1 (Clash) and 2 (Run) of the Stack Clash are not needed, since
the largest attacker-controlled alloca() is 4GB; only Steps 3 (Jump)
and 4 (Smash) are needed.
- In Step 4 (Smash), the alloca() is not necessarily fully written to:
if the size of an item is larger than 128MB (DEFAULT_MAX_SIZE_UPPER),
then journal_file_append_data() returns an error that breaks the "for"
loop in journal_file_append_entry() (at lines 1992-1994) and avoids a
crash into a read-only or unmapped memory region.
We eventually transformed this vulnerability into a crude
"write-what-where" (https://cwe.mitre.org/data/definitions/123.html):
- "write-where": We jump into and smash libc's read-write segment, and
thereby overwrite a function pointer. Unfortunately this "write-where"
is not surgical: the stack frames of the functions called from within
the "for" loop (in journal_file_append_entry()) smash a few kilobytes
below our target function pointer, and therefore overwrite vital libc
variables that may crash or deadlock journald. Consequently, we must
sometimes shift our alloca() jump slightly, to avoid overwriting such
vital variables.
- "write-what": We want to overwrite our target function pointer with
the address of another function or ROP chain, but unfortunately the
stack frames of the functions called from within the "for" loop (in
journal_file_append_entry()) do not contain any data that we control.
However, the 64-bit "hash" values that are written to the alloca()ted
"items" are produced by jenkins_hashlittle2(), a noncryptographic hash
function: we can easily find a short string (a preimage) that hashes
to a given value (the address that will overwrite our target function
pointer) and is also a valid_user_field() (or journal_field_valid()).
This "write-what" restricts our "write-where" to function pointers
whose address modulo 16 is equal to 8 (the offset of "hash" in the
EntryItem structure).
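The alignment restriction follows directly from the layout of the
alloca()ted array. The sketch below assumes a simplified EntryItem (two
64-bit fields) and a 16-byte-aligned array base; hash_field_can_cover()
is a hypothetical helper:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for EntryItem: a 64-bit offset and a 64-bit hash. */
typedef struct {
    uint64_t object_offset;
    uint64_t hash;
} EntryItemSketch;

static_assert(offsetof(EntryItemSketch, hash) == 8, "hash is at offset 8");

/*
 * Returns true if some items[i].hash starts exactly at `target`, for an
 * array of items that starts at `items_base`.
 */
static bool hash_field_can_cover(uint64_t items_base, uint64_t target)
{
    if (target < items_base)
        return false;
    return (target - items_base) % sizeof(EntryItemSketch)
           == offsetof(EntryItemSketch, hash);
}
```

With a 16-byte-aligned base, only targets whose address modulo 16 is
equal to 8 pass this check; every other 8-byte slot receives an
"object_offset" value instead of a "hash" value.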
To complete our exploit, we need the address of journald's stack pointer
before the alloca() jump, and the address of our target function pointer
in libc's read-write segment -- we need an information leak.
========================================================================
CVE-2018-16866
========================================================================
------------------------------------------------------------------------
Analysis
------------------------------------------------------------------------
When they speak, we can peek from the windows of their mouths
-- System of a Down, "Know"
We discovered an out-of-bounds read in journald (CVE-2018-16866), and
transformed it into an information leak:
31 #define WHITESPACE " \t\n\r"
...
194 size_t syslog_parse_identifier(const char **buf, char **identifier, char **pid) {
195 const char *p;
...
197 size_t l, e;
...
203 p = *buf;
204
205 p += strspn(p, WHITESPACE);
206 l = strcspn(p, WHITESPACE);
207
208 if (l <= 0 ||
209 p[l-1] != ':')
210 return 0;
211
212 e = l;
...
240 if (strchr(WHITESPACE, p[e]))
241 e++;
242 *buf = p + e;
243 return e;
244 }
If we send a syslog message to journald (in *buf), and if the last
character of this message is a ':' (before the '\0' terminator), then:
- at line 240, p[e] is the '\0' terminator of our message;
- at line 240, strchr(WHITESPACE, p[e]) returns a pointer to the '\0'
terminator of the WHITESPACE string (as mentioned in man strchr: "The
terminating null byte is considered part of the string, so that if c
is specified as '\0', these functions return a pointer to the
terminator.");
- at line 241, e is incremented;
- at line 242, *buf points out-of-bounds, to the first character after
the '\0' terminator of our message;
- later, the out-of-bounds string at *buf (supposedly the body of our
syslog message) is written (leaked) to the journal.
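The strchr() behavior described above is easy to reproduce in
isolation. The following sketch extracts only the separator-skipping
step (lines 240-241) into two hypothetical helpers: one that mirrors the
vulnerable code, and one with the explicit '\0' check from the August
2018 commit shown below. Neither helper reads past the terminator; they
only compute the offset that *buf would be advanced to:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WHITESPACE " \t\n\r"

/* Mirrors lines 240-241: '\0' matches the terminator of WHITESPACE. */
static size_t skip_separator_v221(const char *p, size_t e)
{
    assert(p != NULL);
    if (strchr(WHITESPACE, p[e]))
        e++;
    return e;
}

/* Same step with the explicit '\0' check from commit 8595102d. */
static size_t skip_separator_checked(const char *p, size_t e)
{
    assert(p != NULL);
    if (p[e] != '\0' && strchr(WHITESPACE, p[e]))
        e++;
    return e;
}
```

For the 9-byte message "infoleak:", the first helper returns 10 (one
byte past the '\0' terminator), whereas the second returns 9.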
Consequently, we can read this out-of-bounds string:
- either directly from the journal (if journald's "Storage" is
"persistent", or "auto" and /var/log/journal/ exists), because
journald supports extended file ACLs (Access Control Lists):
$ id
uid=1000(john) gid=1000(john) groups=1000(john) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
$ ls -l /var/log/journal/*/user-$UID.journal
-rw-r-----+ 1 root systemd-journal 8388608 Nov 20 09:35 /var/log/journal/2562d1eced654f44a3d3a217d66b9ff3/user-1000.journal
$ getfacl /var/log/journal/*/user-$UID.journal
...
user:john:r--
$ ./infoleak
$ journalctl --all --user --lines=1 --identifier=infoleak | hexdump -C
...
00000050 2e 20 2d 2d 0a 4e 6f 76 20 32 30 20 31 36 3a 30 |. --.Nov 20 16:0|
00000060 30 3a 33 36 20 6c 6f 63 61 6c 68 6f 73 74 2e 6c |0:36 localhost.l|
00000070 6f 63 61 6c 64 6f 6d 61 69 6e 20 69 6e 66 6f 6c |ocaldomain infol|
00000080 65 61 6b 5b 33 35 34 38 5d 3a 20 78 fb 1e 78 54 |eak[3548]: x..xT|
00000090 7f 0a |..|
- or (if journald's "Storage" is "volatile", or "auto" and
/var/log/journal/ does not exist) from a tty that we recorded to
/var/run/utmp, because journald writes ("walls") emergency messages
(LOG_EMERG) to the tty of every logged-in user; our exploit records a
tty to /var/run/utmp via an ssh connection to localhost, but other
methods exist (for example, utempter and gnome-pty-helper):
$ ./infoleak
...
00003510 0a 07 0d 0d 0a 42 72 6f 61 64 63 61 73 74 20 6d |.....Broadcast m|
00003520 65 73 73 61 67 65 20 66 72 6f 6d 20 73 79 73 74 |essage from syst|
00003530 65 6d 64 2d 6a 6f 75 72 6e 61 6c 64 40 6c 6f 63 |emd-journald@loc|
00003540 61 6c 68 6f 73 74 2e 6c 6f 63 61 6c 64 6f 6d 61 |alhost.localdoma|
00003550 69 6e 20 28 54 75 65 20 32 30 31 38 2d 31 31 2d |in (Tue 2018-11-|
00003560 32 30 20 31 36 3a 32 35 3a 34 36 20 43 53 54 29 |20 16:25:46 CST)|
00003570 3a 0d 0d 0a 0d 0d 0a 69 6e 66 6f 6c 65 61 6b 5b |:......infoleak[|
00003580 33 38 37 32 5d 3a 20 78 6b a2 e1 2f 7f 0d 0d 0a |3872]: xk../....|
This vulnerability was introduced in systemd v221:
commit ec5ff4445cca6a1d786b8da36cf6fe0acc0b94c8
Date: Wed Jun 10 22:33:44 2015 -0700
...
- e += strspn(p + e, WHITESPACE);
+ if (strchr(WHITESPACE, p[e]))
+ e++;
and was inadvertently fixed in August 2018:
commit a6aadf4ae0bae185dc4c414d492a4a781c80ffe5
Date: Wed Aug 8 15:06:36 2018 +0900
...
- if (strchr(WHITESPACE, p[e]))
- e++;
+ e += strspn(p + e, WHITESPACE);
commit 8595102d3ddde6d25c282f965573a6de34ab4421
Date: Fri Aug 10 11:07:54 2018 +0900
...
- e += strspn(p + e, WHITESPACE);
+ /* Single space is used as separator */
+ if (p[e] != '\0' && strchr(WHITESPACE, p[e]))
+ e++;
------------------------------------------------------------------------
Exploitation
------------------------------------------------------------------------
For today we will take the body parts and put them on the wall
-- System of a Down, "Dreaming"
To leak a stack address or an mmap address from journald:
- First, we send a large native message to /run/systemd/journal/socket;
journald mmap()s our message, and malloc()ates a large array of iovec
structures: most of these structures point into our mmap()ed message,
but some of them point to the stack (in dispatch_message_real()). The
contents of this iovec array (especially the mmap and stack pointers)
are preserved in a heap hole after free() (after journald finishes
processing our message).
- Next, we send a large syslog message to /run/systemd/journal/dev-log;
to receive our large message (in server_process_datagram()), journald
realloc()ates its server buffer into the heap hole that previously
contained the iovec array (and still contains remains of mmap and
stack pointers).
- Last, we send a large syslog message that exploits CVE-2018-16866;
journald receives our large message in its server buffer (in the heap
chunk that previously contained the iovec array), and if we carefully
choose the size of our message and position its terminating ":" in
front of a remaining mmap or stack pointer, then we can leak this
pointer (it is mistakenly read out-of-bounds as the body of our
message).
From this leaked stack pointer we easily deduce journald's stack pointer
before the alloca() jump, because the distance between the two depends
only on journald's executable.
From the leaked mmap address we can deduce libc's address, but chunks of
unknown sizes are mmap()ed between the two, and we must therefore adopt
different strategies based on our target architecture (i386 or amd64).
========================================================================
Combined Exploitation of CVE-2018-16865 and CVE-2018-16866
========================================================================
Don't leave your seats now
Popcorn everywhere ...
-- System of a Down, "CUBErt"
------------------------------------------------------------------------
amd64 Exploitation
------------------------------------------------------------------------
- To deduce libc's address from the leaked mmap address of our native
message, we arrange for this message to be mmap()ed into the 2MB hole
between ld.so's read-execute and read-only segments: from this hole's
address we deduce ld.so's address, and hence libc's address (with help
from ldd's output).
- If the resulting stack-to-libc distance is jumpable (if it is shorter
than 4GB), then we proceed with our "write-what-where"; otherwise, we
restart journald (we crash it with an alloca() of RLIMIT_STACK -- 8MB
by default) and try again.
We have a good chance of obtaining a jumpable stack-to-libc distance
(and hence a root shell) after 2048 tries * 2 seconds ~= 68 minutes
(by default, if journald crashes fewer than 5 times within 10 seconds,
it is restarted automatically by systemd).
- For the "write-where" part of our "write-what-where", we overwrite
libc's __free_hook function pointer, whose address modulo 16 is always
equal to 8 (on every amd64 distribution that we exploited).
- For the "write-what" part of our "write-what-where", we overwrite
__free_hook with the address of libc's system() function: whenever
journald free()s data that we control, we achieve arbitrary command
execution.
Last-minute note: on CentOS 7, the usual function pointers in libc's
read-write segment (__free_hook, __malloc_hook, etc) are not located at
multiples of 16 plus 8. To circumvent this problem:
- First, we overwrite the "_chain" pointer of stderr's FILE structure
with the address of our own fake FILE structure (this "_chain" pointer
is located at a multiple of 16 plus 8, in libc's read-write segment).
- Next, we corrupt one of malloc's internal variables (also in libc's
read-write segment).
- Last, we force a call to malloc() or free(), which detects the
corruption of its internal variable and calls abort(), which calls
_IO_flush_all_lockp(), which follows stderr's overwritten "_chain"
pointer to our fake FILE structure; we eventually achieve arbitrary
command execution by calling libc's system() via one of the function
pointers in our fake FILE structure.
------------------------------------------------------------------------
i386 Exploitation
------------------------------------------------------------------------
Our i386 exploit is very similar to the amd64 exploit, but:
- The stack-to-libc distance is always jumpable (it is roughly 128MB).
- There is no hole between ld.so's read-execute and read-only segments.
However, libc's address is randomized in a narrow range of 1MB and is
therefore brute forcible: we have a good chance of correctly guessing
libc's address after 1MB / 4KB = 256 tries * 2 seconds ~= 8 minutes.
- For the "write-where" part of our "write-what-where", we overwrite
libc's __malloc_hook function pointer (__free_hook was never located
at a multiple of 16 plus 8 or 12 on the i386 distributions that we
exploited, but __malloc_hook always is).
- For the "write-what" part of our "write-what-where", we overwrite
__malloc_hook with the address of a "mov esp, 0x89fffa5d ; ret" gadget
(or equivalent stack pivot): since our native message can be as large
as 768MB, we can mmap() it at 0x89fffa5d, take control of the stack,
and return into libc's execve().
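The time estimates for both architectures are simple products of a
number of tries and the 2 seconds that we allow per try. The helpers
below are a sketch of that arithmetic only; their names are ours, and
the 2-second pace is the one assumed in the estimates above:

```c
#include <assert.h>
#include <stdint.h>

/* Number of page-aligned candidates in a randomization range. */
static uint64_t candidate_count(uint64_t range_bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0);
    return range_bytes / page_bytes;
}

/* Total duration, in minutes, of `tries` attempts at a fixed pace. */
static double total_minutes(uint64_t tries, double seconds_per_try)
{
    return (double)tries * seconds_per_try / 60.0;
}
```

On i386, candidate_count(1MB, 4KB) yields 256 tries, or about 8.5
minutes at 2 seconds per try; on amd64, 2048 tries take about 68
minutes.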
========================================================================
Acknowledgments
========================================================================
We thank systemd's developers, Red Hat Product Security, and the members
of linux-distros@openwall.
========================================================================
Timeline
========================================================================
2018-11-26: Advisory sent to Red Hat Product Security (as recommended by
https://github.com/systemd/systemd/blob/master/docs/CONTRIBUTING.md#security-vulnerability-reports).
2018-12-26: Advisory and patches sent to linux-distros@openwall.
2019-01-09: Coordinated Release Date (6:00 PM UTC).