jonw's mayhem academy

0x84

A Primer On Berkeley Packet Filters (BPF)


I was recently tasked with investigating Berkeley Packet Filters (BPF) as a possible replacement for our iptables firewall system. I had never heard of BPF before, but that has never stopped a professional sysadmin before and it wasn’t going to now. I dutifully started searching for BPF, what it was, and what we might be able to do with it. I found lots of information, but it was mostly geared towards someone who already knew what BPF was, which definitely was not me. It took me a while to get a grip on the subject matter because I could not find a simple primer to bootstrap my knowledge. So, I wrote one and here it is.

This will not be a deep-dive into BPF. Rather, it will be an overview of its history, some information on what it is and how to use it, and a pointer to some tools to get you started.

Who names this stuff?

BPF has been around a long time, and the first confusing thing I ran across was the naming history. The original BPF was named just that: BPF. It was introduced into the kernel in version 2.1.75 in 1997. Many years later, in 2014, BPF was rewritten to be more capable and started showing up in kernel versions 3.16 onwards, depending on the architecture. Being more capable, it was only natural that it was given the moniker “eBPF” for ‘extended’ BPF. The original BPF then became referred to as ‘classic’ BPF, or “cBPF”. These days, cBPF is effectively legacy: the kernel translates any cBPF code out there into eBPF internally, so eBPF is backwards compatible and the name has just gone back to simply BPF. That explanation alone will save you hours of searching and trying to make sense of the different names used in the many historical documents about BPF.

Got that?

BPF -> cBPF and eBPF -> BPF

What is it?

BPF started life as a network packet filter. It allowed sysadmins to write filter code and pass it to the kernel. The kernel would return only the packets that matched the code, effectively dropping those that did not. These days, BPF is capable of much, much more than just network filtering.

BPF programs are written in bytecode that is compiled and run by the kernel. The kernel contains a JIT (Just In Time) compiler and a virtual machine to execute the code. BPF programs are simple step-by-step programs that evaluate some condition, and return a value.
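To make the "evaluate some condition, and return a value" idea concrete, here is a minimal sketch of an interpreter for three classic BPF instructions, written in Python rather than in the kernel's C. The opcode numbers are the classic BPF encodings; the `run` function and the three-step sample program are my own illustrations, not anything the kernel ships, and the real virtual machine handles many more instructions.

```python
# Classic BPF opcode encodings (the same numbers that appear in raw bytecode).
LDB_ABS = 0x30  # A = packet[k]  (load one byte at absolute offset k)
JEQ_K = 0x15    # if A == k skip jt instructions, otherwise skip jf
RET_K = 0x06    # stop and return k (0 means "no match")


def run(program, packet):
    """Interpret a list of (code, jt, jf, k) tuples against packet bytes."""
    a = 0   # the accumulator register
    pc = 0  # index of the next instruction to execute
    while True:
        code, jt, jf, k = program[pc]
        pc += 1
        if code == LDB_ABS:
            if k >= len(packet):
                return 0  # reading past the end of the packet rejects it
            a = packet[k]
        elif code == JEQ_K:
            pc += jt if a == k else jf
        elif code == RET_K:
            return k
        else:
            raise ValueError(f"opcode {code:#x} is not handled by this sketch")


# Hypothetical filter: match packets whose first byte is 0x45, which is the
# usual first byte of an IPv4 header, and reject everything else.
program = [
    (LDB_ABS, 0, 0, 0),       # A = packet[0]
    (JEQ_K, 0, 1, 0x45),      # equal: fall through; not equal: skip one
    (RET_K, 0, 0, 0xFFFF),    # match (non-zero return value)
    (RET_K, 0, 0, 0),         # no match
]

print(run(program, bytes([0x45, 0x00])))
print(run(program, bytes([0x60, 0x00])))
```

Each instruction is four numbers: an opcode, a jump offset to use when a comparison is true, one to use when it is false, and a general-purpose value. That same four-number layout shows up in every piece of raw bytecode later in this article.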

At first glance, allowing user code to run in the kernel sounds like a terrible idea. If a user space program crashes, that’s annoying but fixable. If the kernel crashes, you have a bigger problem. To mitigate this, the kernel verifies every program before accepting it, reviewing the code to make sure it cannot get stuck or run away. It does this by enforcing some simple rules, such as making sure that a BPF program can only jump forward during execution, never backwards, to avoid runaway loops.
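The forward-only rule described above can be sketched as a static check that runs before any instruction is executed. This is a simplified illustration under my own assumptions, not the kernel's actual checking code, which enforces far more than this. The function name `check_program` and the sample programs are hypothetical; the opcode classes are the classic BPF encodings.

```python
BPF_CLASS_MASK = 0x07  # the low three bits of an opcode give its class
BPF_JMP = 0x05         # class: jump instructions
BPF_RET = 0x06         # class: return instructions
BPF_JA = 0x05          # unconditional jump; its offset lives in k


def check_program(program):
    """Return a list of problems; an empty list means the checks passed."""
    if not program:
        return ["program is empty"]
    problems = []
    n = len(program)
    for pc, (code, jt, jf, k) in enumerate(program):
        if code & BPF_CLASS_MASK != BPF_JMP:
            continue
        # Jump offsets are relative to the instruction after the jump.
        offsets = [k] if code == BPF_JA else [jt, jf]
        for off in offsets:
            if off < 0:
                problems.append(f"instruction {pc}: backward jump ({off})")
            elif pc + 1 + off >= n:
                problems.append(f"instruction {pc}: jump past end of program")
    # Every path must eventually hit a return, so the last one must be a ret.
    if program[-1][0] & BPF_CLASS_MASK != BPF_RET:
        problems.append("last instruction is not a return")
    return problems


good = [(0x30, 0, 0, 0), (0x15, 0, 1, 0x45), (0x06, 0, 0, 0xFFFF), (0x06, 0, 0, 0)]
looping = [(0x30, 0, 0, 0), (0x15, -2, 0, 0x45), (0x06, 0, 0, 0)]

print(check_program(good))
print(check_program(looping))
```

Because every jump moves strictly forward and stays inside the program, execution has to reach a return in at most as many steps as there are instructions. That is why such a program cannot loop forever.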

What can I do with it?

That’s the million dollar question. It’s kind of like asking “what can I use C for?” There’s probably no functional limit to the possible uses of BPF. It can make things you already do much faster, and it can introduce new functionality that you can’t do now at all. Some examples are probably in order here, and the bcc GitHub project comes stocked with a tools directory full of Python scripts to get you up and running with some quick wins. More on that later.

A faster iptables

This is the door through which I entered the BPF arena. Many organizations that process large amounts of network traffic end up one day realizing that iptables isn’t performant enough at scale. When that happens, the next logical technology is something that can do the same job, but with less load on the system. BPF excels at this.

The iptables command itself is a user space tool, but the rules it installs are evaluated inside the kernel by netfilter. Every packet coming in on a network card has to travel a fair way up the kernel’s network stack, and then be checked against the rules one after another, before iptables can drop it. This takes almost no time at all per packet, so it is unnoticeable until a certain level of scale is achieved.

BPF is faster than iptables because packets can be inspected and dropped much earlier in the kernel’s receive path, before most of that work has been done. Earlier processing means faster processing. When you’re looking to drop 10 million packets a second, you need to look at BPF.

This is such a common use for BPF that iptables comes with an extension that supports BPF bytecode. For example, this iptables rule:

/sbin/iptables -A INPUT -p udp -j DROP -m comment --comment "Drop UDP packets"

Becomes this BPF rule, and we can still use iptables to inject it into the kernel if the BPF extension is enabled:

/sbin/iptables -A INPUT -m bpf --bytecode "15,48 0 0 0,84 0 0 240,21 0 5 96,48 0 0 6,21 8 0 17,21 0 8 44,48 0 0 40,21 5 6 17,48 0 0 0,84 0 0 240,21 0 3 64,48 0 0 9,21 0 1 17,6 0 0 65535,6 0 0 0" -j DROP -m comment --comment "Drop UDP packets"
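The bytecode string in that rule is less opaque than it looks. The first number is the instruction count, and each comma-separated group after it is one instruction in the form "opcode, jump-if-true, jump-if-false, value". Below is a small Python sketch that prints the rule in a readable form. The `disassemble` function is my own helper, not part of iptables, and it only knows the four opcodes this particular rule uses.

```python
# Mnemonics for the four classic BPF opcodes that appear in this rule.
NAMES = {48: "ldb", 84: "and", 21: "jeq", 6: "ret"}

RULE = ("15,48 0 0 0,84 0 0 240,21 0 5 96,48 0 0 6,21 8 0 17,21 0 8 44,"
        "48 0 0 40,21 5 6 17,48 0 0 0,84 0 0 240,21 0 3 64,48 0 0 9,"
        "21 0 1 17,6 0 0 65535,6 0 0 0")


def disassemble(bytecode):
    """Turn 'count,code jt jf k,...' into a list of readable lines."""
    fields = bytecode.split(",")
    count = int(fields[0])
    insns = [tuple(int(x) for x in field.split()) for field in fields[1:]]
    if len(insns) != count:
        raise ValueError(f"expected {count} instructions, got {len(insns)}")
    lines = []
    for pc, (code, jt, jf, k) in enumerate(insns):
        name = NAMES.get(code, f"op{code}")
        if code == 48:    # load one byte from packet offset k
            text = f"{name} [{k}]"
        elif code == 21:  # compare; offsets count from the next instruction
            text = f"{name} #{k:#x} jt {pc + 1 + jt} jf {pc + 1 + jf}"
        elif code == 6:   # return k
            text = f"{name} #{k}"
        else:             # 'and', or an opcode this sketch does not know
            text = f"{name} #{k:#x}"
        lines.append(f"({pc:03d}) {text}")
    return lines


for line in disassemble(RULE):
    print(line)
```

Read this way, the program loads byte 0 and masks it with 0xf0 to get the IP version. For IPv4 (0x40) it then loads byte 9, the protocol field, and compares it with 0x11, which is 17, the protocol number for UDP. A match returns 65535 and anything else returns 0.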

Bonus tip: You will want to ensure you write excellent comments because the bytecode is much less readable than normal iptables syntax.

A better system analyzer

There are a number of user space tools like top and ss that can give a sysadmin information on what is happening on a system at any given time. However, they generally only sample the system periodically, so it is easy to miss short-lived processes or brief block I/O issues that cause problems but never get picked up by these tools. BPF sees everything because it runs in the kernel.

execsnoop and opensnoop are excellent examples of BPF tools. execsnoop can tell you every single process that executes on your system, no matter how quickly it exits, and opensnoop does the same for every file that gets opened.

# ./opensnoop.py
PID    COMM          FD ERR PATH
22771  pickup        12   0 maildrop
22905  opensnoop.py  -1   2 /usr/lib64/python2.7/encodings/ascii.so
22905  opensnoop.py  -1   2 /usr/lib64/python2.7/encodings/asciimodule.so
22905  opensnoop.py  12   0 /usr/lib64/python2.7/encodings/ascii.py
22905  opensnoop.py  13   0 /usr/lib64/python2.7/encodings/ascii.pyc
1      systemd       13   0 /proc/577/cgroup
1      systemd       13   0 /proc/802/cgroup

Another tool, biolatency, has a funny sounding name, but it looks at Block I/O Latency (see the name, now?) and it can help identify disk read and write issues.

A complete network tracer

BPF can be used to track kernel calls to network functions like connect() and accept() and print those connections to the terminal. You will not miss a single TCP connect this way.

# ./tcpconnect
PID    COMM    IP SADDR           DADDR           DPORT
1479   telnet  4  127.0.0.1       127.0.0.1       23
1469   curl    4  10.201.219.236  54.245.105.25   80
1469   curl    4  10.201.219.236  54.67.101.145   80
1991   telnet  6  ::1             ::1             23

How do I get started?

There are a couple of git repos full of tools that will help you get started. They’re primarily filled with example Python scripts, but those scripts are so useful they may be all you need. Or, at least, they can provide a strong springboard to customize for your needs instead of starting from scratch. There are also some tools you’re probably already familiar with that can help, and of course good old-school reading is always effective.

Sample tool repos

The two main projects you’ll likely end up using are bcc and bpftrace. They are in a GitHub project named IOVisor, which has a number of related interesting projects in it as well.

These tools are heavily used in Brendan Gregg’s recently released book, BPF Performance Tools. The book is almost certain to be overkill for a beginner because Gregg goes into detail about everything BPF can be used for, which is a pretty large knowledge surface. However, the book is laid out so that you can also jump around to the things you want to know right now.

Cloudflare tools

Cloudflare publishes a lot of their tools for others to use. The tools the Cloudflare developers have released for BPF are concerned mostly with handling DNS packets. As such, they provide a decent baseline to learn some things from, but unless you’re also in the DNS business, the tools won’t be an exact match for you. I found it was easier to use the bcc and bpftrace tools than it was to modify the Cloudflare tools.

Existing tools

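Believe it or not, you can learn a lot about generating BPF compatible bytecode from that old standby tcpdump. It has a -d option which causes tcpdump to print the compiled filter instead of capturing packets. Given three times (-ddd), it prints the bytecode as plain decimal numbers, preceded by a count of instructions.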

# tcpdump -ddd 'port 443 and tcp'
22
40 0 0 12
21 0 7 34525
48 0 0 20
21 17 0 132
21 0 16 6
40 0 0 54
21 13 0 443
40 0 0 56
21 11 12 443
21 0 11 2048
48 0 0 23
21 9 0 132
21 0 8 6
40 0 0 20
69 6 0 8191
177 0 0 14
72 0 0 14
21 2 0 443
72 0 0 16
21 0 1 443
6 0 0 262144
6 0 0 0

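Unfortunately, this bytecode can’t be jammed into the iptables BPF extension as-is because the two start examining the packet at different points. On an Ethernet interface, tcpdump compiles its filter against the link-layer header, while the iptables extension starts at the IP header, so the offsets do not line up. But, you can still see the basic bytecode.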
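The layout difference between the two formats, at least, is easy to bridge. tcpdump -ddd prints the count and then one instruction per line, while the iptables bytecode string carries the same numbers on one line separated by commas. Here is a hedged Python sketch of that conversion. The `ddd_to_bytecode` function is a hypothetical helper of mine, and it only changes the layout. It does nothing about the offset mismatch just described, so its output for this particular filter is still not something to load into iptables.

```python
DDD_OUTPUT = """\
22
40 0 0 12
21 0 7 34525
48 0 0 20
21 17 0 132
21 0 16 6
40 0 0 54
21 13 0 443
40 0 0 56
21 11 12 443
21 0 11 2048
48 0 0 23
21 9 0 132
21 0 8 6
40 0 0 20
69 6 0 8191
177 0 0 14
72 0 0 14
21 2 0 443
72 0 0 16
21 0 1 443
6 0 0 262144
6 0 0 0
"""


def ddd_to_bytecode(ddd_output):
    """Join 'tcpdump -ddd' lines into the 'count,code jt jf k,...' layout."""
    lines = [ln.strip() for ln in ddd_output.splitlines() if ln.strip()]
    count = int(lines[0])  # the first line is the number of instructions
    insns = lines[1:]
    if len(insns) != count:
        raise ValueError(f"expected {count} instructions, got {len(insns)}")
    for insn in insns:
        if len(insn.split()) != 4:
            raise ValueError(f"malformed instruction: {insn!r}")
    return ",".join([str(count)] + insns)


print(ddd_to_bytecode(DDD_OUTPUT))
```

The count check is there because a string whose leading number disagrees with the number of instructions is malformed, and it is better to find that out before handing it to anything that loads code into the kernel.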

Fun fact: a single -d will get you almost readable bytecode, which is a great learning tool. This is the same filter, just written out line by line, and you can see where it jumps to different lines in the code depending on how the evaluation of that line turned out: jt is the line to jump to if the comparison is true, and jf is the line to jump to if it is false. Also note that the jumps are always forward, which is part of the kernel’s safety enforcement that I mentioned earlier.

# tcpdump -d 'port 443 and tcp'
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 9
(002) ldb      [20]
(003) jeq      #0x84            jt 21   jf 4
(004) jeq      #0x6             jt 5    jf 21
(005) ldh      [54]
(006) jeq      #0x1bb           jt 20   jf 7
(007) ldh      [56]
(008) jeq      #0x1bb           jt 20   jf 21
(009) jeq      #0x800           jt 10   jf 21
(010) ldb      [23]
(011) jeq      #0x84            jt 21   jf 12
(012) jeq      #0x6             jt 13   jf 21
(013) ldh      [20]
(014) jset     #0x1fff          jt 21   jf 15
(015) ldxb     4*([14]&0xf)
(016) ldh      [x + 14]
(017) jeq      #0x1bb           jt 20   jf 18
(018) ldh      [x + 16]
(019) jeq      #0x1bb           jt 20   jf 21
(020) ret      #262144
(021) ret      #0

More information

My main interest in BPF is for networking stuff. As such, I need to know how packets are constructed so that I can write code that loads the correct bits of a packet during execution to ensure that my evaluations work. Jeff Stebelton’s primer on BPF from a networking perspective (PDF) helped me a lot.

Happy filtering!

my shorter content on the fediverse: https://the.mayhem.academy/@jdw