This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Tools and languages

User space helpers and patterns that make building and debugging eBPF programs easier

1 - bpftrace

DTrace style one liners for quick ad hoc tracing of kernel and user processes.

bpftrace is a powerful, high-level tracing language for Linux that simplifies the process of creating eBPF (Extended Berkeley Packet Filter) programs. It simplifies the process of instrumenting kernel and user-space code by providing a simple language to attach probes to kernel functions, tracepoints, and user-defined events in a user-friendly syntax, inspired by awk, C, and other tracing tools, enabling users to quickly gain insights into system behavior. By abstracting away the complexities of low-level eBPF programming and leveraging libbpf as its backend, bpftrace allows system administrators, performance engineers, and developers to easily observe and analyze system performance without requiring extensive eBPF expertise. Let’s start by looking at the bpftrace command.

bpftrace Options

When running bpftrace, you can use various command-line options to control its behavior. Some commonly used options include:

OPTIONS:
    -B MODE        output buffering mode ('full', 'none')
    -f FORMAT      output format ('text', 'json')
    -o file        redirect bpftrace output to file
    -e 'program'   execute this program
    -h, --help     show this help message
    -I DIR         add the directory to the include search path
    --include FILE add an #include file before preprocessing
    -l [search|filename]
                   list kernel probes or probes in a program
    -p PID         enable USDT probes on PID
    -c 'CMD'       run CMD and enable USDT probes on resulting process
    --usdt-file-activation
                   activate usdt semaphores based on file path
    --unsafe       allow unsafe/destructive functionality
    -q             keep messages quiet
    --info         Print information about kernel BPF support
    -k             emit a warning when a bpf helper returns an error (except read functions)
    -kk            check all bpf helper functions
    -V, --version  bpftrace version
    --no-warnings  disable all warning messages

For example, we can use -l along with * for wildcard for listing such as listing all kprobes:

sudo bpftrace -l 'kprobe:*' 
[...]
kprobe:zswap_store
kprobe:zswap_swapoff
kprobe:zswap_swapon
kprobe:zswap_total_pages
kprobe:zswap_writeback_entry
kprobe:zswap_writeback_show
kprobe:zswap_writeback_write
kprobe:zswap_zpool_param_set

We can list probe parameters for a certain function using

sudo bpftrace -lv 'fentry:tcp_reset'
fentry:vmlinux:tcp_reset
    struct sock * sk
    struct sk_buff * skb

We can list all symbols from object or binary files for uprobe such as the following:

sudo bpftrace -l 'uprobe:/bin/bash:*'
[...]
uprobe:/bin/bash:async_redirect_stdin
uprobe:/bin/bash:base_pathname
uprobe:/bin/bash:bash_add_history
uprobe:/bin/bash:bash_brace_completion
uprobe:/bin/bash:bash_clear_history
uprobe:/bin/bash:bash_default_completion
uprobe:/bin/bash:bash_delete_histent
uprobe:/bin/bash:bash_delete_history_range
uprobe:/bin/bash:bash_delete_last_history
uprobe:/bin/bash:bash_dequote_text
[...]

Using -e can be used to execute a program in one-liner. For example,

sudo bpftrace -e 'uprobe:/bin/bash:shell_execve { printf("shell_execve called\n"); }'
Attaching 1 probe...
open() called
open() called
open() called

This program uprobe:/bin/bash:shell_execve { printf("shell_execve called\n"); } means this action printf("shell_execve called\n"); will be executed when uprobe:/bin/bash:shell_execve get triggered.

If we want to print out which command is being executed is by printing the first argument with arg0 using str function which reads a NULL terminated string similar to bpf_probe_read_str helper function. argN is a bpf builtins while hold arguments passed to the function being traced and it can be used with kprobe and uprobe.

sudo bpftrace -e 'uprobe:/bin/bash:shell_execve { printf("command:%s\n", str(arg0)); }'
Attaching 1 probe...
command:/usr/bin/ls
command:/usr/bin/ping
command:/usr/bin/cat

The following table is from the bpftrace manual, listing special variables along with their corresponding helper functions and descriptions.

Variable Type Kernel BPF Helper Description
$1, $2, ...$n int64 n/a n/a The nth positional parameter passed to the bpftrace program. If less than n parameters are passed this evaluates to 0. For string arguments use the str() call to retrieve the value.
$# int64 n/a n/a Total amount of positional parameters passed.
arg0, arg1, ...argn int64 n/a n/a nth argument passed to the function being traced. These are extracted from the CPU registers. The amount of args passed in registers depends on the CPU architecture. (kprobes, uprobes, usdt).
args struct args n/a n/a The struct of all arguments of the traced function. Available in tracepoint, fentry, fexit, and uprobe (with DWARF) probes. Use args.x to access argument x or args to get a record with all arguments.
cgroup uint64 4.18 get_current_cgroup_id ID of the cgroup the current process belongs to. Only works with cgroupv2.
comm string[16] 4.2 get_current_comm Name of the current thread.
cpid uint32 n/a n/a Child process ID, if bpftrace is invoked with -c.
cpu uint32 4.1 raw_smp_processor_id ID of the processor executing the BPF program.
curtask uint64 4.8 get_current_task Pointer to struct task_struct of the current task.
elapsed uint64 (see nsec) ktime_get_ns / ktime_get_boot_ns Nanoseconds elapsed since bpftrace initialization, based on nsecs.
func string n/a n/a Name of the current function being traced (kprobes, uprobes).
gid uint64 4.2 get_current_uid_gid Group ID of the current thread, as seen from the init namespace.
jiffies uint64 5.9 get_jiffies_64 Jiffies of the kernel. In 32-bit systems, using this builtin might be slower.
numaid uint32 5.8 numa_node_id ID of the NUMA node executing the BPF program.
pid uint32 4.2 get_current_pid_tgid Process ID of the current thread (aka thread group ID), as seen from the init namespace.
probe string n/na n/a Name of the current probe.
rand uint32 4.1 get_prandom_u32 Random number.
return n/a n/a n/a The return keyword is used to exit the current probe. This differs from exit() in that it doesn’t exit bpftrace.
retval uint64 n/a n/a Value returned by the function being traced (kretprobe, uretprobe, fexit). For kretprobe and uretprobe, its type is uint64, but for fexit it depends. You can look up the type using bpftrace -lv.
tid uint32 4.2 get_current_pid_tgid Thread ID of the current thread, as seen from the init namespace.
uid uint64 4.2 get_current_uid_gid User ID of the current thread, as seen from the init namespace.

The following table is from the bpftrace manual, listing bpftrace functions along with their corresponding descriptions.

Name Description
bswap Reverse byte order
buf Returns a hex-formatted string of the data pointed to by d
cat Print file content
cgroupid Resolve cgroup ID
cgroup_path Convert cgroup id to cgroup path
exit Quit bpftrace with an optional exit code
join Print the array
kaddr Resolve kernel symbol name
kptr Annotate as kernelspace pointer
kstack Kernel stack trace
ksym Resolve kernel address
len Count ustack/kstack frames
macaddr Convert MAC address data
nsecs Timestamps and Time Deltas
ntop Convert IP address data to text
offsetof Offset of element in structure
override Override return value
path Return full path
percpu_kaddr Resolve percpu kernel symbol name
print Print a non-map value with default formatting
printf Print formatted
pton Convert text IP address to byte array
reg Returns the value stored in the named register
signal Send a signal to the current process
sizeof Return size of a type or expression
skboutput Write skb ’s data section into a PCAP file
str Returns the string pointed to by s
strcontains Compares whether the string haystack contains the string needle.
strerror Get error message for errno code
strftime Return a formatted timestamp
strncmp Compare first n characters of two strings
system Execute shell command
time Print formatted time
uaddr Resolve user-level symbol name
uptr Annotate as userspace pointer
ustack User stack trace
usym Resolve user space address

How to Code in bpftrace

bpftrace scripts are written using a custom domain-specific language (DSL) that is similar in syntax to awk. A basic script consists of one or more probe definitions followed by one or more actions. Each probe targets a specific event (e.g., kernel tracepoints, function entry/exit, or user-space events).

The following table is from the bpftrace manual, listing bpftrace probes along with their corresponding descriptions.

Probe Name Short Name Description Kernel/User Level
BEGIN/END - Built-in events Kernel/User
self - Built-in events Kernel/User
hardware h Processor-level events Kernel
interval i Timed output Kernel/User
iter it Iterators tracing Kernel
fentry/fexit f/fr Kernel functions tracing with BTF support Kernel
kprobe/kretprobe k/kr Kernel function start/return Kernel
profile p Timed sampling Kernel/User
rawtracepoint rt Kernel static tracepoints with raw arguments Kernel
software s Kernel software events Kernel
tracepoint t Kernel static tracepoints Kernel
uprobe/uretprobe u/ur User-level function start/return User
usdt U User-level static tracepoints User
watchpoint/asyncwatchpoint w/aw Memory watchpoints Kernel

Basic Structure of a bpftrace Script

probe_type:probe_identifier
{
    // Action code block
    printf("Hello, world!\n");
}

For example, to print a message every time a process calls the unlinkat() syscall, you might write:

#!/usr/bin/env bpftrace

tracepoint:syscalls:sys_enter_unlinkat
{
    printf("unlinkat syscall invoked\n");
}

sys_enter_unlinkat tracepoint’s arguments can be listed from /sys/kernel/debug/tracing/events/syscalls/sys_enter_unlinkat/format

name: sys_enter_unlinkat
ID: 849
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:int __syscall_nr;	offset:8;	size:4;	signed:1;
	field:int dfd;	offset:16;	size:8;	signed:0;
	field:const char * pathname;	offset:24;	size:8;	signed:0;
	field:int flag;	offset:32;	size:8;	signed:0;

print fmt: "dfd: 0x%08lx, pathname: 0x%08lx, flag: 0x%08lx", ((unsigned long)(REC->dfd)), ((unsigned long)(REC->pathname)), ((unsigned long)(REC->flag))

Therefore, we can use str(args.pathname) to extract the name of the file being deleted. args is one of the bpftrace builtins which is a data struct of all arguments of the traced function and it can be used with tracepoint, fentry, fexit.

#!/usr/bin/env bpftrace

tracepoint:syscalls:sys_enter_unlinkat
{
    printf("Process %s (PID: %d) is deleting a file %s\n", comm, pid, str(args.pathname));
}
Attaching 1 probe...
Process rm (PID: 2269) is deleting a file test1
Process rm (PID: 2270) is deleting a file test2

Let’s convert this eBPF kernel code to bpftrace

#define __TARGET_ARCH_x86
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

struct event {
    pid_t pid;
    char filename[256];
    umode_t mode;
};

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); // Type of BPF map
    __uint(max_entries, 1024);                   // Maximum number of entries in the map
    __type(key, int);                            // Type of the key
    __type(value, int);                          // Type of the value
} mkdir SEC(".maps");

char LICENSE[] SEC("license") = "Dual BSD/GPL";

SEC("kprobe/do_mkdirat")
int BPF_KPROBE(do_mkdirat, int dfd, struct filename *name, umode_t mode)
{
    pid_t pid = bpf_get_current_pid_tgid() >> 32;
    struct event ev = {};
    ev.pid = pid;
    ev.mode = mode;
    const char *filename = BPF_CORE_READ(name, name);
    bpf_probe_read_str(ev.filename, sizeof(ev.filename), filename);
    bpf_perf_event_output(ctx, &mkdir, BPF_F_CURRENT_CPU, &ev, sizeof(ev));
    return 0;
}

Let’s build the same code without Maps as it will be explained shortly

#!/usr/bin/env bpftrace

kprobe:do_mkdirat
{
  printf("PID: %d, mode: %d, filename: %s\n", pid, arg2, str(((struct filename *)arg1)->name));
}

The idea is to cast arg1 to a pointer to struct filename before accessing name field.

bpftrace Maps

Maps in bpftrace are defined with @ such as @testmap. The following table is from bpftrace manual, listing bpftrace map functions along with their corresponding descriptions.

Name Description
avg Calculate the running average of n between consecutive calls.
clear Clear all keys/values from a map.
count Count how often this function is called.
delete Delete a single key from a map.
has_key Return true (1) if the key exists in this map. Otherwise return false (0).
hist Create a log2 histogram of n using buckets per power of 2, 0 <= k <= 5, defaults to 0.
len Return the number of elements in a map.
lhist Create a linear histogram of n. lhist creates M ((max - min) / step) buckets in the range [min, max) where each bucket is step in size.
max Update the map with n if n is bigger than the current value held.
min Update the map with n if n is smaller than the current value held.
stats Combines the count, avg and sum calls into one.
sum Calculate the sum of all n passed.
zero Set all values for all keys to zero.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);
    __type(value, u8);
} forks SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);
    __type(value, u8);
} setuid SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_fork")
int trace_fork(struct trace_event_raw_sys_enter *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u8 val = 1;

    bpf_map_update_elem(&forks, &pid, &val, BPF_ANY);
    bpf_printk("Fork detected: PID %d\n", pid);
    return 0;
}

SEC("tracepoint/syscalls/sys_enter_setuid")
int trace_setuid(struct trace_event_raw_sys_enter *ctx)
{
    u32 uid = ctx->args[0];
    if (uid == 0) {
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        u8 val = 1;
        bpf_map_update_elem(&setuid, &pid, &val, BPF_ANY);
        bpf_printk("Setuid detected: PID %d\n", pid);
    }
    return 0;
}

SEC("tracepoint/syscalls/sys_enter_execve")
int trace_execve(struct trace_event_raw_sys_enter *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u8 *forked = bpf_map_lookup_elem(&forks, &pid);
    u8 *priv = bpf_map_lookup_elem(&setuid, &pid);

    if (forked && priv) {
        bpf_printk("Privilege escalation detected: fork, setuid(0), execve, PID %d\n", pid);
        bpf_send_signal(9);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

Let’s see the previous code in bpftrace:

#!/usr/bin/env bpftrace         

tracepoint:syscalls:sys_enter_fork
{
    @forks[pid] = 1;
    printf("Fork detected: PID %d\n", pid);
}

tracepoint:syscalls:sys_enter_setuid
{
    if (uid == 0)
    {
        @setuid[pid] = 1;
        printf("Setuid detected: PID %d\n", pid);
    }
}

tracepoint:syscalls:sys_enter_execve
{
    if (@forks[pid] == 1 && @setuid[pid] == 1)
    {
        printf("Privilege escalation detected: fork, setuid(0), execve, PID %d\n", pid);
        signal(9)
    }
}

Define a map with name forks and add currentpid as key and 1 as value if sys_enter_setuid tracepoint is triggered.

@forks[pid] = 1;

Define a map with name setuid and add current pid as key and 1 as value if sys_enter_fork tracepoint is triggered and UID is zero.

@setuid[pid] = 1;

If sys_enter_execve is triggered, hen it will check if the current pid triggered by sys_enter_setuid and sys_enter_fork

if (@forks[pid] == 1 && @setuid[pid] == 1)

signal function is equivalent to bpf_send_signal helper function to terminate the process.

signal(9)

We have to run this code with --unsafe because we running dangerous function which is signal, then to run it sudo bpftrace --unsafe priv-esc.bt. This code is much smaller and simpler than eBPF kernel code, and no need for user-space code.

The next script attaches probes to the sys_enter_read and sys_enter_write syscalls (separated with comma ,) and uses a map to count the number of system calls per process using count() map function.

#!/usr/bin/env bpftrace

tracepoint:syscalls:sys_enter_read,
tracepoint:syscalls:sys_enter_write
{
    @syscalls[comm] = count();
}

interval:s:5 {
   printf("\033[H\033[2J");
   print(@syscalls);
}

This will activate every 5 seconds (using interval probe) to clear the screen using ANSI escape sequences printf("\033[H\033[2J");, then print the content of syscalls map.

interval:s:5 { 
   printf("\033[H\033[2J");
   print(@syscalls);
}
@syscalls[systemd-timesyn]: 1
@syscalls[systemd-journal]: 1
@syscalls[systemd]: 4
@syscalls[rtkit-daemon]: 8
@syscalls[sudo]: 10
@syscalls[gnome-shell]: 13
@syscalls[gvfsd-wsdd]: 16
@syscalls[bash]: 20
@syscalls[ls]: 26
@syscalls[bpftrace]: 47
@syscalls[sshd-session]: 818

bpftrace Tools

The following tools from bpftrace github repository. They cover a wide range of functions from tracing I/O and network events to monitoring process and syscall activity.

Name Description
bashreadline.bt Print entered bash commands system wide. Examples.
biolatency.bt Block I/O latency as a histogram. Examples.
biosnoop.bt Block I/O tracing tool, showing per I/O latency. Examples.
biostacks.bt Show disk I/O latency with initialization stacks. Examples.
bitesize.bt Show disk I/O size as a histogram. Examples.
capable.bt Trace security capability checks. Examples.
cpuwalk.bt Sample which CPUs are executing processes. Examples.
dcsnoop.bt Trace directory entry cache (dcache) lookups. Examples.
execsnoop.bt Trace new processes via exec() syscalls. Examples.
gethostlatency.bt Show latency for getaddrinfo/gethostbyname[2] calls. Examples.
killsnoop.bt Trace signals issued by the kill() syscall. Examples.
loads.bt Print load averages. Examples.
mdflush.bt Trace md flush events. Examples.
naptime.bt Show voluntary sleep calls. Examples.
opensnoop.bt Trace open() syscalls showing filenames. Examples.
oomkill.bt Trace OOM killer. Examples.
pidpersec.bt Count new processes (via fork). Examples.
runqlat.bt CPU scheduler run queue latency as a histogram. Examples.
runqlen.bt CPU scheduler run queue length as a histogram. Examples.
setuids.bt Trace the setuid syscalls: privilege escalation. Examples.
ssllatency.bt Summarize SSL/TLS handshake latency as a histogram. Examples.
sslsnoop.bt Trace SSL/TLS handshake, showing latency and return value. Examples.
statsnoop.bt Trace stat() syscalls for general debugging. Examples.
swapin.bt Show swapins by process. Examples.
syncsnoop.bt Trace sync() variety of syscalls. Examples.
syscount.bt Count system calls. Examples.
tcpaccept.bt Trace TCP passive connections (accept()). Examples.
tcpconnect.bt Trace TCP active connections (connect()). Examples.
tcpdrop.bt Trace kernel-based TCP packet drops with details. Examples.
tcplife.bt Trace TCP session lifespans with connection details. Examples.
tcpretrans.bt Trace TCP retransmits. Examples.
tcpsynbl.bt Show TCP SYN backlog as a histogram. Examples.
threadsnoop.bt List new thread creation. Examples.
undump.bt Capture UNIX domain socket packages. Examples.
vfscount.bt Count VFS calls. Examples.
vfsstat.bt Count some VFS calls, with per-second summaries. Examples.
writeback.bt Trace file system writeback events with details. Examples.
xfsdist.bt Summarize XFS operation latency distribution as a histogram. Examples.

2 - BCC

BPF Compiler Collection a Python and Lua toolkit that wraps Clang so you can embed small eBPF snippets in scripts and run them with just a few lines of loader code.

BCC in short is a toolkit that makes eBPF development easier by providing a higher-level interface. It compiles your eBPF C code at runtime to match the target kernel’s data structures. BCC works with languages like Python, Lua, and C++, and includes helpful macros and shortcuts for simpler programming. Essentially, BCC takes your eBPF program as a C string, preprocesses it, and then compiles it using clang.

The following is a crash course on BCC. I strongly recommend reading the BCC manual as well—it’s incredibly detailed and covers topics that are too extensive for this chapter.

Probes Definition

  1. kprobe: kprobe__ followed by the name of kernel function name. For example, int kprobe__do_mkdirat(struct pt_regs *ctx). struct pt_regs *ctx as a context for kprobe. Arguments can be extracted using PT_REGS_PARM1(ctx), PT_REGS_PARM2(ctx), … macros.
  2. kretprobe: kretprobe__ followed by the name of kernel function name. For example, int kretprobe__do_mkdirat(struct pt_regs *ctx). Return value can be extracted using PT_REGS_RC(ctx) macro.
  3. uprobes: Can be declared as regular C function. For example, int function(struct pt_regs *ctx). Arguments can be extracted using PT_REGS_PARM1(ctx), PT_REGS_PARM2(ctx), … macros.
  4. uretprobes: Can be declared as regular C functionint function(struct pt_regs *ctx). Return value can be extracted using PT_REGS_RC(ctx) macro.
  5. Tracepoints: TRACEPOINT_PROBE followed by (category, event) . For example, TRACEPOINT_PROBE(sched,sched_process_exit). Arguments are available in an args struct and you can list of argument from the format file. Foe example, args->pathname in case of TRACEPOINT_PROBE(syscalls, sys_enter_unlinkat).
  6. Raw Tracepoints: RAW_TRACEPOINT_PROBE(event).
    For example, RAW_TRACEPOINT_PROBE(sys_enter). As stated before, raw tracepoint uses bpf_raw_tracepoint_args as context and it has args as args[0] -> points to pt_regs structure and args[1] is the syscall number. To access the target functions’ parameters, you can either cast ctx->args[0] to a pointer to a struct pt_regs and use it directly, or copy its contents into a local variable of type struct pt_regs (e.g., struct pt_regs regs;). Then, you can extract the syscall parameters using the PT_REGS_PARM macros (such as PT_REGS_PARM1, PT_REGS_PARM2, etc.).
    // Copy the pt_regs structure from the raw tracepoint args.
    if (bpf_probe_read(&regs, sizeof(regs), (void *)ctx->args[0]) != 0)
        return 0;
        
    // Get the second parameter (pathname) from the registers.
    const char *pathname = (const char *)PT_REGS_PARM2(&regs);
  1. LSM: LSM_PROBE(hook_name,typeof(arg1), typeof(arg1)...). For example, to prevent creating a new directory:
LSM_PROBE(path_mkdir, const struct path *dir, struct dentry *dentry, umode_t mode, int ret)
{
    bpf_trace_printk("LSM path_mkdir: mode=%d, ret=%d\n", mode, ret);
    return -1;
}

Data handling

  1. bpf_probe_read_kernel helper function with the following prototype:
int bpf_probe_read_kernel(void *dst, int size, const void *src)

bpf_probe_read_kernel is used for copying arbitrary data (e.g., structures, buffers) from kernel space and returns 0 on success.

  1. bpf_probe_read_kernel_str helper function with the following prototype:
int bpf_probe_read_kernel_str(void *dst, int size, const void *src)

bpf_probe_read_kernel_str is used for reading null-terminated strings from kernel space and returns the length of the string including the trailing NULL on success.

  1. bpf_probe_read_user helper function with the following prototype:
int bpf_probe_read_user(void *dst, int size, const void *src)

bpf_probe_read_user is used for copying arbitrary data (e.g., structures, buffers) from user space and returns 0 on success.

  1. bpf_probe_read_user_str helper function with the following prototype:
int bpf_probe_read_user_str(void *dst, int size, const void *src)  

bpf_probe_read_user_str is used for reading null-terminated strings from user space and returns the length of the string including the trailing NULL on success.

  1. bpf_ktime_get_ns: returns u64 time elapsed since system boot in nanoseconds.
  2. bpf_get_current_pid_tgid: returns u64 current tgid and pid.
  3. bpf_get_current_uid_gid: returns u64 current pid and gid.
  4. bpf_get_current_comm(void *buf, __u32 size_of_buf): copy current process name into pointerbuf and sizeof at least 16 bytes. For example:
    char comm[TASK_COMM_LEN]; // TASK_COMM_LEN = 16, defined in include/linux/sched.h
    bpf_get_current_comm(&comm, sizeof(comm));
  1. bpf_get_current_task helper function returns current task as a pointer to struct task_struct.

Buffers

  1. BPF_PERF_OUTPUT(name): creates eBPF table to push data out to user-space using perf buffer.
  2. perf_submit: a method of a BPF_PERF_OUTPUT to submit data to user-space. perf_submit has the following prototype: int perf_submit((void *)ctx, (void *)data, u32 data_size).
  3. BPF_RINGBUF_OUTPUT: creates eBPF table to push data out to user-space using ring buffer. It has the following prototype BPF_RINGBUF_OUTPUT(name, page_cnt), page_cnt is number of memory pages for ring buffer size.
  4. ringbuf_output: a method of the BPF_RINGBUF_OUTPUT to submit data to user-space.ringbuf_output has the following prototype: int ringbuf_output((void *)data, u64 data_size, u64 flags).
  5. ringbuf_reserve: a method of the BPF_RINGBUF_OUTPUT to reserve a space in ring buffer and allocate data structure pointer for output data. It has the following prototype: void* ringbuf_reserve(u64 data_size).
  6. ringbuf_submit: a method of the BPF_RINGBUF_OUTPUT to submit data to user-space. ringbuf_submit has the following prototype: void ringbuf_submit((void *)data, u64 flags).

Maps

  1. BPF_HASH: creates hash map. For example, BPF_HASH(my_hash, u64, u64);.
  2. BPF_ARRAY: creates array map.

BCC has also BPF_HISTOGRAM, BPF_STACK_TRACE, BPF_PERF_ARRAY, BPF_PERCPU_HASH, BPF_PERCPU_ARRAY, BPF_LPM_TRIE,BPF_PROG_ARRAY, BPF_CPUMAP, BPF_ARRAY_OF_MAPS and BPF_HASH_OF_MAPS.

Map Operations

  1. *val map.lookup(&key): return a pointer to value if exists.
  2. map.delete(&key): delete a key from map.
  3. map.update(&key, &val): updates value for a given key.
  4. map.insert(&key, &val): inserts a value for a given key.
  5. map.increment(key[, increment_amount]): increments the key by increment_amount.

BCC Python

  1. BPF(text=prog): creates eBPF object.
  2. BPF.attach_kprobe(event="event", fn_name="name"): attach a probe into kernel function event and use name as kprobe handler.
  3. BPF.attach_kretprobe(event="event", fn_name="name"): the same as attach_kprobe.
  4. BPF.attach_tracepoint(tp="tracepoint", fn_name="name"): attach a probe into tracepoint and use name as tracepoint handler.
  5. BPF.attach_uprobe(name="location", sym="symbol", fn_name="name"): attach a probe tolocation with symboluse name as uprobe handler. For example,
b.attach_uprobe(name="/bin/bash", sym="shell_execve", fn_name="bash_exec")

Attach a probe to shell_execve symbol in binary or object file/bin/bashand use bash_exec as a handler.

  1. BPF.attach_uretprobe(name="location", sym="symbol", fn_name="name"): the same as attach_uprobe.
  2. BPF.attach_raw_tracepoint(tp="tracepoint", fn_name="name"): the same as attach_tracepoint.
  3. BPF.attach_xdp(dev="device", fn=b.load_func("fn_name",BPF.XDP), flags): attach XDP to device, use fn_name as handler for each ingress packet. Flags are defined in include/uapi/linux/if_link.h as the following:
#define XDP_FLAGS_UPDATE_IF_NOEXIST	(1U << 0) //-->
#define XDP_FLAGS_SKB_MODE		(1U << 1)
#define XDP_FLAGS_DRV_MODE		(1U << 2)
#define XDP_FLAGS_HW_MODE		(1U << 3)

XDP_FLAGS_UPDATE_IF_NOEXIST (1U « 0): This flag attaches the XDP program if there isn’t already one present. XDP_FLAGS_SKB_MODE (1U « 1): This flag attaches the XDP program in generic mode. XDP_FLAGS_DRV_MODE (1U « 2): This flag attaches the XDP program in native driver mode. XDP_FLAGS_HW_MODE (1U « 3): This flag is used for offloading the XDP program to supported hardware (NICs that support XDP offload).

  1. BPF.remove_xdp("device"): removes XDP program from interface device.
  2. BPF.detach_kprobe(event="event", fn_name="name"): detach a kprobe.
  3. BPF.detach_kretprobe(event="event", fn_name="name"): detach a kretprobe.

Output

  1. BPF.perf_buffer_poll(timeout=T): polls data from perf buffer.
  2. BPF.ring_buffer_poll(timeout=T): polls data from ring buffer.
  3. table.open_perf_buffer(callback, page_cnt=N, lost_cb=None): opens a perf ring buffer for BPF_PERF_OUTPUT.
  4. table.open_ring_buffer(callback, ctx=None): opens a buffer ring for BPF_RINGBUF_OUTPUT.
  5. BPF.trace_print: reads from /sys/kernel/debug/tracing/trace_pipe and prints the contents.

Examples

Let’s look at example The following eBPF kernel code yo attack kprobe to do_mkdirat kernel function.

from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>

#define MAX_FILENAME_LEN 256

int kprobe__do_mkdirat(struct pt_regs *ctx)
{
    pid_t pid = bpf_get_current_pid_tgid() >> 32;
    char fname[MAX_FILENAME_LEN] = {};

    struct filename *name = (struct filename *)PT_REGS_PARM2(ctx);
    const char *name_ptr = 0;

    bpf_probe_read(&name_ptr, sizeof(name_ptr), &name->name);

    bpf_probe_read_str(fname, sizeof(fname), name_ptr);

    umode_t mode = PT_REGS_PARM3(ctx);

    bpf_trace_printk("KPROBE ENTRY pid = %d, filename = %s, mode = %u", pid, fname, mode);
    return 0;
}
"""

b = BPF(text=prog)
print("Tracing mkdir calls... Hit Ctrl-C to exit.")
b.trace_print()

bpf_trace_printk function is similar to bpf_printk macro. First, we create an eBPF object from the C code using b = BPF(text=prog). We don’t have to add attach_kprobe because we followed the naming convention of kprobe which is kprobe__ followed by the name of kernel function name kprobe__do_mkdirat. Run sudo python3 bcc-mkdir.py:

b'  mkdir-1706 [...] KPROBE ENTRY pid = 1706, filename = test1, mode = 511'
b'  mkdir-1708 [...] KPROBE ENTRY pid = 1708, filename = test2, mode = 511'

We don’t have to compile because BCC compiles eBPF C code at runtime and there is no need to write a separate user-space. If you notice, the previous code is exactly similar to the following:

#define __TARGET_ARCH_x86
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/do_mkdirat")
int kprobe_mkdir(struct pt_regs *ctx)
{
    pid_t pid;
    const char *filename;
    umode_t mode;

    pid = bpf_get_current_pid_tgid() >> 32;
    struct filename *name = (struct filename *)PT_REGS_PARM2(ctx);
    filename = BPF_CORE_READ(name, name);
	mode = PT_REGS_PARM3(ctx);
   
    bpf_printk("KPROBE ENTRY pid = %d, filename = %s, mode = %u\n", pid, filename,mode);

    return 0;
}

Let’s explore another example which uses Tracepoints with ring buffer map.

from bcc import BPF

prog = """

struct event {
    u32 pid;
    char comm[16];
    char filename[256];
};

BPF_RINGBUF_OUTPUT(events, 4096);

TRACEPOINT_PROBE(syscalls, sys_enter_unlinkat) {
    struct event *evt = events.ringbuf_reserve(sizeof(*evt));
    if (!evt)
        return 0;

    evt->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(evt->comm, sizeof(evt->comm));
    // In the tracepoint for unlinkat, the second argument (args->pathname) is the filename.
    bpf_probe_read_user_str(evt->filename, sizeof(evt->filename), args->pathname);

    events.ringbuf_submit(evt, 0);
    return 0;
}
"""

def print_event(cpu, data, size):
    event = b["events"].event(data)
    print("PID: %d, COMM: %s, File: %s" %
          (event.pid,
           event.comm.decode('utf-8'),
           event.filename.decode('utf-8')))

b = BPF(text=prog)
b["events"].open_ring_buffer(print_event)

print("Tracing unlinkat syscall... Hit Ctrl-C to end.")
while True:
    try:
        b.ring_buffer_poll(100)
    except KeyboardInterrupt:
        exit()

The print_event callback function to the ring buffer which will be called every time an event is received from the ring buffer. print_event takes cpu number, pointer to raw data of the event data structure defined in eBPF code and size of the event data.

def print_event(cpu, data, size):

Then, event is defined automatically using BCC from eBPF data structure event:

event = b["events"].event(data)

Then, printing out the contents of the event:

print("PID: %d, COMM: %s, File: %s" %
      (event.pid,
      event.comm.decode('utf-8'),
      event.filename.decode('utf-8')))

Then,we create an eBPF object from the C code using b = BPF(text=prog). Then, we opened the ring buffer associated with the map named events with callback function (print_event) to process data that is submitted to the ring buffer.

b["events"].open_ring_buffer(print_event)

We don’t need to use BPF.attach_tracepoint because we followed the naming convention for tracepoints which is TRACEPOINT_PROBE(_category_, _event_). Finally, we start polling from the ring buffer with 100ms timeout.

b.ring_buffer_poll(100)  # 100ms timeout

If you noticed, this code is similar to what we did in tracepoint.

#define __TARGET_ARCH_x86
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

struct event {
    __u32 pid;
    char comm[16];
    char filename[256];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 4096);
} events SEC(".maps");

char _license[] SEC("license") = "GPL";

SEC("tracepoint/syscalls/sys_enter_unlinkat")
int trace_unlinkat(struct trace_event_raw_sys_enter* ctx) {
    struct event *evt;

    evt = bpf_ringbuf_reserve(&events, sizeof(struct event), 0);
    if (!evt)
        return 0;

    evt->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&evt->comm, sizeof(evt->comm));
    bpf_probe_read_user_str(&evt->filename, sizeof(evt->filename), (const char *)ctx->args[1]);

    bpf_ringbuf_submit(evt, 0);

    return 0;
}

BCC Tools

The following tables are from the BCC GitHub repository. These tables contain useful tools for different aspects of system analysis, including general tracing and debugging, memory and process monitoring, performance and timing, CPU and scheduler statistics, network and socket monitoring, as well as storage and filesystems diagnostics. Each category offers a range of tools designed to help you quickly diagnose issues, tune performance, and gather insights into system behavior using BPF-based instrumentation.

General

Name Description
argdist Display function parameter values as a histogram or frequency count.
bashreadline Print entered bash commands system wide.
bpflist Display processes with active BPF programs and maps.
capable Trace security capability checks.
compactsnoop Trace compact zone events with PID and latency.
criticalstat Trace and report long atomic critical sections in the kernel.
deadlock Detect potential deadlocks on a running process.
drsnoop Trace direct reclaim events with PID and latency.
funccount Count kernel function calls.
inject Targeted error injection with call chain and predicates.
klockstat Traces kernel mutex lock events and displays lock statistics.
opensnoop Trace open() syscalls.
readahead Show performance of read-ahead cache.
reset-trace Reset the state of tracing. Maintenance tool only.
stackcount Count kernel function calls and their stack traces.
syncsnoop Trace sync() syscall.
threadsnoop List new thread creation.
tplist Display kernel tracepoints or USDT probes and their formats.
trace Trace arbitrary functions, with filters.
ttysnoop Watch live output from a tty or pts device.
ucalls Summarize method calls or Linux syscalls in high-level languages.
uflow Print a method flow graph in high-level languages.
ugc Trace garbage collection events in high-level languages.
uobjnew Summarize object allocation events by object type and number of bytes allocated.
ustat Collect events such as GCs, thread creations, object allocations, exceptions, etc.
uthreads Trace thread creation events in Java and raw pthreads.

Memory and Process Tools

Name Description
execsnoop Trace new processes via exec() syscalls.
exitsnoop Trace process termination (exit and fatal signals).
killsnoop Trace signals issued by the kill() syscall.
kvmexit Display the exit_reason and its statistics of each vm exit.
memleak Display outstanding memory allocations to find memory leaks.
numasched Track the migration of processes between NUMAs.
oomkill Trace the out-of-memory (OOM) killer.
pidpersec Count new processes (via fork).
rdmaucma Trace RDMA Userspace Connection Manager Access events.
shmsnoop Trace System V shared memory syscalls.
slabratetop Kernel SLAB/SLUB memory cache allocation rate top.

Performance and Time Tools

Name Description
dbslower Trace MySQL/PostgreSQL queries slower than a threshold.
dbstat Summarize MySQL/PostgreSQL query latency as a histogram.
funcinterval Time interval between the same function as a histogram.
funclatency Time functions and show their latency distribution.
funcslower Trace slow kernel or user function calls.
hardirqs Measure hard IRQ (hard interrupt) event time.
mysqld_qslower Trace MySQL server queries slower than a threshold.
ppchcalls Summarize ppc hcall counts and latencies.
softirqs Measure soft IRQ (soft interrupt) event time.
syscount Summarize syscall counts and latencies.

CPU and Scheduler Tools

Name Description
cpudist Summarize on- and off-CPU time per task as a histogram.
cpuunclaimed Sample CPU run queues and calculate unclaimed idle CPU.
llcstat Summarize CPU cache references and misses by process.
offcputime Summarize off-CPU time by kernel stack trace.
offwaketime Summarize blocked time by kernel off-CPU stack and waker stack.
profile Profile CPU usage by sampling stack traces at a timed interval.
runqlat Run queue (scheduler) latency as a histogram.
runqlen Run queue length as a histogram.
runqslower Trace long process scheduling delays.
wakeuptime Summarize sleep-to-wakeup time by waker kernel stack.
wqlat Summarize work waiting latency on workqueue.

Network and Sockets Tools

Name Description
gethostlatency Show latency for getaddrinfo/gethostbyname[2] calls.
bindsnoop Trace IPv4 and IPv6 bind() system calls (bind()).
netqtop Trace and display packets distribution on NIC queues.
sofdsnoop Trace FDs passed through unix sockets.
solisten Trace TCP socket listen.
sslsniff Sniff OpenSSL written and readed data.
tcpaccept Trace TCP passive connections (accept()).
tcpconnect Trace TCP active connections (connect()).
tcpconnlat Trace TCP active connection latency (connect()).
tcpdrop Trace kernel-based TCP packet drops with details.
tcplife Trace TCP sessions and summarize lifespan.
tcpretrans Trace TCP retransmits and TLPs.
tcprtt Trace TCP round trip time.
tcpstates Trace TCP session state changes with durations.
tcpsubnet Summarize and aggregate TCP send by subnet.
tcpsynbl Show TCP SYN backlog.
tcptop Summarize TCP send/recv throughput by host. Top for TCP.
tcptracer Trace TCP established connections (connect(), accept(), close()).
tcpcong Trace TCP socket congestion control status duration.

Storage and Filesystems Tools

Name Description
bitesize Show per process I/O size histogram.
cachestat Trace page cache hit/miss ratio.
cachetop Trace page cache hit/miss ratio by processes.
dcsnoop Trace directory entry cache (dcache) lookups.
dcstat Directory entry cache (dcache) stats.
biolatency Summarize block device I/O latency as a histogram.
biotop Top for disks: Summarize block device I/O by process.
biopattern Identify random/sequential disk access patterns.
biosnoop Trace block device I/O with PID and latency.
dirtop File reads and writes by directory. Top for directories.
filelife Trace the lifespan of short-lived files.
filegone Trace why file gone (deleted or renamed).
fileslower Trace slow synchronous file reads and writes.
filetop File reads and writes by filename and process. Top for files.
mdflush Trace md flush events.
mountsnoop Trace mount and umount syscalls system-wide.
virtiostat Show VIRTIO device IO statistics.

Filesystems Tools

Name Description
btrfsdist Summarize btrfs operation latency distribution as a histogram.
btrfsslower Trace slow btrfs operations.
ext4dist Summarize ext4 operation latency distribution as a histogram.
ext4slower Trace slow ext4 operations.
nfsslower Trace slow NFS operations.
nfsdist Summarize NFS operation latency distribution as a histogram.
vfscount Count VFS calls.
vfsstat Count some VFS calls, with column output.
xfsdist Summarize XFS operation latency distribution as a histogram.
xfsslower Trace slow XFS operations.
zfsdist Summarize ZFS operation latency distribution as a histogram.
zfsslower Trace slow ZFS operations.

3 - BPFTool

The official command-line utility shipped with the kernel source that loads pins inspects and benchmarks eBPF programs and maps from user space.

BPFTool is a command-line utility for interacting with eBPF programs and maps in the Linux kernel. It provides a comprehensive set of commands for loading, inspecting, and debugging eBPF objects. With bpftool, users can load compiled eBPF programs into the kernel, attach them to various kernel events or network interfaces, and manage eBPF maps used for storing and sharing data between kernel and user space. It also offers powerful introspection capabilities, allowing users to examine the state of running eBPF programs, including their maps, attached probes, and verifier logs, making it an indispensable tool for eBPF developers and system administrators working with modern Linux observability and networking.

We used many times to generate vmlinux.h header file and the skeleton header files:

sudo bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
sudo bpftool gen skeleton obj_file.o > obj_file.skel.h

Using sudo bpftool -h to show the program’s options:

Usage: bpftool [OPTIONS] OBJECT { COMMAND | help }
       bpftool batch file FILE
       bpftool version

       OBJECT := { prog | map | link | cgroup | perf | net | feature | btf | gen | struct_ops | iter }
       OPTIONS := { {-j|--json} [{-p|--pretty}] | {-d|--debug} |
                    {-V|--version} }

Let’s explore some of these options. First, let’s try the following simple code which count how many times getpid syscall get invoked:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, int);
    __type(value, int);
} count SEC(".maps");

char _license[] SEC("license") = "GPL";

SEC("tracepoint/syscalls/sys_enter_getpid")
int count_getpid(void *ctx)
{
    int key = 0;
    int *value;

    value = bpf_map_lookup_elem(&count, &key);
    if (value)
        (*value)++;
    return 0;
}

User-space code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <bpf/libbpf.h>
#include "getpid_count.skel.h"

int main(int argc, char **argv)
{
    struct getpid_count *skel;
    int err;

    skel = getpid_count__open();
    if (!skel) {
        fprintf(stderr, "Failed to open BPF skeleton\n");
        return 1;
    }

    err = getpid_count__load(skel);
    if (err) {
        fprintf(stderr, "Failed to load BPF skeleton: %d\n", err);
        goto cleanup;
    }

    err = getpid_count__attach(skel);
    if (err) {
        fprintf(stderr, "Failed to attach BPF skeleton: %d\n", err);
        goto cleanup;
    }

    int map_fd = bpf_map__fd(skel->maps.count);
    if (map_fd < 0) {
        fprintf(stderr, "Failed to get map FD\n");
        goto cleanup;
    }

    printf("BPF program loaded and map updated. Press Ctrl+C to exit.\n");

    while (1) {
        sleep(1);
        int lookup_key = 0;
        int count = 0;
        err = bpf_map__lookup_elem(skel->maps.count,
                                   &lookup_key, sizeof(lookup_key),
                                   &count, sizeof(count), 0);
        if (err == 0) {
            printf("getpid call count: %d\n", count);
        } else {
            fprintf(stderr, "Lookup failed for key %d: %d\n", lookup_key, err);
        }
    }
cleanup:
    getpid_count__destroy(skel);
    return 0;
}

Before compiling and running the code. Run sudo bpftool prog to list all running eBPF programs:

[...]
42: cgroup_device  name sd_devices  tag 2705a24f44b96941  gpl
	loaded_at 2025-03-17T23:38:12-0400  uid 0
	xlated 464B  jited 301B  memlock 4096B
43: cgroup_skb  name sd_fw_egress  tag 6deef7357e7b4530  gpl
	loaded_at 2025-03-17T23:38:12-0400  uid 0
	xlated 64B  jited 67B  memlock 4096B
44: cgroup_skb  name sd_fw_ingress  tag 6deef7357e7b4530  gpl
	loaded_at 2025-03-17T23:38:12-0400  uid 0
	xlated 64B  jited 67B  memlock 4096B
46: cgroup_device  name sd_devices  tag 30c3c39a95291292  gpl
	loaded_at 2025-03-18T00:27:21-0400  uid 0
	xlated 1664B  jited 1027B  memlock 4096B

You will see a list of cgroup programs running by the system. After compiling and running the previous code and then list all running eBPF code sudo bpftool prog:

46: cgroup_device  name sd_devices  tag 30c3c39a95291292  gpl
	loaded_at 2025-03-18T00:27:21-0400  uid 0
	xlated 1664B  jited 1027B  memlock 4096B
312: tracepoint  name count_getpid  tag be075f8b6a94de72  gpl
	loaded_at 2025-03-18T05:38:34-0400  uid 0
	xlated 152B  jited 99B  memlock 4096B  map_ids 147
	btf_id 511

Also --pretty option can be used sudo bpftool prog --pretty to display the output in prettified JSON.

[...]
        "id": 312,
        "type": "tracepoint",
        "name": "count_getpid",
        "tag": "be075f8b6a94de72",
        "gpl_compatible": true,
        "loaded_at": 1742290714,
        "uid": 0,
        "orphaned": false,
        "bytes_xlated": 152,
        "jited": true,
        "bytes_jited": 99,
        "bytes_memlock": 4096,
        "map_ids": [147],
        "btf_id": 511
[...]

Our code has the following properties:
id = 312
Program Type = tracepoint which represents the type of eBPF program
name = count_getpid which is the name of the function defined in the code
tag = be075f8b6a94de72 which is a hash of the compiled instructions.
gpl which is the liscence
loaded_at = 2025-03-18T05:38:34-0400 timestamp when the program was loaded
uid = 0 which indicated that loaded by root
xlated = 152B represents the size of eBPF bytecode
jited = 99B represents the size of the machine code
memlock = 4096B represents the size resrved for this program
map_ids = 147 which is the id of the map loaded in this program
btf_id = 511 which is a unique identifier that the kernel assigns to that block of BTF metadata and can inspect its details using sudo bpftool btf show id 511

BPFTool can dump eBPF bytecode using sudo bpftool prog dump xlated id 312 :

int count_getpid(void * ctx):
; int count_getpid(void *ctx)
   0: (b7) r1 = 0
; int key = 0;
   1: (63) *(u32 *)(r10 -4) = r1
   2: (bf) r2 = r10
   3: (07) r2 += -4
; value = bpf_map_lookup_elem(&count, &key);
   4: (18) r1 = map[id:147]
   6: (07) r1 += 264
   7: (61) r0 = *(u32 *)(r2 +0)
   8: (35) if r0 >= 0x1 goto pc+3
   9: (67) r0 <<= 3
  10: (0f) r0 += r1
  11: (05) goto pc+1
  12: (b7) r0 = 0
; if (value)
  13: (15) if r0 == 0x0 goto pc+3
; (*value)++;
  14: (61) r1 = *(u32 *)(r0 +0)
  15: (07) r1 += 1
  16: (63) *(u32 *)(r0 +0) = r1
; return 0;
  17: (b7) r0 = 0
  18: (95) exit

We can get visual representation of our code instructions using sudo bpftool prog dump xlated id 312 visual &> vis.out and the output vis.out is a DOT language file which is a graph description language and can be viewed Graphviz. It can be converted to PNG using dot -Tpng viz.out -o viz.png and you can display viz.png file.

Centered image

BPFTool can also dump jited or the program machine code using sudo bpftool prog dump jited id 312:

int count_getpid(void * ctx):
0xffffffffc03735d4:
; int count_getpid(void *ctx)
   0:	nopl	(%rax,%rax)
   5:	nopl	(%rax)
   8:	pushq	%rbp
   9:	movq	%rsp, %rbp
   c:	subq	$8, %rsp
  13:	xorl	%edi, %edi
; int key = 0;
  15:	movl	%edi, -4(%rbp)
  18:	movq	%rbp, %rsi
  1b:	addq	$-4, %rsi
; value = bpf_map_lookup_elem(&count, &key);
  1f:	movabsq	$-121297335560704, %rdi
  29:	addq	$272, %rdi
  30:	movl	(%rsi), %eax
  33:	cmpq	$1, %rax
  37:	jae	0xffffffffc0373616
  39:	shlq	$3, %rax
  3d:	addq	%rdi, %rax
  40:	jmp	0xffffffffc0373618
  42:	xorl	%eax, %eax
; if (value)
  44:	testq	%rax, %rax
  47:	je	0xffffffffc0373627
; (*value)++;
  49:	movl	(%rax), %edi
  4c:	addq	$1, %rdi
  50:	movl	%edi, (%rax)
; return 0;
  53:	xorl	%eax, %eax
  55:	leave
  56:	jmp	0xffffffff85f0410b
getpid.o:	file format elf64-bpf

Disassembly of section tracepoint/syscalls/sys_enter_getpid:

0000000000000000 <count_getpid>:
; {
       0:	b7 01 00 00 00 00 00 00	r1 = 0x0
;     int key = 0;
       1:	63 1a fc ff 00 00 00 00	*(u32 *)(r10 - 0x4) = r1
       2:	bf a2 00 00 00 00 00 00	r2 = r10
       3:	07 02 00 00 fc ff ff ff	r2 += -0x4
;     value = bpf_map_lookup_elem(&count, &key);
       4:	18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00	r1 = 0x0 ll
       6:	85 00 00 00 01 00 00 00	call 0x1
;     if (value)
       7:	15 00 03 00 00 00 00 00	if r0 == 0x0 goto +0x3 <count_getpid+0x58>
;         (*value)++;
       8:	61 01 00 00 00 00 00 00	r1 = *(u32 *)(r0 + 0x0)
       9:	07 01 00 00 01 00 00 00	r1 += 0x1
      10:	63 10 00 00 00 00 00 00	*(u32 *)(r0 + 0x0) = r1
;     return 0;
      11:	b7 00 00 00 00 00 00 00	r0 = 0x0
      12:	95 00 00 00 00 00 00 00	exit

You can find a list of eBPF opcodes from kernel documentation
https://docs.kernel.org/bpf/standardization/instruction-set.html or RFC 9669 https://www.rfc-editor.org/rfc/rfc9669.html.

BPFTool can display the list of all maps using sudo bpftool map

11: hash_of_maps  name cgroup_hash  flags 0x0
	key 8B  value 4B  max_entries 2048  memlock 165152B
147: array  name count  flags 0x0
	key 4B  value 4B  max_entries 1  memlock 272B
	btf_id 511

We can also inspect map content with id 147 using sudo bpftool map dump id 147

[{
        "key": 0,
        "value": 22 # count of how many times `getpid` syscall get invoked
    }
]

Maps can be updated using bpftool. For example, let’s change the value to 90 sudo bpftool map update id 147 key 00 00 00 00 value 90 00 00 00 and inspect the content again sudo bpftool map dump id 147 you might see:

[{
        "key": 0,
        "value": 98
    }
]

The extra 8 indicates that between your update and the dump, the getpid syscall was triggered 8 times, and since your eBPF program increments the value each time getpid is called, the counter increased from 90 to 98.

Maps also can be pinned to eBPF filesystem using sudo bpftool map pin id 147 /sys/fs/bpf/getpid_map and even after termination of our program we can dump the content of the pinned map using sudo bpftool map dump pinned /sys/fs/bpf/getpid_map

[{
        "key": 0,
        "value": 122
    }
]

We can unpin simply by remove the created file sudo rm /sys/fs/bpf/getpid_map. BPFTool can also load programs. Let’s close our program and load it again using: sudo bpftool prog loadall getpid.o /sys/fs/bpf/test autoattach and then run sudo bpftool prog to make sure that the program is loaded:

741: tracepoint  name count_getpid  tag be075f8b6a94de72  gpl
	loaded_at 2025-03-18T07:03:14-0400  uid 0
	xlated 152B  jited 99B  memlock 4096B  map_ids 85
	btf_id 189

autoattach option is to load, attach and pin kprobe, kretprobe, uprobe, uretprobe and tracepoints in a single command.

We can again dump the content of the program map sudo bpftool map dump id 85

[{
        "key": 0,
        "value": 24
    }
]

BPFTool can also load and attach another types of eBPF programs such as XDP. Let’s see the following code of XDP which drops ingress traffic to port 8080:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#include <linux/in.h>

char _license[] SEC("license") = "GPL";

SEC("xdp")
int drop_ingress_port_8080(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)eth + sizeof(*eth) > data_end)
        return XDP_PASS;

    if (bpf_ntohs(eth->h_proto) != ETH_P_IP)
        return XDP_PASS;

    struct iphdr *iph = data + sizeof(*eth);
    if ((void *)iph + sizeof(*iph) > data_end)
        return XDP_PASS;

    if (iph->protocol != IPPROTO_TCP)
        return XDP_PASS;

    int ip_header_length = iph->ihl * 4;

    struct tcphdr *tcph = data + sizeof(*eth) + ip_header_length;
    if ((void *)tcph + sizeof(*tcph) > data_end)
        return XDP_PASS;

    if (bpf_ntohs(tcph->dest) == 8080) {
        bpf_printk("Dropping XDP egress packet port 8080\n");
        return XDP_DROP;
    }
    return XDP_PASS;
}

We can load this program using sudo bpftool prog load xdp_drop8080.o /sys/fs/bpf/xdp type xdp. Then running sudo bpftool prog list pinned /sys/fs/bpf/xdp to see if the program is loaded successfully:

835: xdp  name drop_egress_port_8080  tag 7c15f4a6de3ceb0f  gpl
	loaded_at 2025-03-18T07:24:28-0400  uid 0
	xlated 248B  jited 164B  memlock 4096B  map_ids 122
	btf_id 310

Then we can attach this program to an interface sudo bpftool net attach xdp id 835 dev enp1s0. To confirm program is attached use sudo bpftool net list:

xdp:
enp1s0(2) driver id 835

tc:

flow_dissector:

netfilter:

Or by viewing ip a:

2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:835 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:f6:fe:bc brd ff:ff:ff:ff:ff:ff
    altname enx525400f6febc
    inet 192.168.1.238/24 brd 192.168.1.255 scope global dynamic noprefixroute enp1s0
       valid_lft 3543sec preferred_lft 3543sec
    inet6 fe80::5054:ff:fef6:febc/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

We can see xdp/id:835 which confirms program with id 835 of type XDP is attached to enp1s0 interface. Use sudo bpftool net detach xdp dev enp1s0 to detach the XDP program. Unloading eBPF program can be done by removing the pinned pseudofile. For example, sudo rm /sys/fs/bpf/xdp to unload the XDP program. We can also use bpftool to load and attached TC programs with tcx_egress for egress traffic and tcx_ingress for ingress traffic. For example, sudo bpftool net attach tcx_egress id 423 dev enp1s0.

4 - Tail call

One eBPF program can jump straight into another without returning which lets you chain logic and sidestep the per-program instruction limit while adding almost no overhead.

Tail calls in eBPF let you chain together multiple BPF programs, effectively bypassing the instruction limit imposed on individual programs which is around 4096 instructions before kernel 5.2 and now the limit is one million instructions. Tail calls can also be used to break up the code logic into multiple parts to enable modular design. Tail call transfers control from one eBPF program to another without returning to the caller. 1. The verifier ensures that tail calls do not lead to unbounded recursion and that the jump is safe. It also reduces the effective stack size available (e.g., from 512 bytes to 256 bytes) when tail calls are used with BPF-to-BPF function calls.

How Tail Calls Work

  1. eBPF uses a special map type called BPF_MAP_TYPE_PROG_ARRAY that holds file descriptors of other BPF programs. The tail call uses an index (or key) into this map to know which program to jump to.
  2. Within an eBPF program, you can use the helper function bpf_tail_call(ctx, prog_array, key) to transfer execution to another program. If the call is successful, the current program’s execution ends and the new program starts from its entry point.

Let’s explore an example of XDP code. The idea is to have a main XDP program that inspects incoming packets and, based on the TCP destination port, uses a tail call to jump to the appropriate program—one for port 8080 and one for port 22.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 2);
    __uint(key_size, sizeof(int));
    __uint(value_size, sizeof(int));
} prog_array SEC(".maps");

char _license[] SEC("license") = "GPL";

SEC("xdp")
int main_prog(struct xdp_md *ctx)
{
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *iph;
    struct tcphdr *tcph;

    if (data + sizeof(*eth) > data_end)
        return XDP_ABORTED;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    iph = data + sizeof(*eth);
    if ((void *)iph + sizeof(*iph) > data_end)
        return XDP_ABORTED;

    if (iph->protocol != IPPROTO_TCP)
        return XDP_PASS;

    tcph = (void *)iph + (iph->ihl * 4);
    if ((void *)tcph + sizeof(*tcph) > data_end)
        return XDP_ABORTED;

    int dport = bpf_ntohs(tcph->dest);

    if (dport == 8080) {
        int key = 0;
        bpf_tail_call(ctx, &prog_array, key);
    } else if (dport == 22) {
        int key = 1;
        bpf_tail_call(ctx, &prog_array, key);
    }

    return XDP_PASS;
}

SEC("xdp")
int port8080_prog(struct xdp_md *ctx)
{
    bpf_printk("Packet to port 8080 processed\n");
    return XDP_PASS;
}

SEC("xdp")
int port22_prog(struct xdp_md *ctx)
{
    bpf_printk("Packet to port 22 processed\n");
    return XDP_PASS;
}

User-space code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <net/if.h>
#include <bpf/libbpf.h>
#include "tail_call.skel.h"

#ifndef libbpf_is_err
#define libbpf_is_err(ptr) ((unsigned long)(ptr) > (unsigned long)-1000L)
#endif

int main(int argc, char **argv)
{
    struct tail_call *skel;
    struct bpf_link *link = NULL;
    int err, key, prog_fd;
    int ifindex;
    const char *ifname;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <interface>\n", argv[0]);
        return 1;
    }
    ifname = argv[1];

    skel = tail_call__open();
    if (!skel) {
        fprintf(stderr, "Failed to open BPF skeleton\n");
        return 1;
    }

    err = tail_call__load(skel);
    if (err) {
        fprintf(stderr, "Failed to load BPF skeleton: %d\n", err);
        goto cleanup;
    }

    key = 0;
    prog_fd = bpf_program__fd(skel->progs.port8080_prog);
    if (prog_fd < 0) {
        fprintf(stderr, "Invalid FD for port8080_prog\n");
        goto cleanup;
    }
    err = bpf_map__update_elem(skel->maps.prog_array,
                               &key, sizeof(key),
                               &prog_fd, sizeof(prog_fd),
                               0);
    if (err) {
        fprintf(stderr, "Failed to update prog_array for port8080_prog: %d\n", err);
        goto cleanup;
    }

    key = 1;
    prog_fd = bpf_program__fd(skel->progs.port22_prog);
    if (prog_fd < 0) {
        fprintf(stderr, "Invalid FD for port22_prog\n");
        goto cleanup;
    }
    err = bpf_map__update_elem(skel->maps.prog_array,
                               &key, sizeof(key),
                               &prog_fd, sizeof(prog_fd),
                               0);
    if (err) {
        fprintf(stderr, "Failed to update prog_array for port22_prog: %d\n", err);
        goto cleanup;
    }

    ifindex = if_nametoindex(ifname);
    if (!ifindex) {
        perror("if_nametoindex");
        goto cleanup;
    }

    link = bpf_program__attach_xdp(skel->progs.main_prog, ifindex);
    if (libbpf_is_err(link)) {
        err = libbpf_get_error(link);
        fprintf(stderr, "Failed to attach XDP program on %s (ifindex: %d): %d\n",
                ifname, ifindex, err);
        link = NULL;
        goto cleanup;
    }

    printf("XDP program loaded and tail calls configured on interface %s (ifindex: %d).\n",
           ifname, ifindex);
    printf("Press Ctrl+C to exit...\n");

    while (1)
        sleep(1);

cleanup:
    if (link)
        bpf_link__destroy(link);
    tail_call__destroy(skel);
    return err < 0 ? -err : 0;
}

Centered image

Compile the code

clang -g -O2 -target bpf -c tail_call.c -o tail_call.o
sudo bpftool gen skeleton tail_call.o > tail_call.skel.h
clang -o loader loader.c -lbpf

Run the loader using sudo ./loader enp1s0 . Open trace buffer sudo cat /sys/kernel/debug/tracing/trace_pipe:

<idle>-0             [003] ..s2.  8853.538349: bpf_trace_printk: Packet to port 8080 processed
<idle>-0             [003] .Ns2.  8853.539270: bpf_trace_printk: Packet to port 8080 processed
<idle>-0             [003] .Ns2.  8853.539279: bpf_trace_printk: Packet to port 8080 processed
gnome-shell-2539     [003] ..s1.  8853.541321: bpf_trace_printk: Packet to port 8080 processed
gnome-shell-2539     [003] ..s1.  8853.541334: bpf_trace_printk: Packet to port 8080 processed
<idle>-0             [003] ..s2.  8860.777125: bpf_trace_printk: Packet to port 22 processed
<idle>-0             [003] ..s2.  8860.777420: bpf_trace_printk: Packet to port 22 processed
<idle>-0             [003] ..s2.  8860.777611: bpf_trace_printk: Packet to port 22 processed
llvmpipe-1-2569      [003] ..s1.  8860.783551: bpf_trace_printk: Packet to port 22 processed