bpftrace
bpftrace is a high-level tracing language for Linux that simplifies writing eBPF (extended Berkeley Packet Filter) programs. Its syntax, inspired by awk, C, and earlier tracing tools, lets you attach probes to kernel functions, tracepoints, and user-space events in a few lines and quickly gain insight into system behavior. By abstracting away the complexities of low-level eBPF programming and leveraging libbpf as its backend, bpftrace lets system administrators, performance engineers, and developers observe and analyze system performance without extensive eBPF expertise. Let’s start by looking at the bpftrace command.
bpftrace Options
When running bpftrace, you can use various command-line options to control its behavior. Some commonly used options include:
OPTIONS:
-B MODE output buffering mode ('full', 'none')
-f FORMAT output format ('text', 'json')
-o file redirect bpftrace output to file
-e 'program' execute this program
-h, --help show this help message
-I DIR add the directory to the include search path
--include FILE add an #include file before preprocessing
-l [search|filename]
list kernel probes or probes in a program
-p PID enable USDT probes on PID
-c 'CMD' run CMD and enable USDT probes on resulting process
--usdt-file-activation
activate usdt semaphores based on file path
--unsafe allow unsafe/destructive functionality
-q keep messages quiet
--info Print information about kernel BPF support
-k emit a warning when a bpf helper returns an error (except read functions)
-kk check all bpf helper functions
-V, --version bpftrace version
--no-warnings disable all warning messages
For example, we can use -l together with the * wildcard to list all kprobes:
sudo bpftrace -l 'kprobe:*'
[...]
kprobe:zswap_store
kprobe:zswap_swapoff
kprobe:zswap_swapon
kprobe:zswap_total_pages
kprobe:zswap_writeback_entry
kprobe:zswap_writeback_show
kprobe:zswap_writeback_write
kprobe:zswap_zpool_param_set
We can list the parameters of a probe for a given function by adding -v:
sudo bpftrace -lv 'fentry:tcp_reset'
fentry:vmlinux:tcp_reset
struct sock * sk
struct sk_buff * skb
We can also list all symbols in an object or binary file that can be used as uprobes:
sudo bpftrace -l 'uprobe:/bin/bash:*'
[...]
uprobe:/bin/bash:async_redirect_stdin
uprobe:/bin/bash:base_pathname
uprobe:/bin/bash:bash_add_history
uprobe:/bin/bash:bash_brace_completion
uprobe:/bin/bash:bash_clear_history
uprobe:/bin/bash:bash_default_completion
uprobe:/bin/bash:bash_delete_histent
uprobe:/bin/bash:bash_delete_history_range
uprobe:/bin/bash:bash_delete_last_history
uprobe:/bin/bash:bash_dequote_text
[...]
The -e option executes a program given as a one-liner. For example:
sudo bpftrace -e 'uprobe:/bin/bash:shell_execve { printf("shell_execve called\n"); }'
Attaching 1 probe...
shell_execve called
shell_execve called
shell_execve called
In the program uprobe:/bin/bash:shell_execve { printf("shell_execve called\n"); }, the action printf("shell_execve called\n"); is executed whenever the probe uprobe:/bin/bash:shell_execve is triggered.
To print which command is being executed, we can print the first argument, arg0, using the str function, which reads a NULL-terminated string, similar to the bpf_probe_read_str helper function. argN is a bpftrace builtin that holds the arguments passed to the function being traced; it can be used with kprobes and uprobes.
sudo bpftrace -e 'uprobe:/bin/bash:shell_execve { printf("command:%s\n", str(arg0)); }'
Attaching 1 probe...
command:/usr/bin/ls
command:/usr/bin/ping
command:/usr/bin/cat
The following table is from the bpftrace manual, listing special variables along with their corresponding helper functions and descriptions.
Variable | Type | Kernel | BPF Helper | Description |
---|---|---|---|---|
$1, $2, ...$n | int64 | n/a | n/a | The nth positional parameter passed to the bpftrace program. If less than n parameters are passed this evaluates to 0. For string arguments use the str() call to retrieve the value. |
$# | int64 | n/a | n/a | Total amount of positional parameters passed. |
arg0, arg1, ...argn | int64 | n/a | n/a | nth argument passed to the function being traced. These are extracted from the CPU registers. The amount of args passed in registers depends on the CPU architecture. (kprobes, uprobes, usdt). |
args | struct args | n/a | n/a | The struct of all arguments of the traced function. Available in tracepoint, fentry, fexit, and uprobe (with DWARF) probes. Use args.x to access argument x or args to get a record with all arguments. |
cgroup | uint64 | 4.18 | get_current_cgroup_id | ID of the cgroup the current process belongs to. Only works with cgroupv2. |
comm | string[16] | 4.2 | get_current_comm | Name of the current thread. |
cpid | uint32 | n/a | n/a | Child process ID, if bpftrace is invoked with -c. |
cpu | uint32 | 4.1 | raw_smp_processor_id | ID of the processor executing the BPF program. |
curtask | uint64 | 4.8 | get_current_task | Pointer to struct task_struct of the current task. |
elapsed | uint64 | (see nsecs) | ktime_get_ns / ktime_get_boot_ns | Nanoseconds elapsed since bpftrace initialization, based on nsecs. |
func | string | n/a | n/a | Name of the current function being traced (kprobes, uprobes). |
gid | uint64 | 4.2 | get_current_uid_gid | Group ID of the current thread, as seen from the init namespace. |
jiffies | uint64 | 5.9 | get_jiffies_64 | Jiffies of the kernel. In 32-bit systems, using this builtin might be slower. |
numaid | uint32 | 5.8 | numa_node_id | ID of the NUMA node executing the BPF program. |
pid | uint32 | 4.2 | get_current_pid_tgid | Process ID of the current thread (aka thread group ID), as seen from the init namespace. |
probe | string | n/a | n/a | Name of the current probe. |
rand | uint32 | 4.1 | get_prandom_u32 | Random number. |
return | n/a | n/a | n/a | The return keyword is used to exit the current probe. This differs from exit() in that it doesn’t exit bpftrace. |
retval | uint64 | n/a | n/a | Value returned by the function being traced (kretprobe, uretprobe, fexit). For kretprobe and uretprobe, its type is uint64, but for fexit it depends. You can look up the type using bpftrace -lv. |
tid | uint32 | 4.2 | get_current_pid_tgid | Thread ID of the current thread, as seen from the init namespace. |
uid | uint64 | 4.2 | get_current_uid_gid | User ID of the current thread, as seen from the init namespace. |
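As a rough illustration of how several of these builtins combine in one action, the sketch below prints comm, pid, tid, uid, and cpu each time a process enters the openat() syscall. The choice of the sys_enter_openat tracepoint and its filename field is an assumption made for this example, not something taken from the manual table.

```bpftrace
#!/usr/bin/env bpftrace
// Sketch: print a handful of builtins on every openat() entry.
// Assumption: the sys_enter_openat tracepoint exposes a "filename" field
// (verify with: bpftrace -lv 'tracepoint:syscalls:sys_enter_openat').
tracepoint:syscalls:sys_enter_openat
{
    printf("%-16s pid=%d tid=%d uid=%d cpu=%d file=%s\n",
        comm, pid, tid, uid, cpu, str(args.filename));
}
```

On a busy system this is noisy, so it is best run briefly and stopped with Ctrl-C.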
The following table is from the bpftrace manual, listing bpftrace functions along with their corresponding descriptions.
Name | Description |
---|---|
bswap | Reverse byte order |
buf | Returns a hex-formatted string of the data pointed to by d |
cat | Print file content |
cgroupid | Resolve cgroup ID |
cgroup_path | Convert cgroup id to cgroup path |
exit | Quit bpftrace with an optional exit code |
join | Print the array |
kaddr | Resolve kernel symbol name |
kptr | Annotate as kernelspace pointer |
kstack | Kernel stack trace |
ksym | Resolve kernel address |
len | Count ustack/kstack frames |
macaddr | Convert MAC address data |
nsecs | Timestamps and Time Deltas |
ntop | Convert IP address data to text |
offsetof | Offset of element in structure |
override | Override return value |
path | Return full path |
percpu_kaddr | Resolve percpu kernel symbol name |
print | Print a non-map value with default formatting |
printf | Print formatted |
pton | Convert text IP address to byte array |
reg | Returns the value stored in the named register |
signal | Send a signal to the current process |
sizeof | Return size of a type or expression |
skboutput | Write skb ’s data section into a PCAP file |
str | Returns the string pointed to by s |
strcontains | Compares whether the string haystack contains the string needle. |
strerror | Get error message for errno code |
strftime | Return a formatted timestamp |
strncmp | Compare first n characters of two strings |
system | Execute shell command |
time | Print formatted time |
uaddr | Resolve user-level symbol name |
uptr | Annotate as userspace pointer |
ustack | User stack trace |
usym | Resolve user space address |
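To make a few of these functions concrete, here is a minimal sketch that combines time, printf, and str with the retval builtin to timestamp each line returned by bash's readline function. It assumes that /bin/bash exposes a readline symbol, which can be checked with bpftrace -l 'uprobe:/bin/bash:readline'; on systems where bash links readline differently, the probe target would need adjusting.

```bpftrace
#!/usr/bin/env bpftrace
// Sketch: print a timestamp, the PID, and the line bash just read.
// Assumption: /bin/bash contains a "readline" symbol on this system.
uretprobe:/bin/bash:readline
{
    time("%H:%M:%S  ");                    // print wall-clock time, no newline
    printf("%-6d %s\n", pid, str(retval)); // retval points to the entered line
}
```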
How to Code in bpftrace
bpftrace scripts are written using a custom domain-specific language (DSL) that is similar in syntax to awk. A basic script consists of one or more probe definitions followed by one or more actions. Each probe targets a specific event (e.g., kernel tracepoints, function entry/exit, or user-space events).
The following table is from the bpftrace manual, listing bpftrace probes along with their corresponding descriptions.
Probe Name | Short Name | Description | Kernel/User Level |
---|---|---|---|
BEGIN/END | - | Built-in events | Kernel/User |
self | - | Built-in events | Kernel/User |
hardware | h | Processor-level events | Kernel |
interval | i | Timed output | Kernel/User |
iter | it | Iterators tracing | Kernel |
fentry/fexit | f/fr | Kernel functions tracing with BTF support | Kernel |
kprobe/kretprobe | k/kr | Kernel function start/return | Kernel |
profile | p | Timed sampling | Kernel/User |
rawtracepoint | rt | Kernel static tracepoints with raw arguments | Kernel |
software | s | Kernel software events | Kernel |
tracepoint | t | Kernel static tracepoints | Kernel |
uprobe/uretprobe | u/ur | User-level function start/return | User |
usdt | U | User-level static tracepoints | User |
watchpoint/asyncwatchpoint | w/aw | Memory watchpoints | Kernel |
Basic Structure of a bpftrace Script
probe_type:probe_identifier
{
    // Action code block
    printf("Hello, world!\n");
}
For example, to print a message every time a process calls the unlinkat() syscall, you might write:
#!/usr/bin/env bpftrace
tracepoint:syscalls:sys_enter_unlinkat
{
    printf("unlinkat syscall invoked\n");
}
The arguments of the sys_enter_unlinkat tracepoint can be listed from /sys/kernel/debug/tracing/events/syscalls/sys_enter_unlinkat/format:
name: sys_enter_unlinkat
ID: 849
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int __syscall_nr; offset:8; size:4; signed:1;
field:int dfd; offset:16; size:8; signed:0;
field:const char * pathname; offset:24; size:8; signed:0;
field:int flag; offset:32; size:8; signed:0;
print fmt: "dfd: 0x%08lx, pathname: 0x%08lx, flag: 0x%08lx", ((unsigned long)(REC->dfd)), ((unsigned long)(REC->pathname)), ((unsigned long)(REC->flag))
Therefore, we can use str(args.pathname) to extract the name of the file being deleted. args is a bpftrace builtin: a struct holding all arguments of the traced function. It can be used with tracepoint, fentry, and fexit probes.
#!/usr/bin/env bpftrace
tracepoint:syscalls:sys_enter_unlinkat
{
    printf("Process %s (PID: %d) is deleting a file %s\n", comm, pid, str(args.pathname));
}
Attaching 1 probe...
Process rm (PID: 2269) is deleting a file test1
Process rm (PID: 2270) is deleting a file test2
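Several probe types from the probe table can be combined in one script. The following sketch is only an illustration: it uses BEGIN to print a banner, a tracepoint to report each execve() call, an interval probe to stop after ten seconds, and END to print a closing message. The ten-second duration and the messages are arbitrary choices for this example.

```bpftrace
#!/usr/bin/env bpftrace
// Sketch: combine built-in, tracepoint, and interval probes.
BEGIN
{
    printf("Tracing execve() for 10 seconds...\n");
}

tracepoint:syscalls:sys_enter_execve
{
    printf("%s (PID: %d) called execve\n", comm, pid);
}

interval:s:10
{
    exit();   // stop tracing; the END probe runs before bpftrace quits
}

END
{
    printf("Done.\n");
}
```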
Let’s convert this eBPF kernel code to bpftrace:
#define __TARGET_ARCH_x86
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

struct event {
    pid_t pid;
    char filename[256];
    umode_t mode;
};

struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); // Type of BPF map
    __uint(max_entries, 1024);                   // Maximum number of entries in the map
    __type(key, int);                            // Type of the key
    __type(value, int);                          // Type of the value
} mkdir SEC(".maps");

char LICENSE[] SEC("license") = "Dual BSD/GPL";

SEC("kprobe/do_mkdirat")
int BPF_KPROBE(do_mkdirat, int dfd, struct filename *name, umode_t mode)
{
    pid_t pid = bpf_get_current_pid_tgid() >> 32;
    struct event ev = {};

    ev.pid = pid;
    ev.mode = mode;
    const char *filename = BPF_CORE_READ(name, name);
    bpf_probe_read_str(ev.filename, sizeof(ev.filename), filename);
    bpf_perf_event_output(ctx, &mkdir, BPF_F_CURRENT_CPU, &ev, sizeof(ev));
    return 0;
}
Let’s write the same program in bpftrace, without maps for now, since they are explained shortly:
#!/usr/bin/env bpftrace
kprobe:do_mkdirat
{
    printf("PID: %d, mode: %d, filename: %s\n", pid, arg2, str(((struct filename *)arg1)->name));
}
The idea is to cast arg1 to a pointer to struct filename before accessing its name field.
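For comparison, here is a hedged sketch of the same trace using an fentry probe, where BTF supplies typed arguments through the args builtin and no cast is needed. It assumes a kernel built with BTF and that do_mkdirat can be attached via fentry on your system; bpftrace -lv 'fentry:do_mkdirat' should confirm both the probe and the argument names.

```bpftrace
#!/usr/bin/env bpftrace
// Sketch: fentry variant of the do_mkdirat trace.
// Assumptions: BTF is available and the arguments are named
// dfd, name, and mode, as in the BPF_KPROBE signature above.
fentry:do_mkdirat
{
    printf("PID: %d, mode: %d, filename: %s\n", pid, args.mode, str(args.name->name));
}
```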
bpftrace Maps
Maps in bpftrace are prefixed with @, such as @testmap. The following table is from the bpftrace manual, listing bpftrace map functions along with their corresponding descriptions.
Name | Description |
---|---|
avg | Calculate the running average of n between consecutive calls. |
clear | Clear all keys/values from a map. |
count | Count how often this function is called. |
delete | Delete a single key from a map. |
has_key | Return true (1) if the key exists in this map. Otherwise return false (0). |
hist | Create a log2 histogram of n using 2^k buckets per power of 2, 0 <= k <= 5, defaults to 0. |
len | Return the number of elements in a map. |
lhist | Create a linear histogram of n. lhist creates M ((max - min) / step) buckets in the range [min, max) where each bucket is step in size. |
max | Update the map with n if n is bigger than the current value held. |
min | Update the map with n if n is smaller than the current value held. |
stats | Combines the count, avg and sum calls into one. |
sum | Calculate the sum of all n passed. |
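zero | Set all values for all keys to zero. |
To see how maps compare with plain eBPF, consider the following eBPF kernel code, which uses two hash maps to flag a process that calls fork, then setuid(0), then execve: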
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);
    __type(value, u8);
} forks SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, u32);
    __type(value, u8);
} setuid SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_fork")
int trace_fork(struct trace_event_raw_sys_enter *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u8 val = 1;

    bpf_map_update_elem(&forks, &pid, &val, BPF_ANY);
    bpf_printk("Fork detected: PID %d\n", pid);
    return 0;
}

SEC("tracepoint/syscalls/sys_enter_setuid")
int trace_setuid(struct trace_event_raw_sys_enter *ctx)
{
    u32 uid = ctx->args[0];

    if (uid == 0) {
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        u8 val = 1;

        bpf_map_update_elem(&setuid, &pid, &val, BPF_ANY);
        bpf_printk("Setuid detected: PID %d\n", pid);
    }
    return 0;
}

SEC("tracepoint/syscalls/sys_enter_execve")
int trace_execve(struct trace_event_raw_sys_enter *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u8 *forked = bpf_map_lookup_elem(&forks, &pid);
    u8 *priv = bpf_map_lookup_elem(&setuid, &pid);

    if (forked && priv) {
        bpf_printk("Privilege escalation detected: fork, setuid(0), execve, PID %d\n", pid);
        bpf_send_signal(9);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
Let’s see the previous code in bpftrace:
#!/usr/bin/env bpftrace
tracepoint:syscalls:sys_enter_fork
{
    @forks[pid] = 1;
    printf("Fork detected: PID %d\n", pid);
}
tracepoint:syscalls:sys_enter_setuid
{
    // args.uid is the UID passed to setuid(), matching ctx->args[0] in the C version;
    // the uid builtin would instead be the caller's current user ID.
    if (args.uid == 0)
    {
        @setuid[pid] = 1;
        printf("Setuid detected: PID %d\n", pid);
    }
}
tracepoint:syscalls:sys_enter_execve
{
    if (@forks[pid] == 1 && @setuid[pid] == 1)
    {
        printf("Privilege escalation detected: fork, setuid(0), execve, PID %d\n", pid);
        signal(9);
    }
}
The first probe defines a map named forks and stores the current pid as key with 1 as value whenever the sys_enter_fork tracepoint is triggered.
@forks[pid] = 1;
The second probe defines a map named setuid and stores the current pid as key with 1 as value whenever the sys_enter_setuid tracepoint is triggered and the UID is zero.
@setuid[pid] = 1;
When sys_enter_execve is triggered, the script checks whether the current pid was recorded by both the sys_enter_fork and sys_enter_setuid probes.
if (@forks[pid] == 1 && @setuid[pid] == 1)
The signal function is the bpftrace equivalent of the bpf_send_signal helper function; signal(9) sends SIGKILL to terminate the process.
signal(9)
Because signal is considered an unsafe function, the script has to be run with --unsafe: sudo bpftrace --unsafe priv-esc.bt.
This code is much smaller and simpler than the eBPF kernel code, and no user-space code is needed.
The next script attaches probes to the sys_enter_read and sys_enter_write tracepoints (separated by a comma) and uses a map to count the number of these system calls per process name with the count() map function.
#!/usr/bin/env bpftrace
tracepoint:syscalls:sys_enter_read,
tracepoint:syscalls:sys_enter_write
{
    @syscalls[comm] = count();
}
interval:s:5 {
    printf("\033[H\033[2J");
    print(@syscalls);
}
Every 5 seconds, the interval probe clears the screen using the ANSI escape sequences in printf("\033[H\033[2J"); and then prints the contents of the syscalls map.
interval:s:5 {
    printf("\033[H\033[2J");
    print(@syscalls);
}
@syscalls[systemd-timesyn]: 1
@syscalls[systemd-journal]: 1
@syscalls[systemd]: 4
@syscalls[rtkit-daemon]: 8
@syscalls[sudo]: 10
@syscalls[gnome-shell]: 13
@syscalls[gvfsd-wsdd]: 16
@syscalls[bash]: 20
@syscalls[ls]: 26
@syscalls[bpftrace]: 47
@syscalls[sshd-session]: 818
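Other map functions from the table above are used the same way as count(). As a small sketch, the script below builds a log2 histogram with hist() of the number of bytes returned by read(), keyed by process name. It assumes the sys_exit_read tracepoint exposes its return value as the ret field; bpftrace prints all maps automatically when it exits, so no explicit print() is needed.

```bpftrace
#!/usr/bin/env bpftrace
// Sketch: per-process histogram of read() return sizes.
// Assumption: sys_exit_read provides the syscall return value as args.ret.
tracepoint:syscalls:sys_exit_read
{
    if (args.ret > 0)                   // skip errors and end-of-file
    {
        @bytes[comm] = hist(args.ret);  // log2 buckets of bytes read
    }
}
```

Stop the script with Ctrl-C to see one histogram per process name.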
bpftrace Tools
The following tools are from the bpftrace GitHub repository. They cover a wide range of functions, from tracing I/O and network events to monitoring process and syscall activity.
Name | Description |
---|---|
bashreadline.bt | Print entered bash commands system wide. |
biolatency.bt | Block I/O latency as a histogram. |
biosnoop.bt | Block I/O tracing tool, showing per I/O latency. |
biostacks.bt | Show disk I/O latency with initialization stacks. |
bitesize.bt | Show disk I/O size as a histogram. |
capable.bt | Trace security capability checks. |
cpuwalk.bt | Sample which CPUs are executing processes. |
dcsnoop.bt | Trace directory entry cache (dcache) lookups. |
execsnoop.bt | Trace new processes via exec() syscalls. |
gethostlatency.bt | Show latency for getaddrinfo/gethostbyname[2] calls. |
killsnoop.bt | Trace signals issued by the kill() syscall. |
loads.bt | Print load averages. |
mdflush.bt | Trace md flush events. |
naptime.bt | Show voluntary sleep calls. |
opensnoop.bt | Trace open() syscalls showing filenames. |
oomkill.bt | Trace OOM killer. |
pidpersec.bt | Count new processes (via fork). |
runqlat.bt | CPU scheduler run queue latency as a histogram. |
runqlen.bt | CPU scheduler run queue length as a histogram. |
setuids.bt | Trace the setuid syscalls: privilege escalation. |
ssllatency.bt | Summarize SSL/TLS handshake latency as a histogram. |
sslsnoop.bt | Trace SSL/TLS handshake, showing latency and return value. |
statsnoop.bt | Trace stat() syscalls for general debugging. |
swapin.bt | Show swapins by process. |
syncsnoop.bt | Trace sync() variety of syscalls. |
syscount.bt | Count system calls. |
tcpaccept.bt | Trace TCP passive connections (accept()). |
tcpconnect.bt | Trace TCP active connections (connect()). |
tcpdrop.bt | Trace kernel-based TCP packet drops with details. |
tcplife.bt | Trace TCP session lifespans with connection details. |
tcpretrans.bt | Trace TCP retransmits. |
tcpsynbl.bt | Show TCP SYN backlog as a histogram. |
threadsnoop.bt | List new thread creation. |
undump.bt | Capture UNIX domain socket packages. |
vfscount.bt | Count VFS calls. |
vfsstat.bt | Count some VFS calls, with per-second summaries. |
writeback.bt | Trace file system writeback events with details. |
xfsdist.bt | Summarize XFS operation latency distribution as a histogram. |