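I jumped on the bandwagon to learn eBPF (extended Berkeley Packet Filter), hoping it would help with kernel debugging. After a closer look, it turns out eBPF is mostly geared toward observing kernel behavior and collecting statistics on it, and it does not help much with debugging the kernel itself. Since the time is already spent, here are some brief notes, focused on the parts that are useful for debugging.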

  1. Overview: what eBPF can do

    eBPF currently supports the following program types, but these notes only look at kprobe and tracepoint.

    bool is_socket = strncmp(event, "socket", 6) == 0; // a network packet filter
    bool is_kprobe = strncmp(event, "kprobe/", 7) == 0; // determine whether a kprobe should fire or not
    bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0; // determine whether a kretprobe should fire or not
    bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0; // determine whether a tracepoint should fire or not
    bool is_xdp = strncmp(event, "xdp", 3) == 0; // a network packet filter run from the device-driver receive path
    bool is_perf_event = strncmp(event, "perf_event", 10) == 0; // determine whether a perf event handler should fire or not
    bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0; // a network packet filter for control groups
    bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0; // a network packet filter for control groups that is allowed to modify socket options
    bool is_sockops = strncmp(event, "sockops", 7) == 0; // a program for setting socket parameters
    bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0; // a network packet filter for forwarding packets between sockets
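The prefix checks above amount to a lookup from a section name to a program type. As a rough illustration only, the same matching can be modeled in plain Python. The prefixes are taken from the snippet; the `classify_section` helper and the returned labels are hypothetical names for this sketch, not part of any loader API.

```python
# Hypothetical user-space model of the prefix matching shown above.
# The prefixes come from the snippet; the function name and labels are made up.
SECTION_PREFIXES = [
    ("socket", "socket filter"),
    ("kprobe/", "kprobe"),
    ("kretprobe/", "kretprobe"),
    ("tracepoint/", "tracepoint"),
    ("xdp", "xdp"),
    ("perf_event", "perf event"),
    ("cgroup/skb", "cgroup skb"),
    ("cgroup/sock", "cgroup sock"),
    ("sockops", "sockops"),
    ("sk_skb", "sk_skb"),
]


def classify_section(event: str):
    """Return the label of the first matching prefix, or None."""
    for prefix, label in SECTION_PREFIXES:
        # Same idea as strncmp(event, prefix, strlen(prefix)) == 0
        if event.startswith(prefix):
            return label
    return None


print(classify_section("kprobe/vfs_read"))
print(classify_section("tracepoint/block/block_rq_complete"))
```

For the two program types these notes care about, the name after the slash identifies the kernel function or tracepoint to attach to.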
    The userspace app and the kernel-side program exchange data through the following data structures (maps):
    BPF_MAP_TYPE_HASH: a hash table
    BPF_MAP_TYPE_ARRAY: an array map, optimized for fast lookup speeds, often used for counters
    BPF_MAP_TYPE_PROG_ARRAY: an array of file descriptors corresponding to eBPF programs; used to implement jump tables and sub-programs to handle specific packet protocols
    BPF_MAP_TYPE_PERCPU_ARRAY: a per-CPU array, used to implement histograms of latency
    BPF_MAP_TYPE_PERF_EVENT_ARRAY: stores pointers to struct perf_event, used to read and store perf event counters
    BPF_MAP_TYPE_CGROUP_ARRAY: stores pointers to control groups
    BPF_MAP_TYPE_PERCPU_HASH: a per-CPU hash table
    BPF_MAP_TYPE_LRU_HASH: a hash table that only retains the most recently used items
    BPF_MAP_TYPE_LRU_PERCPU_HASH: a per-CPU hash table that only retains the most recently used items
    BPF_MAP_TYPE_LPM_TRIE: a longest-prefix match trie, good for matching IP addresses to a range
    BPF_MAP_TYPE_STACK_TRACE: stores stack traces
    BPF_MAP_TYPE_ARRAY_OF_MAPS: a map-in-map data structure
    BPF_MAP_TYPE_HASH_OF_MAPS: a map-in-map data structure
    BPF_MAP_TYPE_DEVMAP: for storing and looking up network device references
    BPF_MAP_TYPE_SOCKMAP: stores and looks up sockets and allows socket redirection with BPF helper functions
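To make the map idea concrete, here is a minimal sketch using BCC's Python front end (see the bcc section below). The in-kernel program increments a hash map on every `vfs_read` call, and the userspace side reads the same map afterwards. The map name `counts`, the function name `count_reads`, and the choice of `vfs_read` are picked purely for illustration. Actually running it needs root and an installed bcc.

```python
# Sketch only: requires root and the bcc Python package to actually run.
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u32, u64);   // a BPF_MAP_TYPE_HASH map: tgid -> call count

int count_reads(struct pt_regs *ctx) {
    u32 tgid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(tgid);
    return 0;
}
"""


def main(duration_s: int = 5) -> None:
    from time import sleep
    from bcc import BPF  # imported lazily so this file loads without bcc

    b = BPF(text=BPF_PROGRAM)                     # compile and load the program
    b.attach_kprobe(event="vfs_read", fn_name="count_reads")
    sleep(duration_s)                             # let the kernel side count
    for tgid, count in b["counts"].items():       # read the shared map
        print(f"tgid {tgid.value}: {count.value} vfs_read calls")
```

Calling `main()` from a root Python session (for example under `sudo python3`) should print one line per process that called `vfs_read` during the sampling window.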

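  2. Why use eBPF

  • Q. A kprobe kernel module can be written directly, so why wrap it in eBPF?
  • A. eBPF code runs in an in-kernel VM and must pass the verifier before it is loaded, so a buggy program is rejected up front instead of panicking the running kernel.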
  3. Handy tool: bcc. Installation instructions are on the BPF Compiler Collection (BCC) page listed in the references below, or simply install the snap package.
    snap install bcc
    # snap-installed tools are prefixed with "bcc.", e.g. sudo bcc.biotop

    /*
    * This program traces functions and frequency counts them with their entire
    * stack trace, summarized in-kernel for efficiency.
    */
    sudo /usr/share/bcc/tools/stackcount -K hrtimer_init_sleeper

    /* 
    * trace probes functions you specify and displays trace messages if a particular
    * condition is met. You can control the message format to display function
    * arguments and return values.
    */
    /*
    * "retval": "PT_REGS_RC(ctx)",
    * "arg1": "PT_REGS_PARM1(ctx)",
    * "arg2": "PT_REGS_PARM2(ctx)",
    * "arg3": "PT_REGS_PARM3(ctx)",
    * "arg4": "PT_REGS_PARM4(ctx)",
    * "arg5": "PT_REGS_PARM5(ctx)",
    * "arg6": "PT_REGS_PARM6(ctx)",
    * "$uid": "(unsigned)(bpf_get_current_uid_gid() & 0xffffffff)",
    * "$gid": "(unsigned)(bpf_get_current_uid_gid() >> 32)",
    * "$pid": "(unsigned)(bpf_get_current_pid_tgid() & 0xffffffff)",
    * "$tgid": "(unsigned)(bpf_get_current_pid_tgid() >> 32)",
    * "$cpu": "bpf_get_smp_processor_id()"
    */
    sudo /usr/share/bcc/tools/trace '::sys_execve "%s", arg1'
    PID COMM FUNC -
    4402 bash sys_execve /usr/bin/man
    4411 man sys_execve /usr/local/bin/less
    4411 man sys_execve /usr/bin/less
    4410 man sys_execve /usr/local/bin/nroff
    4410 man sys_execve /usr/bin/nroff
    4409 man sys_execve /usr/local/bin/tbl
    4409 man sys_execve /usr/bin/tbl
    4408 man sys_execve /usr/local/bin/preconv
    4408 man sys_execve /usr/bin/preconv
    4415 nroff sys_execve /usr/bin/locale
    4416 nroff sys_execve /usr/bin/groff
    4418 groff sys_execve /usr/bin/grotty
    4417 groff sys_execve /usr/bin/troff
    ^C
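The `$pid`/`$tgid` and `$uid`/`$gid` aliases in the table above are just the two 32-bit halves of a single 64-bit helper return value. A small Python sketch of that unpacking follows. The function names and sample numbers are invented for the illustration. Note that in kernel terms the upper half (tgid) is what userspace tools usually call the process ID, while the lower half (pid) is the thread ID.

```python
MASK32 = 0xFFFFFFFF


def split_pid_tgid(pid_tgid: int):
    """Mirror "$pid" (low 32 bits) and "$tgid" (high 32 bits)."""
    return pid_tgid & MASK32, pid_tgid >> 32


def split_uid_gid(uid_gid: int):
    """Mirror "$uid" (low 32 bits) and "$gid" (high 32 bits)."""
    return uid_gid & MASK32, uid_gid >> 32


# Hypothetical value: thread 4407 belonging to process (tgid) 4402.
sample = (4402 << 32) | 4407
pid, tgid = split_pid_tgid(sample)
print(pid, tgid)  # 4407 4402
```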
    # Trace the block:block_rq_complete tracepoint and print its nr_sector field; -T adds the TIME column
    sudo /usr/share/bcc/tools/trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T
    TIME PID COMM FUNC -
    01:23:51 0 swapper/0 block_rq_complete sectors=8
    01:23:55 10017 kworker/u64: block_rq_complete sectors=1
    01:23:55 0 swapper/0 block_rq_complete sectors=8
    ^C
    # Trace returns from __kmalloc that returned a null pointer
    sudo /usr/share/bcc/tools/trace 'r::__kmalloc (retval == 0) "kmalloc failed!"'
    /*
    * This program traces functions, tracepoints, or USDT probes that match a
    * specified pattern, and when Ctrl-C is hit prints a summary of their count
    * while tracing.
    */
    sudo /usr/share/bcc/tools/funccount 'vfs_*'
    Tracing... Ctrl-C to end.
    ^C
    FUNC COUNT
    vfs_create 1
    vfs_rename 1
    vfs_fsync_range 2
    vfs_lock_file 30
    vfs_fstatat 152
    vfs_fstat 154
    vfs_write 166
    vfs_getattr_nosec 262
    vfs_getattr 262
    vfs_open 264
    vfs_read 470
    Detaching...
    sudo /usr/share/bcc/tools/funccount 't:block:*'
    Tracing 19 functions for "t:block:*"... Hit Ctrl-C to end.
    ^C
    FUNC COUNT
    block:block_rq_complete 7
    block:block_rq_issue 7
    block:block_getrq 7
    block:block_rq_insert 7
    Detaching...
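Conceptually, funccount selects probe points by pattern and keeps one counter per match. The following is a plain Python model of that idea, not how the tool is implemented. `count_matching` and the sample event stream are hypothetical, and shell-style matching via `fnmatch` is an assumption of this sketch.

```python
from collections import Counter
from fnmatch import fnmatchcase


def count_matching(pattern, events):
    """Count events whose name matches the glob, least frequent first."""
    counts = Counter(e for e in events if fnmatchcase(e, pattern))
    return sorted(counts.items(), key=lambda kv: kv[1])


# Hypothetical stream of function names, as if reported by probes.
events = ["vfs_read", "vfs_open", "vfs_read", "do_sys_open", "vfs_write", "vfs_read"]

print(f"{'FUNC':<12}COUNT")
for name, count in count_matching("vfs_*", events):
    print(f"{name:<12}{count}")
```

The real tool does the counting inside the kernel and only copies the totals out on Ctrl-C, which is what keeps its overhead low.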
    /*
    * This program traces hard interrupts (irqs), and stores timing statistics
    * in-kernel for efficiency.
    */
    sudo /usr/share/bcc/tools/hardirqs

    # -d : distribution histogram
    sudo /usr/share/bcc/tools/hardirqs -d
    /*
    * This program traces soft interrupts (irqs), and stores timing statistics
    * in-kernel for efficiency.
    */
    sudo /usr/share/bcc/tools/softirqs

    # -d : distribution histogram
    sudo /usr/share/bcc/tools/softirqs -d
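The `-d` histograms are typically printed with power-of-two buckets (0 -> 1, 2 -> 3, 4 -> 7, and so on). Below is a rough user-space model of that bucketing, assuming this layout and using made-up duration samples. `log2_bucket` and `histogram` are hypothetical helper names, not BCC functions.

```python
from collections import Counter


def log2_bucket(value: int):
    """Return the (low, high) power-of-two bucket that contains value."""
    slot = max(value.bit_length(), 1)        # 0 and 1 share the first bucket
    low = 0 if slot == 1 else 1 << (slot - 1)
    high = (1 << slot) - 1
    return low, high


def histogram(samples):
    """Count samples per bucket, ordered by bucket range."""
    return sorted(Counter(log2_bucket(s) for s in samples).items())


samples_usecs = [0, 1, 2, 3, 5, 7, 9, 100]   # hypothetical irq durations
for (low, high), count in histogram(samples_usecs):
    print(f"{low:>4} -> {high:<4} : {count}")
```

Bucketing like this needs only one small counter array per interrupt, which fits the "stores timing statistics in-kernel for efficiency" description above.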

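References

  1. eBPF 簡史 (A Brief History of eBPF) - a very comprehensive introduction to eBPF, in Chinese.
  2. Linux Enhanced BPF (eBPF) Tracing Tools - the home base for eBPF; all the eBPF learning resources can be found here.
  3. BPF Compiler Collection (BCC) - a toolkit written in Python that neatly integrates the kernel-side program and the userspace app into Python, greatly lowering the barrier to using eBPF.
  4. bcc Tutorial - a detailed walkthrough of the BCC tools.
  5. bcc Reference Guide - the BCC API documentation.
  6. A dynamic tracer for Linux - one-liner style programs that are more compact than BCC; it can currently use kernel-provided debugging facilities such as kprobe/kretprobe.
  7. BPF samples in Kernel - a collection of BPF examples in the Linux kernel tree.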