Linux:ftrace kprobetrace

编程入门行业动态更新时间:2024-10-08 08:23:03

Linux:ftrace kprobetrace

文章目录

参考
相关Linux配置
- CONFIG_TRACING
注意事项
- trace 文件里的pid
- mount
- ftrace
- 相关的sysctl配置
- - function_graph
  - function
  - 如果set_ftrace_filter 设置为空，就是所有的function 都做trace
- 实现机制
- dump stack
- event
- 问题
- - 设置 tracing_on 失败：
  - set_ftrace_filter里的内容不能删除
- 开放问题
- - 是否支持kernel module里的函数的跟踪
kprobetrace
- 简单的步骤
- - 错误write to 1 failed [No such file or directory]
  - 错误实例
  - 错误
- 简介
- 类型
mellanox firmware 相关的trace
性能影响
调查函数执行时间

参考

在/sys/kernel/debug/tracing/README 这个是一个简易的使用说明
详细的可以在这里找到
/
.html
/ ;An introduction to KProbes
/ ; systemtap与bpftrace的对比
/
这个上面说需要关闭这个配置项：STRICT_MEMORY_RWX；不知真假，先放着这里作为一个问题调试的参考。

.html
.18/trace/ftrace.html

注意事项

trace 文件里的pid

是一个父进程的id，比如如果在ksh界面执行了一个操作，这个操作下调用的函数，其实记录的是ksh进程的pid。而不是新命令clone出来的pid。

mount

mount -t tracefs nodev /sys/kernel/tracing
redhat8, 默认的目录是在 /sys/kernel/debug/tracing/

ftrace

实现机制

通过反汇编发现，每个函数都一开始就会调用函数 <fentry>

(gdb) disass __hrtimer_run_queues
Dump of assembler code for function __hrtimer_run_queues:0xffffffff8117d200 <+0>:     callq  0xffffffff81a01900 <__fentry__>0xffffffff8117d205 <+5>:     push   %r15

这个函数的定义在x86下，arch/x86/kernel/ftrace_32/64.S

# define function_hook	__fentry__
EXPORT_SYMBOL(__fentry__)#ifdef CONFIG_DYNAMIC_FTRACESYM_FUNC_START(function_hook)retq
SYM_FUNC_END(function_hook)#ifdef CONFIG_FUNCTION_GRAPH_TRACER
SYM_FUNC_START(ftrace_graph_caller)/* Saves rbp into %rdx and fills first parameter  */save_mcount_regsleaq MCOUNT_REG_SIZE+8(%rsp), %rsimovq $0, %rdx	/* No framepointers needed */call	prepare_ftrace_returnrestore_mcount_regsretq
SYM_FUNC_END(ftrace_graph_caller)/include/linux/linkage.h
/** The same as SYM_FUNC_START_ALIAS, but we will need to distinguish these two later.*/
#define SYM_FUNC_START(name)				\SYM_START(name, SYM_L_GLOBAL, SYM_A_ALIGN)
#endif/* SYM_START -- use only if you have to */
#ifndef SYM_START
#define SYM_START(name, linkage, align...)		\SYM_ENTRY(name, linkage, align)
#endif/* SYM_ENTRY -- use only if you have to for non-paired symbols */
#ifndef SYM_ENTRY
#define SYM_ENTRY(name, linkage, align...)		\linkage(name) ASM_NL				\align ASM_NL					\name:
#endif

最终会使用
#define MCOUNT_ADDR ((unsigned long)(fentry))
这个MCOUNT_ADDR 表示fentry的函数地址
#define MCOUNT_INSN_SIZE 5 /* sizeof mcount call */

ftrace_make_nop/ftrace_code_disable，在系统启动或者是加载module时，将这个地址调用指令，改成nop；在disable ftrace时也会改成nop指令

dump stack

echo :stacktrace > set_ftrace_filter

            triggers: a command to perform when function is hitFormat: <function>:<trigger>[:count]trigger: traceon, traceoffenable_event:<system>:<event>disable_event:<system>:<event>stacktracesnapshotdumpcpudumpexample: echo do_fault:traceoff > set_ftrace_filterecho do_trap:traceoff:3 > set_ftrace_filter

event

有些trace event，例如
下面的 trace_kfree_skb 就是一个trace点；可以通过
/sys/kernel/debug/tracing/events/skb/kfree_skb

[root@rms01-host01 kfree_skb]# echo 1 > enable
[root@rms01-host01 kfree_skb]# more format
name: kfree_skb
ID: 1278
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

    field:void * skbaddr;   offset:8;       size:8; signed:0;field:void * location;  offset:16;      size:8; signed:0;field:unsigned short protocol;  offset:24;      size:2; signed:0;

print fmt: “skbaddr=%p protocol=%u location=%p”, REC->skbaddr, REC->protocol, REC->location

void kfree_skb(struct sk_buff *skb)
{if (!skb_unref(skb))return;trace_kfree_skb(skb, __builtin_return_address(0));__kfree_skb(skb);
}

问题

设置 tracing_on 失败：

sec01-fw-a:root/sys/kernel/debug/tracing:
#echo 1 > tracing_on
sec01-fw-a:root/sys/kernel/debug/tracing:
#more tracing_on
0

如果使用tail -f trace 命令，打开trace文件时，怎么设置都不成功。

set_ftrace_filter里的内容不能删除

在这个文件里设置：icmp_reply.constprop.35:stacktrace 这个之后，怎么也删不掉这一行。
解决办法就是：echo ‘!skb_release_data:stacktrace’ > set_ftrace_filter
在前面加叹号，然后需要使用单引号引起来这个内容，echo到这个文件，就好了。

开放问题

是否支持kernel module里的函数的跟踪

其实可以使用kprobe 或者user probe的功能更，查看kernel module里的行为；还没实践过。

kprobetrace

简单的步骤

如果执行命令出现错误，可用使用dmesg查看错误原因。
echo ‘r:myretprobe do_sys_open $retval’ >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable
echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace

echo ‘p:myretprobe1 unix_dgram_sendmsg msg=%ax’ >> /sys/kernel/debug/tracing/kprobe_events
echo ‘p:myretprobe1 unix_dgram_sendmsg arg3=%ax’ >> /sys/kernel/debug/tracing/kprobe_events
这里可以将msg变量打印出来，以及第三个参数。
static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)

错误write to 1 failed [No such file or directory]

#echo ‘r:myretprobe acl_permission_check $retval’ >> /sys/kernel/debug/tracing/kprobe_events
-ksh: echo: write to 1 failed [No such file or directory]
这个错误的原因是，acl_permission_check 这个函数不可以做kprobe；
需要查看 available_filter_functions 看函数是否可用kprobe

错误实例

这个错误的原因是，当从这个网页复制命令到Console时，单引号其实是中文格式，需要改一改。
[443041.985255] Probe definition must be started with ‘p’, ‘r’ or ‘-’.
[443078.589547] Probe definition must be started with ‘p’, ‘r’ or ‘-’.

错误

#echo ‘r:myretprobe __sched_setscheduler $retval’ >> /sys/kernel/debug/tracing/kprobe_events
-ksh: echo: write to 1 failed [Device or resource busy]
需要将真正tracing的事件停掉；才能修改这个probe

简介

这个kprobetrace类似基于事件的tracepoint（跟踪点）。不过这个是基于kprobes(kprobe and kretprobe)。所有kprobe可以探测的地方，都可以使用这个kprobetrace功能。也就意味着，除了带有__kprobes/nokprobe_inline 关键字的函数，还有带有标记NOKPROBE_SYMBOL的函数外，所有的其他函数都可以做kprobetrace。而且是可以动态做探测，不像tracepoint，是静态编译在内核模块里。
需要打开设置：CONFIG_KPROBE_EVENTS=y.
像事件跟踪器，kprobetrace，也不是通过current_tracer激活。而是通过向 /sys/kernel/debug/tracing/kprobe_events文件添加跟踪点。然后通过文件/sys/kernel/debug/tracing/events/kprobes//enable激活跟踪。
除了kprobe_events，也可以使用文件：/sys/kernel/debug/tracing/dynamic_events。T这个统一的接口可以访问其他的动态事件。
语法：

 p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS]  : 设置一个探测r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]  : 设置一个返回值探测p:[GRP/]EVENT] [MOD:]SYM[+0]%return [FETCHARGS]       : 设置一个返回值探测-:[GRP/]EVENT                                         : 清楚探测GRP            : 组名称. 如果省略，默认使用"kprobes"。
EVENT          : 事件名，如果省略，通过SYM+offs或者MEMADDR生成。
MOD            : 提供SYM的模块名称。
SYM[+offs]     : Symbol+offset要插入的探测点。
SYM%return     : 符合的返回地址
MEMADDR        : 探测插入的地址。
MAXACTIVE      : Maximum number of instances of the specified function thatcan be probed simultaneously, or 0 for the default valueas defined in Documentation/trace/kprobes.rst section 1.3.1.

FETCHARGS      : Arguments. Each probe can have up to 128 args.%REG          : Fetch register REG@ADDR         : Fetch memory at ADDR (ADDR should be in kernel)@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)$stackN       : Fetch Nth entry of stack (N >= 0)$stack        : Fetch stack address.$argN         : Fetch the Nth function argument. (N >= 1) (\*1)$retval       : Fetch return value.(\*2)$comm         : Fetch current task comm.+|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)\IMM          : Store an immediate value to the argument.NAME=FETCHARG : Set NAME as the argument name of FETCHARG.FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types(x8/x16/x32/x64), "string", "ustring" and bitfieldare supported.(\*1) only for the probe on function entry (offs == 0).(\*2) only for return probe.(\*3) this is useful for fetching a field of data structures.(\*4) "u" means user-space dereference. See :ref:`user_mem_access`.

类型

Several types are supported for fetch-args. Kprobe tracer will access memory by given type. Prefix ‘s’ and ‘u’ means those types are signed and unsigned respectively. ‘x’ prefix implies it is unsigned. Traced arguments are shown in decimal (‘s’ and ‘u’) or hexadecimal (‘x’). Without type casting, ‘x32’ or ‘x64’ is used depends on the architecture (e.g. x86-32 uses x32, and x86-64 uses x64). These value types can be an array. To record array data, you can add ‘[N]’ (where N is a fixed number, less than 64) to the base type. E.g. ‘x16[4]’ means an array of x16 (2bytes hex) with 4 elements. Note that the array can be applied to memory type fetchargs, you can not apply it to registers/stack-entries etc. (for example, ‘ s t a c k 1 : x 8 [ 8 ] ’ i s w r o n g , b u t ‘ + 8 ( stack1:x8[8]’ is wrong, but ‘+8( stack1:x8[8]’iswrong,but‘+8(stack):x8[8]’ is OK.) String type is a special type, which fetches a “null-terminated” string from kernel space. This means it will fail and store NULL if the string container has been paged out. “ustring” type is an alternative of string for user-space. See User Memory Access for more info… The string array type is a bit different from other types. For other base types, [1] is equal to (e.g. +0(%di):x32[1] is same as +0(%di):x32.) But string[1] is not equal to string. The string type itself represents “char array”, but string array type represents “char * array”. So, for example, +0(%di):string[1] is equal to +0(+0(%di)):string. Bitfield is another special type, which takes 3 parameters, bit-width, bit- offset, and container-size (usually 32). The syntax is:

b@/

Symbol type(‘symbol’) is an alias of u32 or u64 type (depends on BITS_PER_LONG) which shows given pointer in “symbol+offset” style. For $comm, the default type is “string”; any other type is invalid.
User Memory Access

Kprobe events supports user-space memory access. For that purpose, you can use either user-space dereference syntax or ‘ustring’ type.

The user-space dereference syntax allows you to access a field of a data structure in user-space. This is done by adding the “u” prefix to the dereference syntax. For example, +u4(%si) means it will read memory from the address in the register %si offset by 4, and the memory is expected to be in user-space. You can use this for strings too, e.g. +u0(%si):string will read a string from the address in the register %si that is expected to be in user- space. ‘ustring’ is a shortcut way of performing the same task. That is, +0(%si):ustring is equivalent to +u0(%si):string.

Note that kprobe-event provides the user-memory access syntax but it doesn’t use it transparently. This means if you use normal dereference or string type for user memory, it might fail, and may always fail on some archs. The user has to carefully check if the target data is in kernel or user space.
Per-Probe Event Filtering

Per-probe event filtering feature allows you to set different filter on each probe and gives you what arguments will be shown in trace buffer. If an event name is specified right after ‘p:’ or ‘r:’ in kprobe_events, it adds an event under tracing/events/kprobes/, at the directory you can see ‘id’, ‘enable’, ‘format’, ‘filter’ and ‘trigger’.

enable:

You can enable/disable the probe by writing 1 or 0 on it.

format:

This shows the format of this probe event.

filter:

You can write filtering rules of this event.

id:

This shows the id of this probe event.

trigger:

This allows to install trigger commands which are executed when the event is hit (for details, see Event Tracing, section 6).

Event Profiling

You can check the total number of probe hits and probe miss-hits via /sys/kernel/debug/tracing/kprobe_profile. The first column is event name, the second is the number of probe hits, the third is the number of probe miss-hits.
Kernel Boot Parameter

You can add and enable new kprobe events when booting up the kernel by “kprobe_event=” parameter. The parameter accepts a semicolon-delimited kprobe events, which format is similar to the kprobe_events. The difference is that the probe definition parameters are comma-delimited instead of space. For example, adding myprobe event on do_sys_open like below

p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)

should be below for kernel boot parameter (just replace spaces with comma)

p:myprobe,do_sys_open,dfd=%ax,filename=%dx,flags=%cx,mode=+4($stack)

Usage examples

To add a probe as a new event, write a new definition to kprobe_events as below:

echo ‘p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)’ > /sys/kernel/debug/tracing/kprobe_events

This sets a kprobe on the top of do_sys_open() function with recording 1st to 4th arguments as “myprobe” event. Note, which register/stack entry is assigned to each function argument depends on arch-specific ABI. If you unsure the ABI, please try to use probe subcommand of perf-tools (you can find it under tools/perf/). As this example shows, users can choose more familiar names for each arguments.

echo ‘r:myretprobe do_sys_open $retval’ >> /sys/kernel/debug/tracing/kprobe_events

This sets a kretprobe on the return point of do_sys_open() function with recording return value as “myretprobe” event. You can see the format of these events via /sys/kernel/debug/tracing/events/kprobes//format.

cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format
name: myprobe
ID: 780
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1;signed:0;
field:int common_pid; offset:4; size:4; signed:1;

    field:unsigned long __probe_ip; offset:12;      size:4; signed:0;field:int __probe_nargs;        offset:16;      size:4; signed:1;field:unsigned long dfd;        offset:20;      size:4; signed:0;field:unsigned long filename;   offset:24;      size:4; signed:0;field:unsigned long flags;      offset:28;      size:4; signed:0;field:unsigned long mode;       offset:32;      size:4; signed:0;

print fmt: “(%lx) dfd=%lx filename=%lx flags=%lx mode=%lx”, REC->__probe_ip,
REC->dfd, REC->filename, REC->flags, REC->mode

You can see that the event has 4 arguments as in the expressions you specified.

下面的命令可以清除所有的探测点：
echo > /sys/kernel/debug/tracing/kprobe_events

或者，使用下面的命令定点清除：
清除kprobe 跟踪事件，下面这个命令可以清除选定的探测点。
echo -:myprobe >> kprobe_events

Right after definition, each event is disabled by default. For tracing these events, you need to enable it.
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable
开启跟踪：
#echo 1 > tracing_on
关闭：
#echo 0 > tracing_on
通过文件 /sys/kernel/debug/tracing/trace，查看收集到的日志。

cat /sys/kernel/debug/tracing/trace
#tracer: nop
#					TASK-PID    CPU#    TIMESTAMP  FUNCTION
#						| |       |          |         |<...>-1447  [001] 1038282.286878: myretprobe: (sys_openat+0xc/0xe <- do_sys_open) $retval=fffffffffffffffe<...>-1447  [001] 1038282.286885: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=40413c flags=8000 mode=1b6<...>-1447  [001] 1038282.286915: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3<...>-1447  [001] 1038282.286976: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3

每一行显示了什么时候内核遇到了一个事件。“<- SYMBOL” 表示内核从SYMBOL退出。例如上面的例子，
“sys_open+0x1b/0x1d <- do_sys_open”就是说内核从do_sys_open退出，回到 sys_open+0x1b。

mellanox firmware 相关的trace

3.5.1.8Firmware Tracer
This mechanism allows for the device’s FW/HW to log important events into the event tracing system (/sys/kernel/debug/tracing) without requiring any Mellanox tool.
This feature is enabled by default, and can be controlled using sysfs commands.
To disable the feature:
To enable the feature:
To view FW traces using vim text editor:
[257480.719070] mlx4_core 0000:00:05.0: Internal error detected:
[257480.726019] mlx4_core 0000:00:05.0: buf[00]: 0fffffff
[257480.732082] mlx4_core 0000:00:05.0: buf[01]: 00000000
…
[257480.806531] mlx4_core 0000:00:05.0: buf[0f]: 00000000
[257480.811534] mlx4_core 0000:00:05.0: device is going to be reset
[257482.781154] mlx4_core 0000:00:05.0: crdump: Crash snapshot collected to /proc/driver/mlx4_core/crdump/0000:00:05.0
[257483.789230] mlx4_core 0000:00:05.0: device was reset successfully
In mlx4 driver, CRDUMP will not be collected if internal_err_reset module parameter is set to 0.
To be able to use this feature, trace points must be enabled in the kernel.
echo 0 > /sys/kernel/debug/tracing/events/mlx5/fw_tracer/enable
echo 1 > /sys/kernel/debug/tracing/events/mlx5/fw_tracer/enable
vim /sys/kernel/debug/tracing/trace

性能影响

如果使用iperf发包测试，cpu使用率上升一倍；
速率减小：9-10倍
perf的占用情况。

Samples: 50K of event 'cpu-clock', Event count (approx.): 12703250000Overhead  Command  Shared Object      Symbol        
-   10.75%  iperf3   [kernel.kallsyms]  [k] pvclock_clocksource_reada- pvclock_clocksource_read    - 97.88% kvm_sched_clock_read   - sched_clock           - 99.36% trace_clock_local- 99.25% rb_reserve_next_event               ring_buffer_lock_reserve                trace_buffer_lock_reserve               trace_function    function_trace_call  aftrace_ops_list_func a+ ftrace_call  + 0.75% rb_move_tail.isra.38                 + 0.64% local_clock  + 2.01% sched_clock        
-    9.48%  iperf3   [kernel.kallsyms]  [k] function_trace_call  - function_trace_call         - 58.23% ftrace_ops_list_func   - ftrace_call

调查函数执行时间

commit 8e1e1df29d837c589c8b4d7b49864481ff7795b8
Author: Byungchul Park <byungchul.park@lge>
Date:   Mon Nov 24 09:34:19 2014 +0900tracing: Add additional marks to signal very large time deltasCurrently, function graph tracer prints "!" or "+" just beforefunction execution time to signal a function overhead, dependingon the time. And some tracers tracing latency also print "!" or"+" just after time to signal overhead, depending on the intervalbetween events. Even it is usually enough to do that, we sometimesneed to signal for bigger execution time than 100 micro seconds.For example, I used function graph tracer to detect if there isany case that exit_mm() takes too much time. I did following stepsin /sys/kernel/debug/tracing. It was easier to detect very largeexcution time with patched kernel than with original kernel.$ echo exit_mm > set_graph_function$ echo function_graph > current_tracer$ echo > trace$ cat trace_pipe > $LOGFILE... (do something and terminate logging)$ grep "\\$" $LOGFILE3) $ 22082032 us |                      } /* kernel_map_pages */3) $ 22082040 us |                    } /* free_pages_prepare */3) $ 22082113 us |                  } /* free_hot_cold_page */3) $ 22083455 us |                } /* free_hot_cold_page_list */3) $ 22083895 us |              } /* release_pages */3) $ 22177873 us |            } /* free_pages_and_swap_cache */3) $ 22178929 us |          } /* unmap_single_vma */3) $ 22198885 us |        } /* unmap_vmas */3) $ 22206949 us |      } /* exit_mmap */3) $ 22207659 us |    } /* mmput */3) $ 22207793 us |  } /* exit_mm */And then, it was easy to find out that a schedule-out occured bysub_preempt_count() within kernel_map_pages().To detect very large function exection time caused by either problematicfunction implementation or scheduling issues, this patch can be useful.

#undef MARK
#define MARK(v, s) {.val = v, .sym = s}
/* trace overhead mark */
static const struct trace_mark {unsigned long long	val; /* unit: nsec */char			sym;
} mark[] = {MARK(1000000000ULL	, '$'), /* 1 sec */MARK(100000000ULL	, '@'), /* 100 msec */MARK(10000000ULL	, '*'), /* 10 msec */MARK(1000000ULL		, '#'), /* 1000 usecs */MARK(100000ULL		, '!'), /* 100 usecs */MARK(10000ULL		, '+'), /* 10 usecs */
};
#undef MARK

更多推荐

Linux:ftrace kprobetrace

本文发布于:2024-02-07 00:21:56，感谢您对本站的认可！

本文链接:https://www.elefans.com/category/jswz/34/1751738.html