Profiling the FreeBSD kernel with DTrace
I'm looking to improve interface destruction time with FreeBSD. Destroying thousands of interfaces takes several minutes on my test machine running -CURRENT, and while -- admittedly -- my use case may be an unusual one, I'd like to understand what's taking the system so long.

From my initial observations, I was able to establish that most of the time is spent waiting somewhere inside if_detach_internal(). So in an attempt to profile this function, I came up with the following DTrace script:
#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option dynvarsize=256m

fbt:kernel:if_detach_internal:entry
{
    self->traceme = 1;
    t[probefunc] = timestamp;
}

fbt:kernel:if_detach_internal:return
{
    dt = timestamp - t[probefunc];
    @ft[probefunc] = sum(dt);
    t[probefunc] = 0;
    self->traceme = 0;
}

fbt:kernel::entry
/self->traceme/
{
    t[probefunc] = timestamp;
}

fbt:kernel::return
/self->traceme/
{
    dt = timestamp - t[probefunc];
    @ft[probefunc] = sum(dt);
    t[probefunc] = 0;
}
By hooking the entry and return fbt probes, I'm expecting to get a list of function names and cumulative execution times for every function called by if_detach_internal() (no matter the stack depth), and to filter out everything else.

What I'm getting, however, looks like this (destroying 250 interfaces):
  callout_when                            1676
  sched_load                              1779
  if_rele                                 1801
  [...]
  rt_unlinkrte                     10296062843
  sched_switch                     10408456866
  rt_checkdelroute                 11562396547
  rn_walktree                      12404143265
  rib_walk_del                     12553013469
  if_detach_internal               24335505097
  uma_zfree_arg                 25045046322788
  intr_event_schedule_thread    58336370701120
  swi_sched                     83355263713937
  spinlock_enter               116681093870088
  [...]
  spinlock_exit               4492719328120735
  cpu_search_lowest          16750701670277714
Timing information for at least some of the functions seems to make sense, but I would expect if_detach_internal() to be the last entry in the list, with nothing taking longer than it, since this function is at the top of the call tree I'm trying to profile.

Clearly, that is not the case, as I'm also getting measurements for other functions (uma_zfree_arg(), swi_sched(), etc.) with seemingly absurd execution times. These results completely destroy my trust in everything else DTrace tells me here.

What am I missing? Is this approach sound at all?
I'll prefix my comments with the fact that I've not used DTrace on FreeBSD, only on macOS/OS X, so there might be something platform-specific at play here that I'm not aware of. With that out of the way:
- I'm slightly uneasy about your use of the global associative array t. You might want to make it thread-local (self->t), because as it stands, your code can produce junk results if if_detach_internal is called from multiple threads simultaneously.
- The use of the global dt variable is similarly dangerous and thread-unsafe. It really should be this->dt everywhere (a clause-local variable).
无处不在(子句局部变量)。fbt:kernel::entry /self->traceme/
will be invoked for if_detach_internal
itself .fbt:kernel::entry /self->traceme/
将被调用if_detach_internal
本身。 This is because the latter function of course matches the wildcard, and actions are executed in the order in which they appear in the script, so by the time the predicate on the wildcard entry
action is checked, the non-wildcard action will have set self->traceme = 1;
entry
动作的谓词时,非通配符动作将设置self->traceme = 1;
Double-setting the timestamp like this should cause no ill effects, but judging by the way the code is written, you may have been unaware that this is in fact what it does, which could cause problems if you make further changes down the line. Unfortunately, DTrace scoping rules are rather unintuitive, in that everything is global and thread-unsafe by default.不幸的是,DTrace 范围规则相当不直观,因为默认情况下一切都是全局的且线程不安全的。 And yes, this still bites me every now and then, even after having written a fair amount of DTrace script code.
是的,即使在编写了大量的 DTrace 脚本代码之后,这仍然时不时地困扰着我。
I don't know if following the above advice will fix your problem entirely; if not, please update your question accordingly and drop me a comment below and I'll take another look.
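For illustration, applying those two fixes to the original script might look something like the following. This is an untested sketch, not a verified fix: the timestamps move into a thread-local associative array (self->t), the delta becomes clause-local (this->dt), and the clause clearing the flag is placed last so the wildcard return clause still records if_detach_internal's own time.

```d
#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option dynvarsize=256m

fbt:kernel:if_detach_internal:entry
{
    self->traceme = 1;
    self->t[probefunc] = timestamp;
}

/*
 * The wildcard also matches if_detach_internal itself; the resulting
 * double-set of the timestamp is harmless, as noted above.
 */
fbt:kernel::entry
/self->traceme/
{
    self->t[probefunc] = timestamp;
}

fbt:kernel::return
/self->traceme && self->t[probefunc]/
{
    this->dt = timestamp - self->t[probefunc];
    @ft[probefunc] = sum(this->dt);
    self->t[probefunc] = 0;
}

/*
 * Placed after the wildcard return clause: clauses for the same probe
 * fire in script order, so the time for if_detach_internal is recorded
 * before the flag is cleared.
 */
fbt:kernel:if_detach_internal:return
{
    self->traceme = 0;
}
```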
This is another variation of a really simple but extremely useful DTrace script that I've often used to find out where any kernel is actually spending most of its time:
#!/usr/sbin/dtrace -s

profile:::profile-1001hz
/arg0/
{
    @[ stack() ] = count();
}
That profiles the kernel's stack traces, and when the script exits via CTRL-C or some other method, it will print something like this:
              .
              .
              .
              unix`z_compress_level+0x9a
              zfs`zfs_gzip_compress+0x4e
              zfs`zfs_compress_data+0x8c
              zfs`zio_compress+0x9f
              zfs`zio_write_bp_init+0x2b4
              zfs`zio_execute+0xc2
              genunix`taskq_thread+0x3ad
              unix`thread_start+0x8
              703

              unix`deflate_slow+0x8a
              unix`z_deflate+0x75a
              unix`z_compress_level+0x9a
              zfs`zfs_gzip_compress+0x4e
              zfs`zfs_compress_data+0x8c
              zfs`zio_compress+0x9f
              zfs`zio_write_bp_init+0x2b4
              zfs`zio_execute+0xc2
              genunix`taskq_thread+0x3ad
              unix`thread_start+0x8
              1708

              unix`i86_mwait+0xd
              unix`cpu_idle_mwait+0x1f3
              unix`idle+0x111
              unix`thread_start+0x8
              86200
That's an example set of stack traces and the number of times each stack trace was sampled. Note that it prints the most frequent stack traces last, so you can immediately see the stack trace(s) most frequently sampled - which is going to be where the kernel is spending a lot of time.

Note also that the stack traces are printed in what you may think is reverse order - the outer, topmost call is printed last.
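If the output is overwhelming on a busy kernel, one possible refinement (a sketch I haven't run on FreeBSD) is to keep only the most frequently sampled stacks using DTrace's trunc() action on the aggregation before it is printed:

```d
#!/usr/sbin/dtrace -s

profile:::profile-1001hz
/arg0/    /* arg0 != 0: the sample was taken in kernel mode */
{
    @[ stack() ] = count();
}

END
{
    /* Keep only the ten most frequently sampled stacks. */
    trunc(@, 10);
}
```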