Only 2 PERF_TYPE_HW_CACHE events in perf event group
I am working on a custom implementation on top of perf_event_open, and I need to monitor multiple PERF_TYPE_HW_CACHE events concurrently.
The Intel manual states that there are 4 programmable counters per thread (or 8 if Hyper-Threading is disabled) for my CPU's architecture. So I grouped the PERF_TYPE_HW_CACHE events of choice into a single perf event group containing 4 events (LLC_GROUP).
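For reference, a minimal sketch of how the attributes for such a group could be built. The config encoding (cache id, op id shifted by 8, result id shifted by 16) follows the perf_event_open(2) man page; the helper names hw_cache_config, init_cache_attr and open_counter are hypothetical and only for illustration, not the code actually used in the experiment.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* PERF_TYPE_HW_CACHE config layout, per perf_event_open(2):
 * cache id | (op id << 8) | (result id << 16). */
uint64_t hw_cache_config(uint64_t cache, uint64_t op, uint64_t result)
{
    return cache | (op << 8) | (result << 16);
}

/* Hypothetical helper: fill in one event of the group.
 * Only the leader starts disabled; siblings follow the leader's state. */
void init_cache_attr(struct perf_event_attr *attr, uint64_t config, int is_leader)
{
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HW_CACHE;
    attr->size = sizeof(*attr);
    attr->config = config;
    attr->disabled = is_leader ? 1 : 0;
    /* One read() on the leader returns every count plus both timers. */
    attr->read_format = PERF_FORMAT_GROUP |
                        PERF_FORMAT_TOTAL_TIME_ENABLED |
                        PERF_FORMAT_TOTAL_TIME_RUNNING;
}

/* Hypothetical thin wrapper: glibc has no perf_event_open() function,
 * so the raw syscall is used. group_fd is -1 for the leader and the
 * leader's fd for every sibling. Returns the new fd, or -1 on error. */
int open_counter(struct perf_event_attr *attr, pid_t tid, int group_fd)
{
    return (int)syscall(SYS_perf_event_open, attr, tid, -1, group_fd, 0UL);
}
```

With these helpers, HW_CACHE_LLC_READ_MISSES would correspond to hw_cache_config(PERF_COUNT_HW_CACHE_LL, PERF_COUNT_HW_CACHE_OP_READ, PERF_COUNT_HW_CACHE_RESULT_MISS), opened first as the leader, with the other three events opened as siblings that pass the leader's fd as group_fd.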
I ran a first experiment and got the following results:
LLC_GROUP of thread 2 | time Enabled: 3190370379, time Running: 3017
HW_CACHE_LLC_READ_MISSES = 0
HW_CACHE_LLC_WRITE_MISSES = 0
HW_CACHE_LLC_READS = 0
HW_CACHE_LLC_WRITES = 0
From the above results, it is clear that the PMU does not "fit" all 4 events. We also observe a "strange" multiplexing without actual results: the group is enabled but almost never runs, and every count is 0.
So, as a next step, I split the 4-event group into 2 groups of 2 events each (LLC_GROUP, LLC2_GROUP), and got the following results:
LLC_GROUP of thread 2 | time Enabled: 2772569406, time Running: 1396022331
HW_CACHE_LLC_READ_MISSES = 102117
HW_CACHE_LLC_WRITE_MISSES = 9624295
LLC2_GROUP of thread 2 | time Enabled: 2772571024, time Running: 1376575096
HW_CACHE_LLC_READS = 22020658
HW_CACHE_LLC_WRITES = 18156060
With this configuration, we again observe that the PMU doesn't "fit" 4 PERF_TYPE_HW_CACHE events concurrently, but this time the (expected) multiplexing is happening: each group runs for roughly half of the time it is enabled.
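When a group is multiplexed like this, the raw counts only cover the time the group was actually on the PMU. A common way to estimate the full-interval count is to scale by time enabled over time running. The sketch below assumes that convention; scale_count is a hypothetical helper name.

```c
#include <stdint.h>

/* Estimate the count over the whole enabled interval from a multiplexed
 * counter: raw * time_enabled / time_running.
 * Returns 0 when the event never ran, since no estimate is possible.
 * The arithmetic is done in double so that raw * time_enabled cannot
 * overflow 64 bits; the result is rounded to the nearest integer. */
uint64_t scale_count(uint64_t raw, uint64_t time_enabled, uint64_t time_running)
{
    if (time_running == 0)
        return 0;
    return (uint64_t)((double)raw * (double)time_enabled /
                      (double)time_running + 0.5);
}
```

For LLC_GROUP above, the inputs would be the raw count of each event together with time Enabled 2772569406 and time Running 1396022331, which roughly doubles each raw value.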
Does anyone have an explanation?
This behaviour looks very strange to me, since I'm able to monitor multiple PERF_TYPE_HARDWARE events (up to 6) without multiplexing, and I would expect the same for PERF_TYPE_HW_CACHE events as well.
Note that perf does allow measuring more than 2 PERF_TYPE_HW_CACHE events at the same time; the exception is the measurement of LLC-cache events.
The expectation is that, with 4 general-purpose and 3 fixed-purpose hardware counters, 4 HW cache events (which default to RAW events) can be measured in perf without multiplexing, even with Hyper-Threading ON.
sudo perf stat -e L1-icache-load-misses,L1-dcache-stores,L1-dcache-load-misses,dTLB-load-misses sleep 2
Performance counter stats for 'sleep 2':
26,893 L1-icache-load-misses
98,999 L1-dcache-stores
14,037 L1-dcache-load-misses
723 dTLB-load-misses
2.001732771 seconds time elapsed
0.001217000 seconds user
0.000000000 seconds sys
The problem appears when you try to measure events targeting the LLC-cache. perf seems able to measure only 2 LLC-cache specific events concurrently without multiplexing.
sudo perf stat -e LLC-load-misses,LLC-stores,LLC-store-misses,LLC-loads sleep 2
Performance counter stats for 'sleep 2':
2,419 LLC-load-misses # 0.00% of all LL-cache hits
2,963 LLC-stores
<not counted> LLC-store-misses (0.00%)
<not counted> LLC-loads (0.00%)
2.001486710 seconds time elapsed
0.001137000 seconds user
0.000000000 seconds sys
CPUs belonging to the Skylake/Kaby Lake family of microarchitectures, and some others, allow you to measure OFFCORE_RESPONSE events. Monitoring OFFCORE_RESPONSE events requires programming extra MSRs, specifically MSR_OFFCORE_RSP0 (MSR address 1A6H) and MSR_OFFCORE_RSP1 (MSR address 1A7H), in addition to programming a pair of IA32_PERFEVTSELx and IA32_PMCx registers.
Each pair of IA32_PERFEVTSELx and IA32_PMCx registers will be associated with one of the above MSRs to measure LLC-cache events.
The OFFCORE_RESPONSE MSRs are defined in the kernel as follows:
static struct extra_reg intel_skl_extra_regs[] __read_mostly = {
    INTEL_UEVENT_EXTRA_REG(0x01b7, MSR_OFFCORE_RSP_0, 0x3fffff8fffull, RSP_0),
    INTEL_UEVENT_EXTRA_REG(0x01bb, MSR_OFFCORE_RSP_1, 0x3fffff8fffull, RSP_1),
    ........
}
0x01b7 in the INTEL_UEVENT_EXTRA_REG call refers to event code b7 and umask 01. This event code, 0x01b7, maps to LLC-cache events, as can be seen here:
[ C(LL ) ] = {
    [ C(OP_READ) ] = {
        [ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
        [ C(RESULT_MISS) ] = 0x1b7, /* OFFCORE_RESPONSE */
    },
    [ C(OP_WRITE) ] = {
        [ C(RESULT_ACCESS) ] = 0x1b7, /* OFFCORE_RESPONSE */
        [ C(RESULT_MISS) ] = 0x1b7, /* OFFCORE_RESPONSE */
    },
    [ C(OP_PREFETCH) ] = {
        [ C(RESULT_ACCESS) ] = 0x0,
        [ C(RESULT_MISS) ] = 0x0,
    },
},
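To make the 0x01b7 encoding concrete, here is a small sketch of how such a raw config splits into its two fields. It assumes the usual Intel raw-event layout (event select in bits 0-7, umask in bits 8-15); the helper names are hypothetical and do not come from the kernel.

```c
#include <stdint.h>

/* Event select: low byte of the raw config (0xb7 for 0x01b7). */
uint8_t raw_event_select(uint64_t config)
{
    return (uint8_t)(config & 0xff);
}

/* Unit mask: second byte of the raw config (0x01 for 0x01b7). */
uint8_t raw_umask(uint64_t config)
{
    return (uint8_t)((config >> 8) & 0xff);
}

/* Inverse: combine an event select and a umask into a raw config. */
uint64_t raw_config(uint8_t event_select, uint8_t umask)
{
    return ((uint64_t)umask << 8) | event_select;
}
```

Under this layout, 0x01b7 and 0x01bb share umask 01 and differ only in the event code (b7 versus bb), which matches the two INTEL_UEVENT_EXTRA_REG entries shown earlier.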
The event 0x01b7 will always map to MSR_OFFCORE_RSP_0. The function that performs this mapping loops through the array of all the "extra registers" and associates event->config (which contains the raw event id) with the offcore response MSR.
So, this would mean only one event can be measured at a time, since only one MSR, MSR_OFFCORE_RSP_0, can be mapped to an LLC-cache event. But that is not the case!
The offcore registers are symmetric in nature, so when the first MSR, MSR_OFFCORE_RSP_0, is busy, perf uses the alternative MSR, MSR_OFFCORE_RSP_1, to measure another offcore LLC event. The following function helps in doing that:
static int intel_alt_er(int idx, u64 config)
{
    int alt_idx = idx;

    if (!(x86_pmu.flags & PMU_FL_HAS_RSP_1))
        return idx;

    if (idx == EXTRA_REG_RSP_0)
        alt_idx = EXTRA_REG_RSP_1;

    if (idx == EXTRA_REG_RSP_1)
        alt_idx = EXTRA_REG_RSP_0;

    if (config & ~x86_pmu.extra_regs[alt_idx].valid_mask)
        return idx;

    return alt_idx;
}
The presence of only 2 offcore registers on the Kaby Lake family of microarchitectures hinders the ability to measure more than 2 LLC-cache events concurrently without multiplexing.
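The effect can be illustrated with a toy user-space model. This is not the kernel's scheduling code: it only assumes that each offcore event needs one of the two response registers, that the alternate register is tried when the first is busy, and that two events may share a register only when they program the same value into it. All names and the numeric values below are hypothetical placeholders.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { RSP_0 = 0, RSP_1 = 1, NUM_RSP = 2 };

struct rsp_slot {
    bool     busy;   /* register already claimed by some event */
    uint64_t value;  /* value that event programmed into it */
};

/* Try RSP_0 first, then fall back to RSP_1 (the "alternate" register).
 * Returns the index of the register used, or -1 if neither is usable. */
int claim_rsp(struct rsp_slot slots[NUM_RSP], uint64_t value)
{
    for (int idx = RSP_0; idx < NUM_RSP; idx++) {
        if (!slots[idx].busy || slots[idx].value == value) {
            slots[idx].busy = true;
            slots[idx].value = value;
            return idx;
        }
    }
    return -1;
}

/* Count how many of the n events get a register in a single pass,
 * i.e. how many could be counted at the same time in this model. */
size_t count_schedulable(const uint64_t *values, size_t n)
{
    struct rsp_slot slots[NUM_RSP] = { { false, 0 }, { false, 0 } };
    size_t scheduled = 0;

    for (size_t i = 0; i < n; i++) {
        if (claim_rsp(slots, values[i]) >= 0)
            scheduled++;
    }
    return scheduled;
}
```

In this model, four events with four different register values yield only two schedulable events, which mirrors the two counted and two <not counted> LLC events in the perf stat output above.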