How to defeat the hardware prefetcher on Core i3/i7 in Linux

I am trying to keep the hardware prefetcher from detecting a stream pattern: I want to access 4KB of data in a random order so that the accesses are not detected and the data is not prefetched.

Initially I planned to access all the even-indexed data in a random order, reasoning that the hardware prefetcher always prefetches the next cache line (so when I access an even index, the data at the next odd index is already prefetched).

I wrote code that accesses all the even-indexed data in a random pattern, but the results indicate that the prefetcher still detected the pattern. I don't know how: there is no fixed stride, every stride is random.

While investigating why this happened, I found this thread on the Intel forums: https://software.intel.com/en-us/forums/topic/473493

According to John D. McCalpin, PhD ("Dr. Bandwidth"):

In section 2.2.5.4 of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-028, July 2013), it states that the

streamer prefetcher "[d]etects and maintains up to 32 streams of data accesses. For each 4K byte page, one forward and one backward stream can be maintained."

This implies that the L2 hardware prefetcher tracks the 16 4 KiB pages most recently accessed and remembers enough of the access patterns for those pages to track one forward stream and one backward stream. So to defeat the L2 streamer prefetcher with "random" fetches, simply ensure that you access more than 15 other 4 KiB pages before you make a second reference to a previously referenced page. So a "random" sequence of fetches might be composed of a random permutation of more than 16 4 KiB page numbers with a random offset within each page. (I typically use at least 32 pages in my permutation list.)
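The recipe quoted above can be sketched in C as follows. This is only an illustration under stated assumptions, not code from the forum thread: the names `PAGE_BYTES`, `LINE_BYTES`, `shuffle_pages` and `build_access_sequence` are hypothetical, 4 KiB pages and 64-byte lines are assumed, and the caller is expected to seed `rand()` once.

```c
#include <stddef.h>
#include <stdlib.h>

#define PAGE_BYTES 4096u  /* assumed 4 KiB page size */
#define LINE_BYTES 64u    /* assumed 64-byte cache line */

/* Fisher-Yates shuffle of page numbers; the caller seeds rand() once. */
void shuffle_pages(size_t *pages, size_t n)
{
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;  /* slight modulo bias, acceptable for a sketch */
        size_t t = pages[i - 1];
        pages[i - 1] = pages[j];
        pages[j] = t;
    }
}

/* Fill offsets[0..n_pages-1] with one byte offset per page: every page
 * appears exactly once, in random order, at a random line-aligned
 * position inside the page. Returns 0 on success, -1 if malloc fails. */
int build_access_sequence(size_t *offsets, size_t n_pages)
{
    size_t *pages = malloc(n_pages * sizeof *pages);
    if (pages == NULL)
        return -1;
    for (size_t i = 0; i < n_pages; i++)
        pages[i] = i;
    shuffle_pages(pages, n_pages);
    for (size_t i = 0; i < n_pages; i++) {
        size_t line = (size_t)rand() % (PAGE_BYTES / LINE_BYTES);
        offsets[i] = pages[i] * PAGE_BYTES + line * LINE_BYTES;
    }
    free(pages);
    return 0;
}
```

With `n_pages` set to 32, one pass over `offsets` touches each page once. If the same permutation is replayed on every pass, two references to the same page are always separated by 31 other pages; reshuffling between passes can put the same page at the end of one pass and near the start of the next, which shortens that gap.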

So between two accesses to different random indexes within the same 4KB page, we need to access at least 16 other 4KB pages to defeat the hardware prefetcher.

I implemented the approach suggested by John D. McCalpin, but the results again show that the hardware prefetcher is not defeated: it is still able to detect some pattern and prefetch data (see the sample output). I varied the number of accessed pages from 20 to 40 4KB pages, with no change in the results.

Here is my code:

#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <time.h>               /* time(), used to seed rand() in shuffle() */

#ifndef _POSIX_THREAD_PROCESS_SHARED
#error This system does not support process shared mutex
#endif

#define MAX_COUNT 3000
#define INDEX (40*1024) // size of DUMMY 40 4KB pages

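/* static: a plain C99 "inline" function emits no out-of-line definition,
   so the call fails to link at -O0 */
static inline void clflush(volatile void *p)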
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

unsigned long probe(char *adrs) {
  volatile unsigned long time;
  asm __volatile__ (
    " mfence              \n"
    " lfence              \n"
    " rdtsc               \n"
    " lfence              \n"
    " movl %%eax, %%esi \n"
    " movl (%1), %%eax     \n"
    " lfence              \n"
    " rdtsc               \n"
    " subl %%esi, %%eax \n"
    " clflush 0(%1)       \n"
    : "=a" (time)
    : "c" (adrs)
    : "%esi", "%edx");
  return time;
}

void shuffle(int *arr, size_t n)
{
    if (n > 1) 
    {
        size_t i;
        srand(time(NULL));
        for (i = 0; i < n - 1; i++) 
        {
          size_t j = i + rand() / (RAND_MAX / (n - i) + 1);
          int t = arr[j];
          arr[j] = arr[i];
          arr[i] = t;
        }
    }
}


/* DATA holds the values 0..1023 in order; the macros below expand to that list */
#define SEQ4(n)   (n), (n)+1, (n)+2, (n)+3
#define SEQ16(n)  SEQ4(n), SEQ4((n)+4), SEQ4((n)+8), SEQ4((n)+12)
#define SEQ64(n)  SEQ16(n), SEQ16((n)+16), SEQ16((n)+32), SEQ16((n)+48)
#define SEQ256(n) SEQ64(n), SEQ64((n)+64), SEQ64((n)+128), SEQ64((n)+192)
static const int DATA[1024]={SEQ256(0), SEQ256(256), SEQ256(512), SEQ256(768)};

int main(int argc, char *argv[])
{

    int counter=0,k=0;
    unsigned long Access_Time[MAX_COUNT][64]={0};   
    int DUMMY[INDEX];// dummy array of 40 * 4KB ;  

    //Initialize
    for(k=0;k<INDEX;k++)
        DUMMY[k]=k;

    //access it to check segmentation fault is happening or not
    for(k=0;k<INDEX;k++)
        DUMMY[k]+=k;

    // even index in random order
    int index[32]={4,8,16,32,54,34,62,50,26,52,30,60,46,18,36,58,42,10,20,40,6,12,24,48,22,44,14,28,56,38,2,0};

    int TOTAL_RANDOM_PAGE=40;

    int i,PAGE[TOTAL_RANDOM_PAGE]; // PAGE holds the page numbers of the 40 pages that are accessed in random order to defeat the prefetcher
    for (i=0; i<TOTAL_RANDOM_PAGE; i++)
    {
        PAGE[i] = i;
    }

    shuffle(PAGE, TOTAL_RANDOM_PAGE); // PAGE now have page no in random order

    FILE *fp2;
    int s,s1;
    int random_index=0,sum=0;

    const int *p0=&DATA[0];
    for (s=0;s<64;s++)
    {
        clflush((void *)(p0+s*16));
    }

    while(counter<MAX_COUNT)
    {               
        // Find Access time for Even Index
        for (s=0;s<32;s++)
        {

            // Access a random index
            Access_Time[counter][index[s]]=probe((char *)(p0+16*index[s]));

            // Now access one location in each of 40 other 4KB pages
            shuffle(PAGE, TOTAL_RANDOM_PAGE); // randomize the page order
            for(random_index=0;random_index<TOTAL_RANDOM_PAGE;random_index++)
            {
                DUMMY[1024*PAGE[random_index]+16*PAGE[random_index]]=2*DUMMY[1024*PAGE[random_index]+16*PAGE[random_index]];
            }

        }// end of for loop     

        // Flush all DATA from cache        
        for (s1=0;s1<64;s1++)
        {
            clflush((void *)(p0+s1*16));
        }
     counter++;

    }// end of while loop

    fp2=fopen("All_access_time.txt","a");

    int index4;
    for(counter=0;counter<MAX_COUNT;counter++)
    {
        for (index4=0;index4<64;index4++)
        {
            if(Access_Time[counter][index4]>0 && Access_Time[counter][index4]<200)
            fprintf(fp2,"%d,%d,%lu\n",counter,index4,Access_Time[counter][index4]);             
        }
    }

    fclose(fp2);
    return 0;
}

Another interesting observation: the random indexes that were prefetched have access times of around 35-70 ticks (see the sample output).

On my system, the L1 access time is 36-44 ticks, the L2 access time is 50-70 ticks, and the L3 access time is 90-120 ticks.
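To read the sample output against those ranges, a small helper like the one below can label each measurement. It is only a sketch: the enum and the function name `classify_ticks` are hypothetical, and the thresholds are simply the tick ranges quoted above for this particular machine.

```c
/* Cache level suggested by a measured access time, using the tick
 * ranges quoted in the question (machine-specific, not universal). */
enum cache_level { LVL_L1, LVL_L2, LVL_L3, LVL_UNKNOWN };

enum cache_level classify_ticks(unsigned long ticks)
{
    if (ticks >= 36 && ticks <= 44)
        return LVL_L1;
    if (ticks >= 50 && ticks <= 70)
        return LVL_L2;
    if (ticks >= 90 && ticks <= 120)
        return LVL_L3;
    return LVL_UNKNOWN;  /* falls between or outside the quoted ranges */
}
```

With these ranges, readings of 52 and 56 ticks are labelled L2, while readings of 72 or 76 ticks fall just outside the quoted L2 range and come back as `LVL_UNKNOWN`.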

The experiments were run on both an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz and an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, with similar results.

A few internal details of the systems:

L1-D = 32KB, ways_of_associative=8
L1-I = 32KB, ways_of_associative=8
L2 = 256KB, ways_of_associative=8
L3 = 3072KB (Core-i3), ways_of_associative=12
L3 = 8192KB (Core-i7), ways_of_associative=16
Cache line size=64Bytes
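From these parameters the number of sets in each cache follows as capacity / (ways × line size). The sketch below works that out; the function names `cache_sets` and `cache_set_index` are hypothetical, and the plain modulo indexing is a simplifying assumption (it ignores any address hashing the L3 may apply).

```c
#include <stddef.h>

/* Number of sets = capacity / (ways * line size). */
size_t cache_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    return size_bytes / (ways * line_bytes);
}

/* Set index under simple modulo indexing: line number modulo set count.
 * This is an assumption; real L3 caches may hash the address instead. */
size_t cache_set_index(size_t addr, size_t n_sets, size_t line_bytes)
{
    return (addr / line_bytes) % n_sets;
}
```

For the values listed above, this gives 64 sets for L1-D, 512 sets for L2, 4096 sets for the 3072KB L3 and 8192 sets for the 8192KB L3. With 64 L1 sets, addresses that are 4096 bytes apart land in the same L1 set under this simple model.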

Can you please help me understand why the hardware prefetcher is able to detect my random pattern? Where am I making a mistake?

How should I write the code so that the hardware prefetcher is unable to prefetch my data?

NOTE: I disabled software prefetch optimization by compiling with gcc's -O0 option.

Sample output:

(counter,index,access_time)
30,8,56
30,18,72
30,20,52
30,28,72
30,34,72
30,36,72
30,38,72
30,40,72
30,42,72
31,8,52
31,18,56
31,20,52
31,28,72
31,34,52
31,36,72
31,38,56
31,40,72
31,42,52
31,60,56
32,8,52
32,18,72
32,20,52
32,28,52
32,34,72
32,36,52
32,38,72
32,40,52
32,42,52
32,48,52
33,8,56
33,18,72
33,20,52
33,28,72
33,34,52
33,36,72
33,38,72
33,40,52
33,42,72
34,8,72
34,18,52
34,20,72
34,28,72
34,34,72
34,36,52
34,38,76
34,40,72
34,42,76
34,60,72

If you are brave enough to write a kernel module, you can do what you want.

Like almost all features of the Core CPUs, the hardware prefetching logic can be disabled for debugging purposes.

Hardware prefetching is controlled by the Model Specific Register IA32_MISC_ENABLE (0x1a0). Just set bit 9 of this register, and the prefetcher goes off.

For more information, please check the "Intel® 64 and IA-32 Architectures Software Developer's Manual". A search for IA32_MISC_ENABLE will bring you to the correct chapter.

A search of the Linux kernel source for the same keyword also gives a few hits. They are not related to prefetching, but the code makes good boilerplate because it shows how to read and write the IA32_MISC_ENABLE register from the kernel.
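For a quick experiment without a kernel module, the same register can also be reached from user space through the Linux msr driver, which exposes each CPU's MSRs as /dev/cpu/<n>/msr (see msr(4)). The sketch below is an assumption-laden illustration rather than a tested recipe: it assumes the msr module is loaded and the process runs as root, the helper names `msr_set_bit`, `msr_read` and `msr_write` are hypothetical, and the register number 0x1a0 and bit 9 are taken from the text above. Which bits are implemented is model-specific, so verify them in the manual for your CPU before writing anything.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_MISC_ENABLE 0x1a0u  /* register number from the text above */
#define PREFETCH_DISABLE_BIT 9u      /* bit number from the text above */

/* Return value with the given bit set; all other bits stay unchanged. */
uint64_t msr_set_bit(uint64_t value, unsigned bit)
{
    return value | (UINT64_C(1) << bit);
}

/* Open /dev/cpu/<cpu>/msr and seek to the register number (the file
 * offset selects the MSR). Returns a file descriptor or -1. */
static int msr_open(int cpu, uint32_t reg, int flags)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, flags);
    if (fd < 0)
        return -1;
    if (lseek(fd, (off_t)reg, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read one 64-bit MSR. Returns 0 on success, -1 on any error. */
int msr_read(int cpu, uint32_t reg, uint64_t *out)
{
    int fd = msr_open(cpu, reg, O_RDONLY);
    if (fd < 0)
        return -1;
    int ok = read(fd, out, sizeof *out) == (ssize_t)sizeof *out;
    close(fd);
    return ok ? 0 : -1;
}

/* Write one 64-bit MSR. Returns 0 on success, -1 on any error. */
int msr_write(int cpu, uint32_t reg, uint64_t value)
{
    int fd = msr_open(cpu, reg, O_WRONLY);
    if (fd < 0)
        return -1;
    int ok = write(fd, &value, sizeof value) == (ssize_t)sizeof value;
    close(fd);
    return ok ? 0 : -1;
}
```

A cautious use would read the register with `msr_read(0, MSR_IA32_MISC_ENABLE, &v)`, print the old value, and only then call `msr_write(0, MSR_IA32_MISC_ENABLE, msr_set_bit(v, PREFETCH_DISABLE_BIT))`, so that every other bit in the register is written back unchanged. MSRs are per logical CPU, so the write has to be repeated for each CPU the benchmark may run on.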

If you go this way, double and triple check what you're doing. You don't want to accidentally disable the thermal monitors. They are located in MISC_ENABLE as well :-)
