
Required time to offload a function to Intel Xeon Phi

Is there a predefined amount of time that an offload call needs to transfer a function's data (parameters) from the host to an Intel MIC (Xeon Phi coprocessor 3120 series)?

Specifically, I make an offload call ("#pragma offload target(mic)") for a function that I want to run on the MIC. The function takes 15 parameters (pointers and scalar variables), and I have already confirmed that they are passed to the MIC correctly. To measure only the time spent passing the parameters, I have reduced the function body to a single "printf()" call. I measure time with "gettimeofday()" from the "sys/time.h" header file, as shown in the code below:

Host hardware and software: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz / CentOS release 6.8 / PCI Express Revision 2.0

main.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>

__attribute__ (( target (mic))) unsigned long long ForSolution = 0;
__attribute__ (( target (mic))) unsigned long long sufficientSol = 1;
__attribute__ (( target (mic))) float timer = 0.0;

__attribute__ (( target (mic))) void function(float *grid, float *displ, unsigned long long *li, unsigned long long *repet, float *solution, unsigned long long dim, unsigned long long numOfa, unsigned long long numLoops, unsigned long long numBlock, unsigned long long thread, unsigned long long blockGrid, unsigned long long station, unsigned long long bytesSol, unsigned long long totalSol, volatile unsigned long long *prog);

   float    *grid, *displ, *solution;
   unsigned long long   *li, *repet;   /* repet is passed as a pointer below */
   volatile unsigned long long  *prog;
   unsigned long long dim = 10, grid_a = 3, numLoops = 2, numBlock = 0;
   unsigned long long thread = 220, blockGrid = 0, station = 12;
   unsigned long long station_at = 8, bytesSol, totalSol;

   bytesSol = dim*sizeof(float);
   totalSol = ((1024 * 1024 * 1024) / bytesSol) * bytesSol;



   /******** Some memcpy() functions here for the pointers*********/                   



   struct timeval start, end;   /* host-side timestamps around the offload */

   gettimeofday(&start, NULL);

   #pragma offload target(mic) \
        in(grid:length(dim * grid_a * sizeof(float))) \
        in(displ:length(station * station_at * sizeof(float))) \
        in(li:length(dim * sizeof(unsigned long long))) \
        in(repet:length(dim * sizeof(unsigned long long))) \
        out(solution:length(totalSol/sizeof(float))) \
        in(dim,grid_a,numLoops,numBlock,thread,blockGrid,station,bytesSol,totalSol) \
        in(prog:length(sizeof(volatile unsigned long long))) \
        inout(ForSolution,sufficientSol,timer)
   {
        function(grid, displ, li, repet, solution, dim, grid_a, numLoops, numBlock, thread, blockGrid, station, bytesSol, totalSol, prog);
   }

    gettimeofday(&end, NULL);  

    printf("Time to tranfer data on Intel Xeon Phi: %f sec\n", (((end.tv_sec - start.tv_sec) * 1000000.0 + (end.tv_usec - start.tv_usec)) / 1000000.0) - timer);
    printf("Time for calculations: %f sec\n", timer);

function.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>
#include <omp.h>

void function(float *grid, float *displ, unsigned long long *li, unsigned long long *repet, float *solution, unsigned long long dim, unsigned long long numOfa, unsigned long long numLoops, unsigned long long numBlock, unsigned long long thread, unsigned long long blockGrid, unsigned long long station, unsigned long long bytesSol, unsigned long long totalSol, volatile unsigned long long *prog)
{
    struct timeval      timer_start, timer_end;

    gettimeofday(&timer_start, NULL);

    printf("Hello World!!!\n");


    gettimeofday(&timer_end, NULL);

    timer = ((timer_end.tv_sec - timer_start.tv_sec) * 1000000.0 + (timer_end.tv_usec - timer_start.tv_usec)) / 1000000.0 ;  
}

Terminal output:

Time to tranfer data on Intel Xeon Phi: 3.512706 sec
Time for calculations: 0.000002 sec
Hello World!!!

The code needs 3.5 seconds to complete the "offload target" region. Is this result normal? Is there any way to reduce this significant delay in the offload call?

Let's look at the steps here:

a) For the very first #pragma offload the MIC is initialised; this probably includes resetting it, booting a stripped-down Linux (and waiting for it to start all the CPUs, initialise its memory management, start a pseudo-NIC driver, etc.), and uploading your code to the device. This alone probably takes multiple seconds.

b) All the input data is uploaded to the MIC.

c) The function is executed.

d) All the output data is downloaded from the MIC.

For raw data transfers over PCI Express Revision 2.0 (x16) the maximum bandwidth is 8 GB/s; however, you're not going to get the maximum bandwidth. From what I remember, communication with the Phi involves shared ring buffers and "doorbell" IRQs, with "pseudo-NIC" drivers on both sides (on the host and on the coprocessor's OS); with all the handshaking and overhead, I'd be surprised if you got half the maximum bandwidth.

I think that the total amount of code uploaded, data uploaded and data downloaded is well over 1 GiB (e.g. the out(solution:length(totalSol/sizeof(float))) clause is 1 GiB all by itself). If we assume "about 4 GiB/s", that's at least another ~250 ms.

My suggestion is to do everything twice, and measure the difference between the first time (which includes initialising everything) and the second time (when everything is already initialised) to determine how long it takes to initialise the coprocessor. The second measurement (minus the time to execute the function) will tell you how long the data transfers took.
