
Multiplying large matrices is much slower with contiguous memory allocation

While implementing a neural network, I noticed that if I allocate the data set arrays as a single contiguous block of memory, execution time increases several times over.

Compare these two methods of memory allocation:

float** alloc_2d_float(int rows, int cols, int contiguous)
{
    int i;
    float** array = malloc(rows * sizeof(float*));

    if(contiguous)
    {
        float* data = malloc(rows*cols*sizeof(float));
        assert(data && "Can't allocate contiguous memory");

        for(i=0; i<rows; i++)
            array[i] = &(data[cols * i]);
    }
    else
        for(i=0; i<rows; i++)
        {
            array[i] = malloc(cols * sizeof(float));
            assert(array[i] && "Can't allocate memory");
        }

    return array;
}
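
The two layouts also have to be released differently, which is easy to get wrong when switching between them. A minimal sketch of a matching deallocation helper follows. The name free_2d_float is hypothetical (it is not in the original code), rows is assumed to be greater than zero, and alloc_2d_float is repeated only so that the sketch compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>

/* Same allocation scheme as alloc_2d_float above, repeated so the sketch
   compiles on its own. Assumes rows > 0 and cols > 0. */
float **alloc_2d_float(int rows, int cols, int contiguous)
{
    float **array = malloc(rows * sizeof(float *));
    assert(array && "Can't allocate row pointers");

    if (contiguous) {
        /* One block for all rows; cast first so the product is computed in size_t. */
        float *data = malloc((size_t)rows * cols * sizeof(float));
        assert(data && "Can't allocate contiguous memory");
        for (int i = 0; i < rows; i++)
            array[i] = data + (size_t)cols * i;
    } else {
        for (int i = 0; i < rows; i++) {
            array[i] = malloc(cols * sizeof(float));
            assert(array[i] && "Can't allocate memory");
        }
    }
    return array;
}

/* Hypothetical counterpart: release whatever alloc_2d_float returned. */
void free_2d_float(float **array, int rows, int contiguous)
{
    if (contiguous) {
        free(array[0]);          /* array[0] is the start of the single data block */
    } else {
        for (int i = 0; i < rows; i++)
            free(array[i]);      /* each row was its own allocation */
    }
    free(array);                 /* the row-pointer table itself */
}
```

The caller has to pass the same contiguous flag to both functions, since nothing in the returned pointer table records which layout was used.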

Here are the results when compiling with -march=native -Ofast (I tried both gcc and clang):

michael@Pascal:~/NN$ time ./test 300 1 0

Multiplying (100000, 1000) and (300, 1000) arrays 1 times, noncontiguous memory allocation.

Allocating memory:    0.2 seconds
Initializing arrays: 0.8 seconds
Dot product:         3.3 seconds

real    0m4.296s
user    0m4.108s
sys     0m0.188s

michael@Pascal:~/NN$ time ./test 300 1 1

Multiplying (100000, 1000) and (300, 1000) arrays 1 times, contiguous memory allocation.

Allocating memory:    0.0 seconds
Initializing arrays: 40.3 seconds
Dot product:         13.5 seconds    

real    0m53.817s
user    0m4.204s
sys     0m49.664s

Here's the code: https://github.com/michaelklachko/NN/blob/master/test.c

Note that both initialization and the dot product are much slower with contiguous memory.
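
In the contiguous run, `time` reports most of the elapsed time as sys rather than user, so it can help to record wall-clock, user and system time separately for each phase. The following is a minimal sketch, assuming a POSIX system that provides clock_gettime and getrusage. The type phase_clock and the function phase_now are hypothetical names, not part of the original test.c.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime under strict -std=c99/c11 */
#endif
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical snapshot of the three clocks that `time` reports. */
typedef struct {
    double wall;  /* monotonic wall-clock time, seconds */
    double user;  /* CPU time spent in user mode, seconds */
    double sys;   /* CPU time spent in the kernel, seconds */
} phase_clock;

/* Take a snapshot; subtract two snapshots to get the cost of one phase. */
phase_clock phase_now(void)
{
    struct timespec ts;
    struct rusage ru;
    phase_clock t;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    getrusage(RUSAGE_SELF, &ru);

    t.wall = (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
    t.user = (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec * 1e-6;
    t.sys  = (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec * 1e-6;
    return t;
}
```

Calling phase_now() before and after the allocation, initialization and dot-product phases, and subtracting field by field, would show which phase accounts for the system time.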

I expected the opposite: a contiguous block of memory should be more cache-friendly than a large number of separate small blocks. Or at least the two should perform similarly (this machine has 64GB of RAM, and 90% of it is unused).
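
For reference, this is the layout that expectation refers to: in a contiguous row-major block, element (i, j) lives at data[i * cols + j], so the row-pointer table is optional. The sketch below shows the same product over flat arrays. The function name matmul_nt_flat and its parameter names are hypothetical, and it is only an illustration of the layout, not a claimed fix for the slowdown.

```c
#include <stddef.h>

/* result[i][k] = sum over j of a[i][j] * b[k][j], with all three matrices
   stored row-major in flat arrays: a is n x m, b is p x m, result is n x p. */
void matmul_nt_flat(const float *a, const float *b, float *result,
                    size_t n, size_t m, size_t p)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < p; k++) {
            float temp = 0.0f;
            for (size_t j = 0; j < m; j++)
                temp += a[i * m + j] * b[k * m + j];
            result[i * p + k] = temp;
        }
}
```

With the sizes from the question, n would be 100000, m would be 1000 and p would be 300.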

EDIT: Here's the compressed, self-contained code (I still recommend using the GitHub version instead, which has the measuring and formatting statements):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

float** alloc_2d_float(int rows, int cols, int contiguous){
    int i;
    float** array = malloc(rows * sizeof(float*));
    if(contiguous){
        float* data = malloc(rows*cols*sizeof(float));
        for(i=0; i<rows; i++)
            array[i] = &(data[cols * i]);
    }
    else
        for(i=0; i<rows; i++)
            array[i] = malloc(cols * sizeof(float));
    return array;
}

void initialize(float** array, int dim1, int dim2){
    srand(time(NULL));
    int i, j;
    for(i=0; i<dim1; i++)
        for(j=0; j<dim2; j++)
            array[i][j] = (float)rand()/RAND_MAX;  /* cast first: rand()/RAND_MAX is integer division and is almost always 0 */
}

int main(){
    int i,j,k, dim1=100000, dim2=1000, dim3=300;
    int contiguous=0;
    float temp;

    float** array1 = alloc_2d_float(dim1, dim2, contiguous);
    float** array2 = alloc_2d_float(dim3, dim2, contiguous);
    float** result = alloc_2d_float(dim1, dim3, contiguous);

    initialize(array1, dim1, dim2);
    initialize(array2, dim3, dim2);

    for(i=0; i<dim1; i++)
        for(k=0; k<dim3; k++){
            temp = 0;
            for(j=0; j<dim2; j++)
                temp += array1[i][j] * array2[k][j];
            result[i][k] = temp;
        }
}

It looks like the outcome depends on whether your compiler manages to vectorize the code. I tried to reproduce your experiment without success:

mick@mick-laptop:~/Загрузки$ ./a.out 100 1 0

Multiplying (100000, 1000) and (100, 1000) arrays 1 times, noncontiguous memory allocation.

Initializing arrays...

Multiplying arrays...

Execution Time:
Allocating memory: 0.1 seconds
Initializing arrays: 0.9 seconds
Dot product: 44.8 seconds

mick@mick-laptop:~/Загрузки$ ./a.out 100 1 1

Multiplying (100000, 1000) and (100, 1000) arrays 1 times, contiguous memory allocation.

Initializing arrays...

Multiplying arrays...

Execution Time:
Allocating memory: 0.0 seconds
Initializing arrays: 1.0 seconds
Dot product: 46.3 seconds
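
On the vectorization point at the start of this answer: one way to check what the compiler did is to isolate the inner loop and ask for a vectorization report. The sketch below is a hedged illustration. The function name dot is hypothetical, and the compile commands in the comment use report options that gcc and clang provide, but their output format differs between compiler versions.

```c
#include <stddef.h>

/* Hypothetical kernel: the inner j-loop of the question's dot product,
   isolated so that a vectorization report refers to a single loop. */
float dot(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    for (size_t j = 0; j < n; j++)
        sum += x[j] * y[j];
    return sum;
}

/* Possible ways to ask the compiler what it did with the loop
 * (assuming this function is saved as dot.c):
 *   gcc   -Ofast -march=native -fopt-info-vec-optimized -fopt-info-vec-missed -c dot.c
 *   clang -Ofast -march=native -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c dot.c
 */
```

A float sum like this is a reduction, and compilers generally vectorize it only when they are allowed to reorder the additions, which -Ofast permits through -ffast-math. Comparing the report at -O3 and at -Ofast should show whether that applies to a given compiler.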
