Faster way to initialize arrays via empty matrix multiplication? (Matlab)

I've stumbled upon the weird way (in my view) that Matlab deals with empty matrices. For example, if two empty matrices are multiplied, the result is:

zeros(3,0)*zeros(0,3)
ans =

 0     0     0
 0     0     0
 0     0     0

Now, this already took me by surprise; however, a quick search got me to the link above, and I got an explanation of the somewhat twisted logic of why this is happening.

However, nothing prepared me for the following observation. I asked myself: how efficient is this type of multiplication compared to just using the zeros(n) function, say for the purpose of initialization? I've used timeit to answer this:

f=@() zeros(1000)
timeit(f)
ans =
    0.0033

vs:

g=@() zeros(1000,0)*zeros(0,1000)
timeit(g)
ans =
    9.2048e-06

Both produce the same outcome, a 1000x1000 matrix of zeros of class double, but the empty-matrix-multiplication version is ~350 times faster! (A similar result is obtained using tic and toc in a loop.)
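For reference, here is a minimal sketch of what such a tic/toc check might look like (the repetition count and variable names are illustrative, not the exact code used):

reps = 100;
tic
for k = 1:reps
    z = zeros(1000);                      % plain preallocation
end
t_zeros = toc/reps;

tic
for k = 1:reps
    z = zeros(1000,0)*zeros(0,1000);      % empty matrix multiplication
end
t_mult = toc/reps;

fprintf('zeros: %.3g s per call, empty-mult: %.3g s per call\n', t_zeros, t_mult);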

How can this be? Are timeit or tic,toc bluffing, or have I found a faster way to initialize matrices? (This was done with Matlab 2012a, on a Win7-64 machine, Intel i5 650 3.2GHz...)

EDIT:

After reading your feedback, I have looked more carefully into this peculiarity, and tested, on two different computers (same Matlab version, 2012a), code that examines the run time vs. the size of the matrix n. This is what I get:

[Figure: log-log plot of run time [sec] vs. matrix size (n) for zeros(n) and zeros(n,0)*zeros(0,n)]

The code that generated this used timeit as before, but a loop with tic and toc looks the same. So, for small sizes, zeros(n) is comparable. However, around n=400 there is a jump in performance for the empty matrix multiplication. The code I used to generate that plot was:

n=unique(round(logspace(0,4,200)));
for k=1:length(n)
    f=@() zeros(n(k));
    t1(k)=timeit(f);

    g=@() zeros(n(k),0)*zeros(0,n(k));
    t2(k)=timeit(g);
end

loglog(n,t1,'b',n,t2,'r');
legend('zeros(n)','zeros(n,0)*zeros(0,n)',2);
xlabel('matrix size (n)'); ylabel('time [sec]');

Do any of you experience this too?

EDIT #2:

Incidentally, empty matrix multiplication is not needed to get this effect. One can simply do:

z(n,n)=0;

where n > some threshold matrix size seen in the previous graph, and get exactly the same efficiency profile as with empty matrix multiplication (again using timeit).

[Figure: run time vs. matrix size n, comparing zeros(n) with z(n,n)=0]
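Since an indexed assignment cannot appear in an anonymous function, adding this variant to the timeit loop above would need a small helper; a possible sketch (grow_zeros and t3 are hypothetical names, not part of the original code):

function z = grow_zeros(m)
% grow_zeros - return an m-by-m all-zero matrix created by out-of-bounds assignment
z(m,m) = 0;
end

% In the timing loop above, alongside t1 and t2:
%   t3(k) = timeit(@() grow_zeros(n(k)));
% and then, e.g.:
%   loglog(n, t1, 'b', n, t2, 'r', n, t3, 'g');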

Here's an example where it improves the efficiency of some code:

n = 1e4;
clear z1
tic
z1 = zeros( n ); 
for cc = 1 : n
    z1(:,cc)=cc;
end
toc % Elapsed time is 0.445780 seconds.

%%
clear z0
tic
z0 = zeros(n,0)*zeros(0,n);
for cc = 1 : n
    z0(:,cc)=cc;
end
toc % Elapsed time is 0.297953 seconds.

However, using z(n,n)=0; instead yields results similar to the zeros(n) case.
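For completeness, a sketch of that third variant, mirroring the two blocks above (not part of the original measurement; timings will vary by machine):

clear z2
tic
z2(n,n) = 0;                 % allocation via out-of-bounds indexing
for cc = 1 : n
    z2(:,cc) = cc;
end
toc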

This is strange: I am seeing f being faster while g is slower than what you are seeing. But both of them are identical for me. Perhaps a different version of MATLAB?

>> g = @() zeros(1000, 0) * zeros(0, 1000);
>> f = @() zeros(1000)
f =     
    @()zeros(1000)
>> timeit(f)  
ans =    
   8.5019e-04
>> timeit(f)  
ans =    
   8.4627e-04
>> timeit(g)  
ans =    
   8.4627e-04

EDIT: Can you add +1 at the end of f and g, and see what times you get?
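In other words, something along these lines (a sketch of the suggested check):

f = @() zeros(1000) + 1;
g = @() zeros(1000,0)*zeros(0,1000) + 1;
timeit(f)
timeit(g)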

EDIT Jan 6, 2013 7:42 EST

I am using a machine remotely, so sorry about the low-quality graphs (had to generate them blind).

Machine config:

i7 920. 2.653 GHz. Linux. 12 GB RAM. 8MB cache.

[Figure: timing plots generated on the i7 920 machine]

It looks like even the machine I have access to shows this behavior, except at a larger size (somewhere between 1979 and 2073). There is no reason I can think of right now for the empty matrix multiplication to be faster at larger sizes.

I will be investigating a little bit more before coming back.

EDIT Jan 11, 2013

After @EitanT's post, I wanted to do a little bit more digging. I wrote some C code to see how Matlab may be creating a zeros matrix. Here is the C++ code I used:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include "timer.h"   // the author's timing helper (timer::start / timer::stop), not shown here

int main(int argc, char **argv)
{
    for (int i = 1975; i <= 2100; i += 25) {
        // malloc followed by an explicit zero-fill loop
        timer::start();
        double *foo = (double *)malloc(i * i * sizeof(double));
        for (int k = 0; k < i * i; k++) foo[k] = 0;
        double mftime = timer::stop();
        free(foo);

        // malloc followed by memset
        timer::start();
        double *bar = (double *)malloc(i * i * sizeof(double));
        memset(bar, 0, i * i * sizeof(double));
        double mmtime = timer::stop();
        free(bar);

        // calloc (zero-initialized allocation)
        timer::start();
        double *baz = (double *)calloc(i * i, sizeof(double));
        double catime = timer::stop();
        free(baz);

        printf("%d, %lf, %lf, %lf\n", i, mftime, mmtime, catime);
    }
}

Here are the results:

$ ./test
1975, 0.013812, 0.013578, 0.003321
2000, 0.014144, 0.013879, 0.003408
2025, 0.014396, 0.014219, 0.003490
2050, 0.014732, 0.013784, 0.000043
2075, 0.015022, 0.014122, 0.000045
2100, 0.014606, 0.014480, 0.000045

As you can see, calloc (4th column) seems to be the fastest method. It also gets significantly faster between 2025 and 2050 (I'd assume the change happens at around 2048?).

Now I went back to MATLAB to check for the same behavior. Here are the results:

>> test
1975, 0.003296, 0.003297
2000, 0.003377, 0.003385
2025, 0.003465, 0.003464
2050, 0.015987, 0.000019
2075, 0.016373, 0.000019
2100, 0.016762, 0.000020

It looks like both f() and g() are using calloc at smaller sizes (< 2048?). But at larger sizes f() (zeros(m, n)) starts to use malloc + memset, while g() (zeros(m, 0) * zeros(0, n)) keeps using calloc.
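The test script itself is not shown; a hypothetical reconstruction that would produce output in the same shape as above could look like this:

% test.m - hypothetical reconstruction of the MATLAB comparison above
for i = 1975:25:2100
    f = @() zeros(i);                     % plain zeros
    g = @() zeros(i,0)*zeros(0,i);        % empty matrix multiplication
    fprintf('%d, %f, %f\n', i, timeit(f), timeit(g));
end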

So the divergence is explained by the following:

  • zeros(..) begins to use a different (slower?) scheme at larger sizes.
  • calloc also behaves somewhat unexpectedly, leading to an improvement in performance.

This is the behavior on Linux. Can someone run the same experiment on a different machine (and perhaps a different OS) and see whether it holds?

The results might be a bit misleading. When you multiply two empty matrices, the resulting matrix is not immediately "allocated" and "initialized"; rather, this is postponed until you first use it (sort of like lazy evaluation).

The same applies when indexing out of bounds to grow a variable, which in the case of numeric arrays fills out any missing entries with zeros (I discuss the non-numeric case afterwards). Of course, growing the matrix this way does not overwrite existing elements.
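As a small illustration of this zero-filled growth (values chosen arbitrarily):

clear z
z = magic(3);     % some existing 3x3 data
z(5,5) = 0;       % out-of-bounds assignment grows z to 5x5
disp(z)           % the original 3x3 block is preserved; all new entries are zero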

So while it may seem faster, you are just delaying the allocation until you actually first use the matrix. In the end you'll have similar timings to doing the allocation from the start.

An example to show this behavior, compared to a few other alternatives:

N = 1000;

clear z
tic, z = zeros(N,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))

clear z
tic, z = zeros(N,0)*zeros(0,N); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))

clear z
tic, z(N,N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))

clear z
tic, z = full(spalloc(N,N,0)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))

clear z
tic, z(1:N,1:N) = 0; toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))

clear z
val = 0;
tic, z = val(ones(N)); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))

clear z
tic, z = repmat(0, [N N]); toc
tic, z = z + 1; toc
assert(isequal(z,ones(N)))

The results show that if you sum the elapsed times of both instructions in each case, you end up with similar total timings:

// zeros(N,N)
Elapsed time is 0.004525 seconds.
Elapsed time is 0.000792 seconds.

// zeros(N,0)*zeros(0,N)
Elapsed time is 0.000052 seconds.
Elapsed time is 0.004365 seconds.

// z(N,N) = 0
Elapsed time is 0.000053 seconds.
Elapsed time is 0.004119 seconds.

The other timings were:

// full(spalloc(N,N,0))
Elapsed time is 0.001463 seconds.
Elapsed time is 0.003751 seconds.

// z(1:N,1:N) = 0
Elapsed time is 0.006820 seconds.
Elapsed time is 0.000647 seconds.

// val(ones(N))
Elapsed time is 0.034880 seconds.
Elapsed time is 0.000911 seconds.

// repmat(0, [N N])
Elapsed time is 0.001320 seconds.
Elapsed time is 0.003749 seconds.

These measurements are only a few milliseconds and might not be very accurate, so you might want to run these commands in a loop a few thousand times and take the average. Also, running saved M-functions is sometimes faster than running scripts or commands at the prompt, as certain optimizations only happen that way...

Either way, allocation is usually done once, so who cares if it takes an extra 30 ms :)


A similar behavior can be seen with cell arrays or arrays of structures. Consider the following example:

N = 1000;

tic, a = cell(N,N); toc
tic, b = repmat({[]}, [N,N]); toc
tic, c{N,N} = []; toc

which gives:

Elapsed time is 0.001245 seconds.
Elapsed time is 0.040698 seconds.
Elapsed time is 0.004846 seconds.

Note that even though they are all equal, they occupy different amounts of memory:

>> assert(isequal(a,b,c))
>> whos a b c
  Name         Size                  Bytes  Class    Attributes

  a         1000x1000              8000000  cell               
  b         1000x1000            112000000  cell               
  c         1000x1000              8000104  cell               

In fact the situation is a bit more complicated here, since MATLAB is probably sharing the same empty matrix for all the cells, rather than creating multiple copies.

The cell array a is in fact an array of uninitialized cells (an array of NULL pointers), while b is a cell array where each cell is an empty array [] (internally, and because of data sharing, only the first cell b{1} points to [], while all the rest hold a reference to the first cell). The final array c is similar to a (uninitialized cells), but with the last cell containing an empty numeric matrix [].


I looked through the list of C functions exported from libmx.dll (using the Dependency Walker tool), and found a few interesting things:

  • There are undocumented functions for creating uninitialized arrays: mxCreateUninitDoubleMatrix, mxCreateUninitNumericArray, and mxCreateUninitNumericMatrix. In fact there is a submission on the File Exchange that makes use of these functions to provide a faster alternative to the zeros function.

  • There exists an undocumented function called mxFastZeros. Googling online, I can see you cross-posted this question on MATLAB Answers as well, with some excellent answers over there. James Tursa (same author of UNINIT from before) gave an example of how to use this undocumented function.

  • libmx.dll is linked against the tbbmalloc.dll shared library. This is Intel's TBB scalable memory allocator. The library provides equivalent memory allocation functions (malloc, calloc, free) optimized for parallel applications. Remember that many MATLAB functions are automatically multithreaded, so I wouldn't be surprised if zeros(..) is multithreaded and uses Intel's memory allocator once the matrix size is large enough (here is a recent comment by Loren Shure that confirms this fact). A crude way to probe the multithreading angle is sketched just below.
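This sketch only probes whether the computational thread count visibly affects the zeros timing; it is an assumption that this isolates the allocator at all, and maxNumCompThreads warns about future removal in newer releases:

old = maxNumCompThreads(1);          % temporarily force a single computational thread
t_single = timeit(@() zeros(4096));
maxNumCompThreads(old);              % restore the previous setting
t_multi = timeit(@() zeros(4096));
fprintf('zeros(4096): %.4g s single-threaded vs %.4g s multi-threaded\n', t_single, t_multi);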

Regarding the last point about the memory allocator, you could write a benchmark in C/C++ similar to what @PavanYalamanchili did, and compare the various allocators available. Something like this. Remember that MEX-files have a slightly higher memory-management overhead, since MATLAB automatically frees any memory that was allocated in MEX-files using the mxCalloc, mxMalloc, or mxRealloc functions. For what it's worth, it used to be possible to change the internal memory manager in older versions.


EDIT:

Here is a more thorough benchmark to compare the discussed alternatives. It specifically shows that once you stress the use of the entire allocated matrix, all three methods are on an equal footing, and the difference is negligible.

function compare_zeros_init()
    iter = 100;
    for N = 512.*(1:8)
        % ZEROS(N,N)
        t = zeros(iter,3);
        for i=1:iter
            clear z
            tic, z = zeros(N,N); t(i,1) = toc;
            tic, z(:) = 9; t(i,2) = toc;
            tic, z = z + 1; t(i,3) = toc;
        end
        fprintf('N = %4d, ZEROS = %.9f\n', N, mean(sum(t,2)))

        % z(N,N)=0
        t = zeros(iter,3);
        for i=1:iter
            clear z
            tic, z(N,N) = 0; t(i,1) = toc;
            tic, z(:) = 9; t(i,2) = toc;
            tic, z = z + 1; t(i,3) = toc;
        end
        fprintf('N = %4d, GROW  = %.9f\n', N, mean(sum(t,2)))

        % ZEROS(N,0)*ZEROS(0,N)
        t = zeros(iter,3);
        for i=1:iter
            clear z
            tic, z = zeros(N,0)*zeros(0,N); t(i,1) = toc;
            tic, z(:) = 9; t(i,2) = toc;
            tic, z = z + 1; t(i,3) = toc;
        end
        fprintf('N = %4d, MULT  = %.9f\n\n', N, mean(sum(t,2)))
    end
end

Below are the timings, averaged over 100 iterations, for increasing matrix sizes. I performed the tests in R2013a.

>> compare_zeros_init
N =  512, ZEROS = 0.001560168
N =  512, GROW  = 0.001479991
N =  512, MULT  = 0.001457031

N = 1024, ZEROS = 0.005744873
N = 1024, GROW  = 0.005352638
N = 1024, MULT  = 0.005359236

N = 1536, ZEROS = 0.011950846
N = 1536, GROW  = 0.009051589
N = 1536, MULT  = 0.008418878

N = 2048, ZEROS = 0.012154002
N = 2048, GROW  = 0.010996315
N = 2048, MULT  = 0.011002169

N = 2560, ZEROS = 0.017940950
N = 2560, GROW  = 0.017641046
N = 2560, MULT  = 0.017640323

N = 3072, ZEROS = 0.025657999
N = 3072, GROW  = 0.025836506
N = 3072, MULT  = 0.051533432

N = 3584, ZEROS = 0.074739924
N = 3584, GROW  = 0.070486857
N = 3584, MULT  = 0.072822335

N = 4096, ZEROS = 0.098791732
N = 4096, GROW  = 0.095849788
N = 4096, MULT  = 0.102148452

After doing some research, I found this article on "Undocumented Matlab", in which Yair Altman had already come to the conclusion that MathWorks' way of preallocating matrices using zeros(M, N) is indeed not the most efficient one.

He timed x = zeros(M,N) vs. clear x, x(M,N) = 0 and found that the latter is ~500 times faster. According to his explanation, the second method simply creates an M-by-N matrix, the elements of which are automatically initialized to 0. The first method, however, creates x (with x having automatic zero elements) and then assigns a zero to every element of x again, which is a redundant operation that takes more time.
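A minimal sketch of that comparison (timings depend heavily on machine and MATLAB version; tic/toc is used for the second form because the indexed assignment cannot be wrapped in an anonymous function for timeit):

M = 1000;  N = 1000;
t1 = timeit(@() zeros(M,N));       % preallocation with zeros
clear x
tic;  x(M,N) = 0;  t2 = toc;       % growth by indexed assignment
fprintf('zeros(M,N): %.3g s,  x(M,N)=0: %.3g s\n', t1, t2);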

In the case of empty matrix multiplication, such as what you've shown in your question, MATLAB expects the product to be an M×N matrix, and therefore it allocates an M×N matrix. Consequently, the output matrix is automatically initialized to zeros. Since the original matrices are empty, no further calculations are performed, and hence the elements of the output matrix remain unchanged and equal to zero.

Interesting question; apparently there are several ways to 'beat' the built-in zeros function. My only guess as to why this is happening is that it could be more memory-efficient (after all, zeros(LargeNumber) will sooner cause Matlab to hit the memory limit than form a devastating speed bottleneck in most code), or more robust somehow.

Here is another fast allocation method using a sparse matrix; I have added the regular zeros function as a benchmark:

tic; x=zeros(1000,1000); toc
Elapsed time is 0.002863 seconds.

tic; clear x; x(1000,1000)=0; toc
Elapsed time is 0.000282 seconds.

tic; x=full(spalloc(1000,1000,0)); toc
Elapsed time is 0.000273 seconds.

tic; x=spalloc(1000,1000,1000000); toc %Is this the same for practical purposes?
Elapsed time is 0.000281 seconds.
