
Why is the performance of these matrix multiplications so different?

I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A member where row i of the matrix is A[i]. The other class (Mat2) stores A and T, where T is the transpose of A.

Let's say we have a square matrix M and we want the product M.mult(M). Call the product P.

When M is a Mat1 instance the algorithm used was the straightforward one:

for (int i = 0; i < n; i++)        // n == M.A.length
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            P[i][j] += M.A[i][k] * M.A[k][j];

In the case where M is a Mat2 I used:

// same triple loop as above; only the innermost statement changes:
P[i][j] += M.A[i][k] * M.T[j][k];

which is the same algorithm because T[j][k]==A[k][j]. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it this much faster?
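Since the complete benchmark isn't shown here, the following is a minimal self-contained sketch of the two variants as described above (the class and method names are my own, not the question's actual Mat1/Mat2 code):

public final class MultBench {
    // Mat1-style: the a[k][j] read jumps a whole row (~n doubles) per k step
    static double[][] mult1(double[][] a) {
        int n = a.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    p[i][j] += a[i][k] * a[k][j];
        return p;
    }

    // Mat2-style: both operands walk contiguous rows
    static double[][] mult2(double[][] a, double[][] t) {
        int n = a.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    p[i][j] += a[i][k] * t[j][k];
        return p;
    }

    public static void main(String[] args) {
        int n = 1000;
        java.util.Random rnd = new java.util.Random(42);
        double[][] a = new double[n][n];
        double[][] t = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                a[i][j] = rnd.nextDouble();
                t[j][i] = a[i][j];   // precompute the transpose for mult2
            }
        long t0 = System.nanoTime();
        double[][] p1 = mult1(a);
        System.out.printf("mult1: %.2f s%n", (System.nanoTime() - t0) / 1e9);
        t0 = System.nanoTime();
        double[][] p2 = mult2(a, t);
        System.out.printf("mult2: %.2f s%n", (System.nanoTime() - t0) / 1e9);
        System.out.println("results match: " + (p1[0][0] == p2[0][0])); // keep results live
    }
}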

My only guess is that the second one makes better use of the CPU caches. Data is pulled into the caches in chunks larger than one word, and the second algorithm benefits from this by traversing only rows, while the first ignores the data just pulled into the caches by immediately jumping to the row below (~1000 words away in memory, because arrays are stored in row-major order), none of which is cached.

I asked someone and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I didn't think of this at all, but I can sort of see how it results in fewer TLB faults.

So, which is it? Or is there some other reason for the performance difference?

This is because of the locality of your data.

In RAM a matrix, although two-dimensional from your point of view, is of course stored as a contiguous array of bytes. The only difference from a 1D array is that the offset is calculated by combining the two indices that you use.

This means that if you access the element at position x,y it will calculate x*row_length + y, and this will be the offset used to reference the element at the specified position.
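To make that offset arithmetic concrete, a small sketch (note that a Java double[][] is actually an array of separate row arrays, so the strictly contiguous picture applies within each row rather than across the whole matrix):

// flat row-major storage with explicit offset arithmetic
static double get(double[] flat, int rowLength, int x, int y) {
    return flat[x * rowLength + y];   // offset = x * row_length + y
}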

What happens is that a big matrix isn't stored in just one page of memory (this is how your OS manages RAM, by splitting it into chunks), so if you try to access an element that is not already present, the correct page has to be loaded into the CPU cache.

As long as you proceed contiguously through your multiplication you don't create any problems, since you mostly use all the coefficients of a page and then switch to the next one. But if you invert the indices, every single element may be contained in a different memory page, so it has to ask RAM for a different page almost for every single multiplication you do; this is why the difference is so stark.

(I have rather simplified the whole explanation; it's just to give you the basic idea behind this problem.)

In any case, I don't think this is caused by the JVM itself. It may be related to how your OS manages the memory of the Java process.

The cache and TLB hypotheses are both reasonable, but I'd like to see the complete code of your benchmark ... not just the snippets above.

Another possibility is that the performance difference is a result of your application using 50% more memory for the data arrays in the version with the transpose. If your JVM's heap size is small, it is possible that this is causing the GC to run too often. This could well be a result of using the default heap size. (Three lots of 1000 x 1000 x 8 bytes is ~24Mb.)

Try setting the initial and max heap sizes to (say) double the current max size. If that makes no difference, then this is not a simple heap size issue.
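For example, something along these lines; the sizes here are placeholders to scale to your defaults, -verbose:gc just prints GC activity so you can see whether collections coincide with the slow runs, and MatrixBench stands in for your benchmark's main class:

java -Xms256m -Xmx256m -verbose:gc MatrixBench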

It's easy to guess that the problem might be locality, and maybe it is, but that's still a guess.

It's not necessary to guess. Two techniques might give you the answer - single stepping and random pausing.

If you single-step the slow code you might find out that it's doing a lot of stuff you never dreamed of. Such as, you ask? Try it and find out. What you should see it doing, at the machine-language level, is efficiently stepping through the inner loop with no waste motion.

If it actually is stepping through the inner loop with no waste motion, then random pausing will give you information. Since the slow one is taking 20 times longer than the fast one, that implies 95% of the time it is doing something it doesn't have to. So see what it is. Each time you pause it, the chance is 95% that you will see what that is, and why.
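In Java, a low-tech way to apply random pausing without a debugger is to grab a few thread dumps while the slow multiply is running; the traces include source line numbers, so whichever frame keeps recurring is where the time goes. A sketch using the standard JDK tools (the pid placeholder comes from the jps output):

jps            # list running JVMs and their PIDs
jstack <pid>   # repeat a few times; the recurring frame is the hot spot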

If, in the slow case, the instructions it is executing appear just as efficient as in the fast case, then cache locality is a reasonable guess as to why it is slow. I'm sure that, once you've eliminated any other silliness that may be going on, cache locality will dominate.

You could try comparing performance between JDK6 and OpenJDK7, given this set of results ...
