Java best practices for vectorized computations

Original post: 2016-12-27 17:05:46, tags: java, blas, nd4j

I'm researching methods for computing expensive vector operations in Java, e.g. dot products or multiplications between large matrices. There are a few good threads on this site on this topic, like this and this.

It appears that there is no reliable way of having the JIT compile code to use CPU vector instructions (SSE2, AVX, MMX, ...). Moreover, high-performance linear algebra libraries (ND4J, jblas, ...) do in fact make JNI calls to BLAS/LAPACK libraries for the core routines. And I understand BLAS/LAPACK packages to be the de facto standard choices for native linear algebra computations.
On the other hand, others (JAMA, ...) implement their algorithms in pure Java without native calls.
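
For reference, the pure Java approach boils down to plain loops over arrays. The following is a minimal sketch, not code taken from JAMA or any other library: the class `PureJavaKernels`, its method names, and the flat row-major matrix layout are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical helper class; names and layout are illustrative, not from JAMA or any real library.
final class PureJavaKernels {

    // Dot product of two equal-length vectors using a plain scalar loop.
    static double dot(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    // C = A * B for flat row-major matrices: A is m x k, B is k x n, C is m x n.
    // The i-p-j loop order reads B and writes C row by row, which tends to be
    // friendlier to the cache than the textbook i-j-p order.
    static double[] multiply(double[] a, double[] b, int m, int k, int n) {
        if (a.length != m * k || b.length != k * n) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double[] c = new double[m * n];
        for (int i = 0; i < m; i++) {
            for (int p = 0; p < k; p++) {
                double aip = a[i * k + p];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aip * b[p * n + j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // 1*4 + 2*5 + 3*6 = 32.0
        System.out.println(dot(new double[]{1, 2, 3}, new double[]{4, 5, 6}));

        double[] a = {1, 2, 3, 4}; // [[1, 2], [3, 4]]
        double[] b = {5, 6, 7, 8}; // [[5, 6], [7, 8]]
        // prints [19.0, 22.0, 43.0, 50.0]
        System.out.println(Arrays.toString(multiply(a, b, 2, 2, 2)));
    }
}
```

Whether the JIT turns loops like these into SIMD instructions is exactly the part that cannot be relied on, which is why the native route comes up at all.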

My questions are:

What are the best practices here?
Is making native calls to BLAS/LAPACK actually a recommended choice? Are there other libraries worth considering?
Is the overhead of JNI calls negligible compared to the performance gain? Does anyone have experience as to where the threshold lies (e.g. how small an input has to be for a JNI call to cost more than a pure Java routine)?
How big is the portability tradeoff?

I hope this question could be of help both for those who develop their own computation routines, and for those who just want to make an educated choice between different implementations.

Insights are appreciated!

1 answer

There are no clear best practices that cover every case. Whether you can or should use a pure Java solution (no SIMD instructions) or SIMD-optimized native code through JNI depends on your particular application, specifically on the size of your arrays and on any restrictions on the target system.

There could be a requirement that you are not allowed to install specific native libraries on the target system, and BLAS may not already be installed there. In that case you simply have to use a Java library.
Pure Java libraries tend to perform better for arrays with length much smaller than 100, and at some point after that you get better performance using native libraries through JNI. As always, your mileage may vary.
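
One way to act on such a size threshold is to route small inputs to a Java loop and large ones to a native-backed kernel. This is only a sketch under stated assumptions: `DotDispatcher` and `DotKernel` are hypothetical names, the "native" slot is filled with the same Java loop so the example runs without any native library, and the cutoff of 100 merely echoes the rough figure above and should be measured on your own system.

```java
// Hypothetical sketch: DotDispatcher and DotKernel are illustrative names,
// not part of ND4J, jblas, or any other library.
final class DotDispatcher {

    // Minimal kernel abstraction so pure Java and native-backed code look the same to callers.
    interface DotKernel {
        double dot(double[] x, double[] y);
    }

    private final DotKernel small; // used below the threshold (pure Java)
    private final DotKernel large; // used at or above the threshold (e.g. a JNI wrapper)
    private final int threshold;   // tunable cutoff, found by measurement

    DotDispatcher(DotKernel small, DotKernel large, int threshold) {
        this.small = small;
        this.large = large;
        this.threshold = threshold;
    }

    // Picks a kernel by input length, then delegates to it.
    double dot(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        DotKernel kernel = x.length < threshold ? small : large;
        return kernel.dot(x, y);
    }

    // Plain scalar loop; no native call, so no JNI overhead.
    static double pureJavaDot(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Assumption: a real application would pass a kernel that calls BLAS through JNI
        // as the second argument. Here the Java loop stands in for it.
        DotDispatcher d = new DotDispatcher(
                DotDispatcher::pureJavaDot, DotDispatcher::pureJavaDot, 100);
        System.out.println(d.dot(new double[]{1, 2, 3}, new double[]{4, 5, 6})); // 32.0
    }
}
```

Keeping the cutoff as a constructor parameter means it can be set per machine or per BLAS implementation instead of being hard-coded.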

Pertinent benchmarks have been performed (in no particular order):

These benchmarks can be as confusing as they are informative. One library may be faster for some operations and slower for others. Also keep in mind that there may be more than one implementation of BLAS available for your system. I currently have three installed on my system: blas, atlas and openblas. Apart from choosing a Java library that wraps a BLAS implementation, you also have to choose the underlying BLAS implementation.
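
Because results differ per operation and per BLAS implementation, it can help to time candidates on your own workload. Below is a crude timing sketch, assuming a hypothetical `CrudeTimer` class; it only does a warm-up phase and takes the median of a few runs. For numbers you intend to trust, JMH (the OpenJDK Java Microbenchmark Harness) is the more reliable tool.

```java
import java.util.Arrays;

// Hypothetical helper for rough comparisons; not a substitute for a proper benchmark harness.
final class CrudeTimer {

    // Runs the task warmupRuns times untimed (to give the JIT a chance to compile it),
    // then timedRuns times with timing, and returns the median sample in nanoseconds
    // (the upper median when timedRuns is even).
    static long medianNanos(Runnable task, int warmupRuns, int timedRuns) {
        if (timedRuns < 1) {
            throw new IllegalArgumentException("timedRuns must be at least 1");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long[] samples = new long[timedRuns];
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[timedRuns / 2];
    }

    public static void main(String[] args) {
        double[] x = new double[1_000_000];
        double[] y = new double[1_000_000];
        Arrays.fill(x, 1.5);
        Arrays.fill(y, 2.0);

        // The result is stored in a sink so the JIT cannot discard the loop as dead code.
        double[] sink = new double[1];
        long nanos = medianNanos(() -> {
            double sum = 0.0;
            for (int i = 0; i < x.length; i++) {
                sum += x[i] * y[i];
            }
            sink[0] = sum;
        }, 20, 11);

        System.out.println("pure Java dot, median ns: " + nanos + " (result " + sink[0] + ")");
    }
}
```

To compare libraries, pass each candidate's call as the `Runnable` and repeat the measurement for several array sizes, since the ranking may change with size.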

This answer has a fairly up-to-date list, except that it doesn't mention ND4J, which is rather new. Keep in mind that jeigen depends on Eigen, so not on BLAS.