
How to optimize large data manipulation in parallel

Posted 2012-07-28 09:32:09. Tags: c++, c, multithreading

I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation). I'm using an AMD Phenom II X4 965 Black Edition, so it has a decent amount of cache at several levels.

I've developed both single-threaded (ST) and multithreaded (MT) versions of the functions that perform each individual operation and, not surprisingly, in the best case the MT versions are only 2x faster than the ST ones, even when using 4 cores.

Given that I'm a fan of using 100% of the available resources, I was annoyed to get just 2x; I want 4x.
For this reason I've already spent a considerable amount of time with -pg and valgrind, using the cache simulator and the call graph. The program works as expected: the cores share the input process data (i.e. the operations to apply to the data), and cache misses are reported, as expected, when the different threads load the data to be processed (millions of entities or rows, if that gives you an idea of what I'm trying to do :-) ). Finally, I tried different compilers, g++ and clang++, both with -O3, and performance is identical.

My conclusion is that, because there are gigabytes of data to process and all of it eventually has to be loaded into the CPU, this is genuine wait time. Can I further improve my software? Have I hit a limit?

I'm using C/C++ on Linux x86-64, Ubuntu 11.10. I'm all ears! :-)

1 Answer

What kind of application is it? Could you show us some code?

As I commented, you might have reached a hardware limit such as RAM bandwidth. If you have, no software trick can get around it.
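One rough way to check whether RAM bandwidth is the limit is to time a trivial streaming pass over a large array with 1, 2 and 4 threads and see where the throughput stops growing. The sketch below is only an illustration: `parallel_sum` and `measured_gb_per_s` are hypothetical names, not taken from the question, and a plain sum stands in for the real per-row work.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sum `data` with `num_threads` workers,
// each reading one contiguous slice of the array.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            // Each worker writes its own slot exactly once, at the end.
            partial[t] = std::accumulate(data.data() + begin,
                                         data.data() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

// Hypothetical helper: wall-clock throughput in GB/s for one pass
// over `data` using `num_threads` threads.
double measured_gb_per_s(const std::vector<double>& data, unsigned num_threads) {
    const auto start = std::chrono::steady_clock::now();
    volatile double sink = parallel_sum(data, num_threads);  // keep the result alive
    (void)sink;
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return static_cast<double>(data.size() * sizeof(double)) / elapsed.count() / 1e9;
}
```

Call `measured_gb_per_s` on an array much larger than the last-level cache (hundreds of MB) with 1, 2 and 4 threads, building with something like `g++ -O2 -pthread`. If the figure barely moves between 2 and 4 threads, the memory bus rather than the cores is probably the bottleneck.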

You might investigate MPI, OpenMP, or OpenCL (on GPUs), but without some idea of what your application does we cannot help much.
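To give an idea of what the OpenMP route looks like, here is a minimal sketch of a selection-plus-aggregation pass over a column of values. The function name `count_above` and the data layout are assumptions for illustration only, since the question shows no code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: count the rows whose value exceeds `threshold`.
// With -fopenmp the loop is split across threads and the per-thread
// counts are combined by the reduction clause; without that flag the
// pragma is ignored and the loop simply runs serially.
long long count_above(const std::vector<double>& values, double threshold) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(values.size());
    long long count = 0;
    #pragma omp parallel for reduction(+ : count)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        if (values[static_cast<std::size_t>(i)] > threshold) {
            ++count;
        }
    }
    // Debug-only sanity check on the combined result.
    assert(count <= static_cast<long long>(values.size()));
    return count;
}
```

Build with `g++ -O3 -fopenmp`. This replaces hand-written thread management, but it does not add memory bandwidth: if the hand-rolled MT version is already stalled on RAM, expect roughly the same 2x.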

If you compile with GCC and want to help the processor's cache prefetching, consider using __builtin_prefetch, carefully and sparingly (using it too much or in the wrong places will decrease performance).
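As a hedged illustration of where __builtin_prefetch can pay off: a purely sequential scan is usually handled well by the hardware prefetcher already, so the sketch below uses an indirect (gather) access pattern instead. The name `gather_sum_prefetch` and the default look-ahead distance of 16 are assumptions; the right distance has to be found by measurement on the target machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: sum data[index[i]] for every i, asking the CPU to
// start loading the element needed `distance` iterations from now.
double gather_sum_prefetch(const std::vector<double>& data,
                           const std::vector<std::size_t>& index,
                           std::size_t distance = 16) {
    double total = 0.0;
    const std::size_t n = index.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            // Arguments: address, 0 = prefetch for read, 1 = low temporal locality.
            __builtin_prefetch(&data[index[i + distance]], 0, 1);
        }
        assert(index[i] < data.size());  // debug-only bounds check
        total += data[index[i]];
    }
    return total;
}
```

The prefetch is only a hint and never changes the result, so the function can be benchmarked against a version without the hint. Keep the hint only if it is measurably faster.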