C++ std::vector for HPC?

I am translating a program that performs numerical simulations from FORTRAN to C++.

I have to deal with big arrays of double, about 800 MB in size. This declaration

double M[100][100][100][100];

gives a segmentation fault because the stack is not that big. Using new and delete is awkward because I need four nested for loops to allocate my array, and again to deallocate it.

std::array lives on the stack, so it is no good either. std::vector would be a nice choice, so:

First question: is std::vector good for fast simulations, or would a nested

vector<vector<vector<vector<double>>>>

with 100 elements in each dimension carry a lot of useless and heavy data?

Second question: do you know of any other data structures that I can use? Maybe there is something in Boost.

For the moment I am simply using this solution:

double * M = new double [100000000];

and I am manually computing the index of each entry that I need. If I don't find any other performant solution, I will write a class that manages this last method automatically.

Third question: do you think that would significantly decrease performance?

You may want to consider std::valarray, which was designed to be competitive with FORTRAN. It stores elements as a flat array and supports math operations, as well as operations for slicing and indirect access.

That sounds like what you're planning anyway, although even the manpage suggests there may be more flexible alternatives.
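As a rough illustration of the flat storage and slicing mentioned above, here is a minimal sketch. The helper name `sum_last_axis` and the row-major layout are assumptions made for this example, not something from the question; only `std::valarray`, `std::slice` and `sum()` come from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Hypothetical helper: sums the elements M[i][j][k][0 .. n3-1] of a 4-D array
// stored row-major in one flat std::valarray. n1, n2, n3 are the extents of
// the last three dimensions.
double sum_last_axis(const std::valarray<double>& m,
                     std::size_t n1, std::size_t n2, std::size_t n3,
                     std::size_t i, std::size_t j, std::size_t k)
{
    // Linear offset of element (i, j, k, 0) in row-major order.
    const std::size_t start = ((i * n1 + j) * n2 + k) * n3;
    assert(start + n3 <= m.size());

    // std::slice(start, length, stride) selects n3 consecutive elements;
    // indexing a const valarray with a slice returns a copy of them.
    const std::valarray<double> line = m[std::slice(start, n3, 1)];
    return line.sum();
}
```

Element-wise arithmetic such as `m *= 2.0` or `m + other` also works on the whole array, which is the part that resembles FORTRAN array syntax.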

Using something on the stack is definitely much more efficient. The process memory is limited by the OS (stack + heap), and your issue is that in most cases you might exceed the memory allocated to the process.

To resolve the memory limitation, I would suggest you have a look at stxxl. It is a library which implements most of the STL containers and algorithms, but uses external memory when needed. Of course this will compromise performance...

Programmers tend to approach every problem by first writing more code, which then has to be maintained. Not every problem is a nail...

More code is not the simplest, most effective solution here. More code is also likely to produce an executable that's slower.

Stack memory is just memory - it's no different from heap memory. It's just managed by the process differently, and is subject to different resource limits. It makes no real difference to the OS whether a process uses 1 GB of memory on its stack or 1 GB from its heap.

In this case, the stack size limit is likely an artificial configuration setting. On a Linux system, the stack size limit can be reset for a shell and its child processes:

bash-4.1$ ulimit -s unlimited
bash-4.1$ ulimit -s
unlimited
bash-4.1$

See this question and its answers for more details.

All POSIX-compliant systems should have similar features, as the stack-size limit is a POSIX-standard resource limit.
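The same limit can also be inspected from inside a program through the POSIX `getrlimit()` call. The following is only a sketch for a POSIX system; the helper names `print_limit` and `report_stack_limit` are made up for this example. Note that `getrlimit()` reports bytes, whereas bash's `ulimit -s` reports units of 1024 bytes.

```cpp
#include <sys/resource.h>   // getrlimit, RLIMIT_STACK, RLIM_INFINITY (POSIX)
#include <cassert>
#include <cstdio>

// Hypothetical helper: print one limit value, handling the "unlimited" case.
static void print_limit(const char* label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        std::printf("%s: unlimited\n", label);
    else
        std::printf("%s: %llu bytes\n", label,
                    static_cast<unsigned long long>(value));
}

// Hypothetical helper: report the soft and hard stack limits of this process.
// Returns false if getrlimit() fails.
bool report_stack_limit()
{
    struct rlimit lim {};
    if (getrlimit(RLIMIT_STACK, &lim) != 0)
        return false;

    // Sanity check: the soft limit never exceeds the hard limit.
    assert(lim.rlim_cur <= lim.rlim_max);

    print_limit("soft stack limit", lim.rlim_cur);
    print_limit("hard stack limit", lim.rlim_max);
    return true;
}
```

A process may raise its soft limit up to the hard limit with `setrlimit()`, but whether an already running main thread benefits from that is platform-dependent, so setting the limit in the shell before launching the program is the more predictable route.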

Also, you can run a thread with an arbitrarily large stack quite easily:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <stdio.h>

void *threadFunc( void *arg )
{
    double array[1024][1024][64];
    memset( array, 0, sizeof( array ) );
    return( NULL );
}
int main( int argc, char **argv )
{
    // create and memset the stack lest the child thread start thrashing
    // the machine with "try to run/page fault/map page" cycles
    // the memset will create the physical page mappings
    size_t stackSize = strtoul( argv[ 1 ] ? argv[ 1 ] : "1073741824",
        NULL, 0 );    
    void *stack = mmap( 0, stackSize, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANON, -1, 0 );
    memset( stack, 0, stackSize );

    // create a large stack for the child thread
    pthread_attr_t attr;    
    pthread_attr_init( &attr );
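    // pthread_attr_setstack() sets the lowest address and the size together;
    // pthread_attr_setstackaddr() is obsolete and was removed in POSIX.1-2008
    pthread_attr_setstack( &attr, stack, stackSize );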

    pthread_t tid;
    pthread_create( &tid, &attr, threadFunc, NULL );
    void *result;
    pthread_join( tid, &result );
    return( 0 );
}

Error checking has been omitted.

This also works if you run ulimit -s unlimited before running the compiled program (and of course if the machine has enough virtual memory...):

#include <string.h>

int main( int argc, char **argv )
{
    double array[1024][1024][64];
    memset( array, 0, sizeof( array ) );

    return( 0 );
}

In some cases, it might be useful to cast the 1-D pointer to a 4-D rectangular array to enable Cartesian indexing (rather than linear indexing):

#include <cstdio>

#define For( i, n ) for( int i = 0; i < n; i++ )

double getsum( double *A, int *n, int loop )
{
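    // Cast to 4-D array. Note: a pointer to a variable-length array type is a
    // compiler extension in C++ (accepted by g++), not standard C++.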
    typedef double (* A4d_t)[ n[2] ][ n[1] ][ n[0] ];
    A4d_t A4d = (A4d_t) A;

    // Fill the array with linear indexing.
    int ntot = n[0] * n[1] * n[2] * n[3];
    For( k, ntot ) A[ k ] = 1.0 / (loop + k + 2);

    // Calc weighted sum with Cartesian indexing.
    double s = 0.0;
    For( i3, n[3] )
    For( i2, n[2] )
    For( i1, n[1] )
    For( i0, n[0] )
        s += A4d[ i3 ][ i2 ][ i1 ][ i0 ] * (i0 + i1 + i2 + i3 + 4);

    return s;
}

int main()
{
    int n[ 4 ] = { 100, 100, 100, 100 };

    double *A = new double [ n[0] * n[1] * n[2] * n[3] ];

    double ans = 0.0;
    For( loop, 10 )
    {
        printf( "loop = %d\n", loop );
        ans += getsum( A, n, loop );
    }
    printf( "ans = %30.20f\n", ans );
    delete[] A;    // release the array allocated with new[]
    return 0;
}

which takes 5.7 sec with g++-6.0 -O3 on Mac OS X 10.9. It might be interesting to compare the performance with that of vector<vector<...>> or a custom array view class. (I tried the latter before, and at that time the above array casting was somewhat faster than my (naive) array class.)

I think the answer to that question is quite opinion-based. The concept of "good" strictly depends on how the data structure is used.

Anyway, if the number of elements does not change during execution, and your problem is essentially one of memory access, then the best solution, in my opinion, is a contiguous block of memory.

Generally in those cases, my choice is a simple T* data = new T[SIZE]; encapsulated in a class which handles the access correctly.

Using a pointer makes me feel a little more comfortable about memory control, but a std::vector<T> is practically the same thing.
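A minimal sketch of such a wrapper class is shown below. The class name `Array4D`, the row-major layout and the use of std::vector<double> as backing storage are choices made for this illustration; this is not a tuned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: a 4-D array of double stored in one contiguous block.
class Array4D {
public:
    Array4D(std::size_t n0, std::size_t n1, std::size_t n2, std::size_t n3)
        : n0_(n0), n1_(n1), n2_(n2), n3_(n3),
          data_(n0 * n1 * n2 * n3, 0.0) {}   // one heap allocation, zero-filled

    // Cartesian access: a(i, j, k, l) instead of manual index arithmetic.
    double& operator()(std::size_t i, std::size_t j,
                       std::size_t k, std::size_t l)
    { return data_[index(i, j, k, l)]; }

    double operator()(std::size_t i, std::size_t j,
                      std::size_t k, std::size_t l) const
    { return data_[index(i, j, k, l)]; }

    std::size_t size() const { return data_.size(); }
    double* data() { return data_.data(); }   // raw pointer, e.g. for BLAS calls

private:
    // Row-major linear index; the bounds check disappears with -DNDEBUG.
    std::size_t index(std::size_t i, std::size_t j,
                      std::size_t k, std::size_t l) const
    {
        assert(i < n0_ && j < n1_ && k < n2_ && l < n3_);
        return ((i * n1_ + j) * n2_ + k) * n3_ + l;
    }

    std::size_t n0_, n1_, n2_, n3_;
    std::vector<double> data_;
};
```

With this, `Array4D M(100, 100, 100, 100);` allocates the 800 MB block on the heap, and `M(i, j, k, l)` compiles down to the same index arithmetic one would write by hand, so the accessor itself should add little or no cost once inlined.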


That's all I can say from the information you've provided in your question. In addition, I would suggest paying attention to how the data is used as well.

For example, to maximize your application's performance, try to exploit the caches and avoid cache misses. You could check whether your data has particular access patterns, or even whether your problem can be scaled up in a multi-threaded context.
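To make the point about access patterns concrete, here is a small sketch on a 2-D array stored row-major in a flat std::vector. The function names are invented for this example. Both functions compute the same sum; only the traversal order differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Innermost loop runs over the last index, so memory is read sequentially
// and every cache line that is fetched gets fully used.
double sum_contiguous(const std::vector<double>& a,
                      std::size_t rows, std::size_t cols)
{
    assert(a.size() == rows * cols);
    double s = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            s += a[i * cols + j];
    return s;
}

// Innermost loop runs over the first index, so consecutive reads are
// `cols` elements apart; for large arrays this tends to miss the cache.
double sum_strided(const std::vector<double>& a,
                   std::size_t rows, std::size_t cols)
{
    assert(a.size() == rows * cols);
    double s = 0.0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            s += a[i * cols + j];
    return s;
}
```

Keep in mind that FORTRAN arrays are column-major while C++ arrays are row-major, so loop nests translated literally from FORTRAN may end up with the strided pattern unless the loop order or the index formula is swapped.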


To conclude, in my opinion a contiguous vector of double is generally the best choice. That answers your question. But if you care about performance, you should think about how to exploit caches and processor mechanisms (like multi-threading) as well as you can.
