
Allocating ~10GB of vectors - how can I speed it up?

I'm loading ~1000 files, each representing an array of ~3 million floats. I need to have them all in memory together, as I need to do some calculations that involve all of them.

In the code below, I've broken out the memory allocation and file reading, so I can observe the speed of each separately. I was a bit surprised to find the memory allocation taking much longer than the file reading.

  std::vector<std::vector<float> *> v(matrix_count);

  // Allocate one ~3-million-float vector per file (this loop took 6.8s).
  for (int i = 0; i < matrix_count; i++) {
    v[i] = new std::vector<float>(array_size);
  }

  // Read each file's raw floats into its preallocated vector (this took 2.5s).
  for (int i = 0; i < matrix_count; i++) {
    std::ifstream is(files[i], std::ios::binary);
    is.read((char*) &((*v[i])[0]), array_size * sizeof(float));
    is.close();
  }

Measuring the time, the allocation loop took 6.8s while the file loading took 2.5s. It seems counter-intuitive that reading from disk is almost 3x faster than just allocating space to hold the data.

Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_alloc -- I guess a 10GB vector isn't ok.

Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_alloc -- I guess a 10GB vector isn't ok.

I mainly wanted to respond by addressing this one part: bad_alloc exceptions tend to be misunderstood. They're not the result of "running out of memory" -- they're the result of the system failing to find a contiguous block of unused pages. You could have plenty more than enough memory available and still get a bad_alloc if you get in the habit of trying to allocate massive blocks of contiguous memory, simply because the system can't find a contiguous set of pages that are free. You can't necessarily avoid bad_alloc by "making sure plenty of memory is free": even having over 100 gigabytes of RAM can still leave you vulnerable when trying to allocate a mere 10 GB block.

The way to avoid them is to allocate memory in smaller chunks instead of one epic array. At a large enough scale, structures like unrolled lists can start to offer favorable performance over a gigantic array, and a much lower (exponentially so) probability of ever getting a bad_alloc exception, unless you actually do exhaust all the available memory. There is a point where contiguity and the locality of reference it provides cease to be beneficial and may actually hinder memory performance at a large enough size (mainly due to paging, not caching).

For the kind of epic-scale input you're handling, you might actually get better performance out of std::deque given its page-friendly nature (it's one of the few times deque can really shine without needing push_front to justify it over vector). It's something to potentially try if you don't need perfect contiguity.

Naturally it's best if you measure this with an actual profiler. It'll help us home in on the actual problem, though it might not be completely shocking (surprising, but maybe not shocking) that you're bottlenecked by memory here instead of disk IO, given the kind of "massive number of massive blocks" you're allocating (disk IO is slow, but heap allocation can sometimes be expensive if you're really stressing the system). It depends a lot on the system's allocation strategy, but even slab or buddy allocators can fall back to a much slower code path if you allocate such epic blocks of memory en masse, and allocations may even start to require something resembling a search, or more access to secondary storage, in those extreme cases (here I'm afraid I'm not sure exactly what goes on under the hood when allocating so many massive blocks, but I have "felt" and measured these kinds of bottlenecks before, in a way where I never quite figured out what the OS was doing exactly -- this paragraph is purely conjecture).

Here it's kind of counter-intuitive, but you can often get better performance by allocating a larger number of smaller blocks. Typically that makes things worse, but if we're talking about 3 million floats per memory block and a thousand memory blocks like it, it might help to start allocating in, say, page-friendly 4k chunks. Typically it's cheaper to pre-allocate memory in large blocks and pool it, but "large" in this case is more like 4-kilobyte blocks, not 10-gigabyte blocks.

std::deque will typically do this kind of thing for you, so it might be the quickest thing to try out to see if it helps. With std::deque, you should be able to make a single one for all 10 GB worth of contents without splitting it into smaller ones to avoid bad_alloc. It also doesn't have the zero-initialization overhead of the entire contents that some cited, and push_backs to it are constant-time even in the worst-case scenario (not amortized constant time as with std::vector), so I would try std::deque with actual push_backs instead of pre-sizing it and using operator[]. You could read the file contents in small chunks at a time (e.g. using 4k byte buffers) and just push back the floats. It's something to try anyway, as shown in the sketch below.
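As a rough sketch of that idea (not code from the question), the loader below streams each file through a 4 KB buffer and appends the floats to a single std::deque; the name load_all and the exact buffer size are just assumptions for illustration:

  #include <cstddef>
  #include <deque>
  #include <fstream>
  #include <string>
  #include <vector>

  // Sketch: stream one file at a time through a page-sized buffer and append
  // the floats to a single std::deque (no huge contiguous allocation needed).
  std::deque<float> load_all(const std::vector<std::string>& files) {
    std::deque<float> all;
    std::vector<float> buf(1024);  // 1024 floats == 4 KB per read
    for (const auto& path : files) {
      std::ifstream is(path, std::ios::binary);
      while (is.read(reinterpret_cast<char*>(buf.data()),
                     buf.size() * sizeof(float)) ||
             is.gcount() > 0) {
        std::size_t n = static_cast<std::size_t>(is.gcount()) / sizeof(float);
        all.insert(all.end(), buf.begin(), buf.begin() + n);  // append chunk
      }
    }
    return all;
  }

Because a deque grows in fixed-size internal blocks, no single 10 GB contiguous allocation is ever requested, which is exactly the property being suggested here.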

Anyway, these are all just educated guesses without code and profiling measurements, but these are some things to try out after your measurements.

Memory-mapped files (MMFs) may also be an ideal solution for this scenario. Let the OS handle all the tricky details of what it takes to access the file's contents.
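For illustration, a minimal POSIX sketch of that approach could look like this; map_floats is a hypothetical helper name and error handling is omitted:

  #include <cstddef>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  // Sketch: map a file read-only and view its bytes as a float array; the OS
  // pages the contents in on demand. Call munmap(p, size) when done with it.
  const float* map_floats(const char* path, std::size_t& count) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    count = static_cast<std::size_t>(st.st_size) / sizeof(float);
    return static_cast<const float*>(p);
  }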

Use multiple threads for both memory allocation and reading files. You can create a set of, say, 15 threads and let each thread pick up the next available job.

When you dig deeper, you will see that opening the file also has a considerable overhead which gets reduced substantially by using multiple threads. A sketch of this approach follows.
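A minimal sketch of that suggestion, assuming the outer vector v is already sized to the number of files as in the question's code: workers share an atomic counter and each claims the next file (the name load_parallel and the 15-thread default are placeholders):

  #include <atomic>
  #include <cstddef>
  #include <fstream>
  #include <string>
  #include <thread>
  #include <vector>

  // Sketch: each worker repeatedly claims the next unprocessed file index,
  // allocates that file's vector, and reads the raw floats into it.
  void load_parallel(std::vector<std::vector<float>*>& v,
                     const std::vector<std::string>& files,
                     std::size_t array_size, int num_threads = 15) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; ++t) {
      pool.emplace_back([&] {
        for (std::size_t i = next++; i < files.size(); i = next++) {
          v[i] = new std::vector<float>(array_size);  // allocation in the worker
          std::ifstream is(files[i], std::ios::binary);
          is.read(reinterpret_cast<char*>(v[i]->data()),
                  array_size * sizeof(float));
        }
      });
    }
    for (auto& th : pool) th.join();
  }

Since each index is claimed by exactly one thread, the workers never touch the same vector and no locking is needed.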

You don't need to hold all the data in memory. Instead, you could use something like a virtual vector, which loads the required data only when needed. That approach saves memory and doesn't expose you to the side effects of huge memory allocations.
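The standard library has no "virtual vector", but one minimal sketch of the idea is a small wrapper (a hypothetical LazyArray) that stores only the path and element count, and reads the floats on first access:

  #include <cstddef>
  #include <fstream>
  #include <string>
  #include <vector>

  // Sketch: a lazily-loading wrapper that keeps only the file path until the
  // floats are actually needed, then reads them once and caches them.
  class LazyArray {
  public:
    LazyArray(std::string path, std::size_t count)
        : path_(std::move(path)), count_(count) {}

    const std::vector<float>& get() {
      if (data_.empty()) {  // first access: load from disk
        data_.resize(count_);
        std::ifstream is(path_, std::ios::binary);
        is.read(reinterpret_cast<char*>(data_.data()), count_ * sizeof(float));
      }
      return data_;
    }

  private:
    std::string path_;
    std::size_t count_;
    std::vector<float> data_;
  };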
