
Allocating ~10GB of vectors - how can I speed it up?

I'm loading ~1000 files, each representing an array of ~3 million floats. I need them all in memory at once, because the calculations I have to do involve all of them together.

In the code below, I've broken out the memory allocation and file reading, so I can observe the speed of each separately. I was a bit surprised to find the memory allocation taking much longer than the file reading.

  // Allocation phase: one heap-allocated vector of array_size floats per file.
  std::vector<std::vector<float> *> v(matrix_count);
  for (int i = 0; i < matrix_count; i++) {
    v[i] = new std::vector<float>(array_size);  // also zero-initializes the contents
  }

  // Reading phase: read each file's raw bytes directly into its vector.
  for (int i = 0; i < matrix_count; i++) {
    std::ifstream is(files[i], std::ios::binary);  // binary mode so raw floats aren't mangled
    is.read((char*) &((*v[i])[0]), size);          // size == array_size * sizeof(float)
    is.close();
  }

Measuring the time, I found the allocation loop took 6.8 s while the file loading took 2.5 s. It seems counter-intuitive that reading from disk is almost 3x faster than just allocating space for the data.
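The timing itself can be done with std::chrono along these lines (a minimal sketch; the time_phase helper is illustrative, not from the original code):

  #include <chrono>
  #include <iostream>

  // Illustrative helper: run a callable and report the elapsed wall time.
  template <typename F>
  double time_phase(const char* label, F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::cout << label << ": " << seconds << " s\n";
    return seconds;
  }

  // Usage: time_phase("alloc", [&]{ /* allocation loop above */ });
  //        time_phase("read",  [&]{ /* file-reading loop above */ });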

Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_alloc -- I guess a 10 GB vector isn't OK.

"Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_alloc -- I guess a 10 GB vector isn't OK."

I mainly wanted to respond by addressing this one part: bad_alloc exceptions tend to be misunderstood. They're not necessarily the result of "running out of memory" -- they're the result of the system failing to find a contiguous region of free address space large enough to satisfy the request. You can have plenty more than enough memory available in total and still get a bad_alloc if you get in the habit of requesting massive contiguous blocks, simply because no single free region that large can be found. So you can't necessarily avoid bad_alloc by "making sure plenty of memory is free": as you may have already seen, a machine with over 100 gigabytes of RAM can still be vulnerable when trying to allocate a mere 10 GB block.

The way to avoid them is to allocate memory in smaller chunks instead of one epic array. At a large enough scale, structures like unrolled linked lists can start to offer performance comparable to a gigantic array, with a far lower probability of ever getting a bad_alloc unless you actually do exhaust all the available memory. There is also a point at which contiguity, and the locality of reference it provides, stops being beneficial and may actually start to hinder memory performance (mainly due to paging, not caching).
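To illustrate, a minimal sketch of that fallback idea (allocate_floats and the chunk parameter are illustrative names, not from the post): the same total amount that fails as one contiguous request can often still be satisfied as many smaller blocks.

  #include <algorithm>
  #include <cstddef>
  #include <memory>
  #include <new>
  #include <vector>

  // Sketch: try one contiguous block first; on std::bad_alloc, fall back
  // to many smaller blocks -- same total memory, but no single huge
  // contiguous range of address space is required.
  std::vector<std::unique_ptr<float[]>> allocate_floats(std::size_t total,
                                                        std::size_t chunk) {
    std::vector<std::unique_ptr<float[]>> blocks;
    try {
      blocks.push_back(std::make_unique<float[]>(total));
    } catch (const std::bad_alloc&) {
      for (std::size_t done = 0; done < total; done += chunk)
        blocks.push_back(
            std::make_unique<float[]>(std::min(chunk, total - done)));
    }
    return blocks;
  }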

For the kind of epic-scale input you're handling, you might actually get better performance out of std::deque, given its page-friendly internal structure (this is one of the few times deque can really shine even without needing push_front). It's something to try if you don't need perfect contiguity.

Naturally it's best to measure this with an actual profiler, which will help home in on the real problem. That said, it might not be completely shocking (surprising, but maybe not shocking) that you're bottlenecked by memory here instead of disk IO, given the kind of "massive number of massive blocks" you're allocating: disk IO is slow, but heap allocation can also be expensive when you're really stressing the system. It depends a lot on the system's allocation strategy, but even slab or buddy allocators can fall back to a much slower code path when you allocate such epic blocks en masse, and in those extreme cases allocations may even start to require something resembling a search, or extra access to secondary storage. (I'm afraid I'm not sure exactly what goes on under the hood when allocating so many massive blocks; I've "felt" and measured these kinds of bottlenecks before without ever quite figuring out what the OS was doing, so this paragraph is partly conjecture.)

Here it's kind of counter-intuitive, but you can often get better performance by allocating a larger number of smaller blocks. Typically that makes things worse, but when we're talking about 3 million floats per block and a thousand blocks like it, it might help to allocate in, say, page-friendly 4 KB chunks. It's typically cheaper to pre-allocate memory in large blocks in advance and pool it, but "large" in this case means something like 4 kilobyte blocks, not 10 gigabyte ones.
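A minimal sketch of what such page-friendly chunking could look like (ChunkedFloats is an illustrative name; the 4 KB chunk size is the one suggested above):

  #include <cstddef>
  #include <memory>
  #include <vector>

  // Sketch: floats stored in 4 KB blocks (1024 floats each), indexable
  // like one big array, with no single contiguous 10 GB allocation.
  class ChunkedFloats {
    static constexpr std::size_t kChunk = 4096 / sizeof(float);  // 1024 floats
    std::vector<std::unique_ptr<float[]>> chunks_;
    std::size_t size_;
  public:
    explicit ChunkedFloats(std::size_t n) : size_(n) {
      for (std::size_t i = 0; i < n; i += kChunk)
        chunks_.push_back(std::make_unique<float[]>(kChunk));
    }
    float& operator[](std::size_t i) {
      return chunks_[i / kChunk][i % kChunk];
    }
    std::size_t size() const { return size_; }
  };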

std::deque will typically do this kind of thing for you, so it might be the quickest thing to try. With std::deque, you should be able to make a single container for all 10 GB worth of contents without splitting it into smaller ones to avoid bad_alloc. It also doesn't pay the up-front cost of zero-initializing the entire contents that some have cited, and its push_back is constant time even in the worst case (not merely amortized constant time as with std::vector), so I would try std::deque with push_back rather than pre-sizing it and using operator[]. You could read the file contents in small chunks at a time (e.g. 4 KB buffers) and push back the floats. It's something to try, anyway.
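For instance, a minimal sketch assuming the files hold raw binary floats as in the question (load_all and the buffer size are illustrative):

  #include <cstddef>
  #include <cstring>
  #include <deque>
  #include <fstream>
  #include <string>
  #include <vector>

  // Sketch: stream each file through a small page-sized buffer and
  // push the floats onto a single deque holding all the data.
  std::deque<float> load_all(const std::vector<std::string>& files) {
    std::deque<float> all;
    std::vector<char> buf(4096);  // 4 KB read buffer
    for (const auto& path : files) {
      std::ifstream is(path, std::ios::binary);
      while (is.read(buf.data(), buf.size()) || is.gcount() > 0) {
        const std::size_t n = is.gcount() / sizeof(float);
        for (std::size_t i = 0; i < n; ++i) {
          float f;
          std::memcpy(&f, buf.data() + i * sizeof(float), sizeof(float));
          all.push_back(f);
        }
      }
    }
    return all;
  }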

Anyway, these are all just educated guesses without code and profiling measurements, but they're some things to try out once you've measured.

Memory-mapped files (MMFs) may also be the ideal solution for this scenario: let the OS handle all the tricky details of accessing the file's contents.
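For example, a POSIX-only sketch using mmap, assuming the files hold raw floats (Windows has CreateFileMapping/MapViewOfFile as rough equivalents; MappedFloats is an illustrative name):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  #include <cstddef>
  #include <stdexcept>
  #include <string>

  // Sketch: map a whole file of raw floats into the address space and let
  // the OS page it in on demand -- no explicit allocation or read loop.
  struct MappedFloats {
    const float* data = nullptr;
    std::size_t count = 0;
    std::size_t bytes = 0;

    explicit MappedFloats(const std::string& path) {
      int fd = ::open(path.c_str(), O_RDONLY);
      if (fd < 0) throw std::runtime_error("open failed: " + path);
      struct stat st;
      if (::fstat(fd, &st) != 0) {
        ::close(fd);
        throw std::runtime_error("fstat failed: " + path);
      }
      bytes = static_cast<std::size_t>(st.st_size);
      void* p = ::mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
      ::close(fd);  // the mapping keeps the file contents reachable
      if (p == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);
      data = static_cast<const float*>(p);
      count = bytes / sizeof(float);
    }
    ~MappedFloats() {
      if (data) ::munmap(const_cast<float*>(data), bytes);
    }
  };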

Use multiple threads for both memory allocation and reading files. You can create a set of, say, 15 threads and let each thread pick up the next available job.

When you dig deeper, you will see that opening a file also has considerable overhead, which can be largely hidden by overlapping the work across multiple threads.
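A minimal sketch of that scheme, assuming raw-float files as in the question (load_parallel is an illustrative name, and the thread count is arbitrary):

  #include <atomic>
  #include <cstddef>
  #include <fstream>
  #include <string>
  #include <thread>
  #include <vector>

  // Sketch: a fixed set of worker threads, each repeatedly claiming the
  // next unprocessed file via a shared atomic counter. Every worker does
  // its own allocation and reading, so both phases overlap across threads.
  std::vector<std::vector<float>> load_parallel(
      const std::vector<std::string>& files, std::size_t array_size,
      unsigned num_threads = 15) {
    std::vector<std::vector<float>> result(files.size());
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
      workers.emplace_back([&] {
        while (true) {
          const std::size_t i = next.fetch_add(1);  // claim the next job
          if (i >= files.size()) break;
          result[i].resize(array_size);             // per-thread allocation
          std::ifstream is(files[i], std::ios::binary);
          is.read(reinterpret_cast<char*>(result[i].data()),
                  array_size * sizeof(float));
        }
      });
    }
    for (auto& w : workers) w.join();
    return result;
  }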

You don't need to hold all the data in memory at once. Instead, you could use something like a virtual vector, which loads the required data only when it's needed. That approach saves memory and avoids the side effects of huge allocations.
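There is no standard "virtual vector", but a hypothetical sketch of the idea might look like this (LazyArray is an illustrative name; files are assumed to hold raw floats as in the question):

  #include <cstddef>
  #include <fstream>
  #include <string>
  #include <utility>
  #include <vector>

  // Hypothetical sketch: each file's array is loaded on first access
  // and can be evicted again to reclaim memory. Not thread-safe.
  class LazyArray {
    std::string path_;
    std::size_t count_;
    mutable std::vector<float> data_;  // stays empty until first access
  public:
    LazyArray(std::string path, std::size_t count)
        : path_(std::move(path)), count_(count) {}

    const std::vector<float>& get() const {
      if (data_.empty()) {  // load on demand
        data_.resize(count_);
        std::ifstream is(path_, std::ios::binary);
        is.read(reinterpret_cast<char*>(data_.data()),
                count_ * sizeof(float));
      }
      return data_;
    }
    void evict() { data_.clear(); data_.shrink_to_fit(); }  // free the memory
  };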
