
Most memory efficient way to transpose a large file in C++

I have an input file that is 40,000 columns by 2 million rows. The file is roughly 70GB, and thus too large to fit in memory in one go.

I need to effectively transpose this file; however, some lines are junk and should not be added to the output.

My current implementation uses ifstream and nested getline calls, which effectively reads the whole file into memory (leaving the OS to handle memory management) and then writes out the transpose. This runs in an acceptable time, but the application obviously has a large memory footprint.

I now have to run this program on a cluster that requires me to specify memory requirements ahead of time, so a large memory footprint increases job queuing time.

I feel there has to be a more memory-efficient approach. One thought I had was using mmap, which would allow me to do the transposition without reading the whole file into memory myself. Are there any other alternatives?

To be clear, I am happy to use any language and any method that can do this in a reasonable amount of time (my current program takes around 4 minutes on this large file on a local workstation).

Thanks

I would probably do this with a pre-processing pass over the file that only needs to have one line at a time in its working set.

Filter the junk and make every line the same (binary) size.
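A minimal sketch of such a pre-processing pass follows. It assumes the fields are whitespace-separated numbers that fit in a float, and it treats any line that does not contain exactly `cols` numbers as junk; both are assumptions, since the question does not say what the data or the junk looks like. The names `parse_row` and `write_fixed_rows` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical junk rule: keep a line only if it holds exactly `cols`
// whitespace-separated tokens and every token parses fully as a number.
// Replace this with the real rule for your data.
bool parse_row(const std::string& line, std::size_t cols, std::vector<float>& out) {
    out.clear();
    std::istringstream in(line);
    std::string tok;
    while (in >> tok) {
        char* end = nullptr;
        const float v = std::strtof(tok.c_str(), &end);
        if (end == tok.c_str() || *end != '\0') return false;  // not a clean number
        out.push_back(v);
    }
    return out.size() == cols;
}

// Pass 1: stream the text file one line at a time and write each kept row
// as `cols` raw floats, so every record in the temp file has the same size.
// Returns the number of rows written.
std::size_t write_fixed_rows(const std::string& text_path,
                             const std::string& bin_path,
                             std::size_t cols) {
    assert(cols > 0);
    std::ifstream in(text_path);
    std::ofstream out(bin_path, std::ios::binary);
    std::string line;
    std::vector<float> row;
    std::size_t kept = 0;
    while (std::getline(in, line)) {
        if (!parse_row(line, cols, row)) continue;  // drop junk lines
        out.write(reinterpret_cast<const char*>(row.data()),
                  static_cast<std::streamsize>(cols * sizeof(float)));
        ++kept;
    }
    return kept;
}
```

Only one line and one parsed row are held at a time, so the working set stays small regardless of file size. The returned row count is what a later pass would need to interpret the temp file.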

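Now you can memory-map the temp file and stride the columns as rows for the output. Because every row has the same size, the value at row r, column c sits at a fixed, computable offset, so no parsing is needed to find it.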
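The strided read could look roughly like the sketch below. It assumes a POSIX system and a temp file that holds exactly rows × cols raw floats in row-major order; the function name `transpose_mapped` and the `band` parameter are hypothetical. Instead of walking one column at a time, it gathers `band` columns per sweep into a buffer, which is one possible way to cut down the number of passes over the mapping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Pass 2: map a file of rows*cols raw floats (row-major, an assumed layout)
// and write its transpose as text, `band` source columns per sweep.
// Returns false if the file cannot be opened or mapped, has an unexpected
// size, or the output stream ends up in a failed state.
bool transpose_mapped(const std::string& bin_path, const std::string& out_path,
                      std::size_t rows, std::size_t cols, std::size_t band) {
    assert(rows > 0 && cols > 0 && band > 0);
    const std::size_t bytes = rows * cols * sizeof(float);

    const int fd = open(bin_path.c_str(), O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0 || static_cast<std::size_t>(st.st_size) != bytes) {
        close(fd);
        return false;
    }
    void* map = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return false;
    const float* data = static_cast<const float*>(map);

    std::ofstream out(out_path);
    std::vector<float> buf;  // holds at most band * rows values
    for (std::size_t c0 = 0; c0 < cols; c0 += band) {
        const std::size_t width = std::min(band, cols - c0);
        buf.resize(width * rows);
        // Gather: source column c0 + j becomes contiguous run j of the buffer.
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t j = 0; j < width; ++j)
                buf[j * rows + r] = data[r * cols + c0 + j];
        // Emit each gathered column as one space-separated output line.
        for (std::size_t j = 0; j < width; ++j) {
            for (std::size_t r = 0; r < rows; ++r) {
                if (r) out << ' ';
                out << buf[j * rows + r];
            }
            out << '\n';
        }
    }
    munmap(map, bytes);
    out.flush();
    return out.good();
}
```

The buffer costs band × rows × 4 bytes, so `band` is the knob that trades memory against the number of sweeps; with 2 million rows, a band of 64 columns needs about 512MB. Values are written with the stream's default precision, which may need adjusting for real data.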

I think the best approach would be to parse each line as you read it and decide whether it is junk, then pass only the remaining lines on to the output. This may take more time, but it saves a lot of memory, because you never hold on to lines that contribute nothing to the output. Using mmap would also be a good way to achieve your goal.

Hope this helps!!
