
How to compare differences in very large CSV files

I have to compare two CSV files, each 2-3 GB in size, on a Windows platform.

I've tried loading the first one into a HashMap to compare it against the second one, but the result (as expected) was very high memory consumption.

The target is to get the differences in another file.

The lines may appear in a different order, and some lines may be missing.

Any suggestions?

Assuming you want to do this programmatically in Java, the answer depends on whether the files are sorted.

Are both of the files ordered? If so, then you don't need to read in the whole files: simply start at the beginning of both files, and

  1. If the entries match, advance the "current" line in both files.
  2. If the entries don't match, determine which file's line sorts first, write that line to the output, and advance the current line in that file.
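The merge-walk described above can be sketched like this (a minimal sketch, assuming lines are compared as raw strings; the class and method names are mine):

```java
import java.io.*;

public class SortedDiff {
    // Walk two sorted files in lockstep. Matching lines are skipped;
    // lines present in only one of the files are written to out,
    // prefixed "< " (only in the first) or "> " (only in the second).
    static void diffSorted(BufferedReader a, BufferedReader b, PrintWriter out) throws IOException {
        String la = a.readLine(), lb = b.readLine();
        while (la != null && lb != null) {
            int cmp = la.compareTo(lb);
            if (cmp == 0) {            // same line in both files: advance both
                la = a.readLine();
                lb = b.readLine();
            } else if (cmp < 0) {      // this line sorts first and is only in file A
                out.println("< " + la);
                la = a.readLine();
            } else {                   // this line sorts first and is only in file B
                out.println("> " + lb);
                lb = b.readLine();
            }
        }
        // One file ended; everything left in the other is a difference.
        for (; la != null; la = a.readLine()) out.println("< " + la);
        for (; lb != null; lb = b.readLine()) out.println("> " + lb);
    }
}
```

Only two lines are ever held in memory at once, which is what makes this work for multi-gigabyte files.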

If you don't have ordered files, then perhaps you could order the files prior to the diff. Again, since you need a low-memory solution, don't read the entire file in to sort it. Chop the file up into manageable chunks, sort each chunk, and then merge the sorted chunks (an external merge sort).
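A minimal external-sort sketch (class names and the line-based chunking are my own assumptions; chunk size would be tuned to available memory):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {
    // Sort a large file in bounded memory: sort fixed-size chunks into
    // temp files, then k-way merge the chunks with a priority queue.
    static Path sortLargeFile(Path input, int linesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buf = new ArrayList<>(linesPerChunk);
            String line;
            while ((line = in.readLine()) != null) {
                buf.add(line);
                if (buf.size() == linesPerChunk) {
                    chunks.add(writeSortedChunk(buf));
                    buf.clear();
                }
            }
            if (!buf.isEmpty()) chunks.add(writeSortedChunk(buf));
        }
        Path out = Files.createTempFile("sorted", ".csv");
        mergeChunks(chunks, out);
        return out;
    }

    private static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);                      // in-memory sort of one chunk only
        Path chunk = Files.createTempFile("chunk", ".csv");
        Files.write(chunk, lines);
        return chunk;
    }

    private static class Head {                       // current line of one chunk reader
        String line; BufferedReader reader;
        Head(String l, BufferedReader r) { line = l; reader = r; }
    }

    private static void mergeChunks(List<Path> chunks, Path out) throws IOException {
        // The queue holds one "head" line per chunk, so memory use is
        // proportional to the number of chunks, not the file size.
        PriorityQueue<Head> pq = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
        for (Path c : chunks) {
            BufferedReader r = Files.newBufferedReader(c);
            String first = r.readLine();
            if (first != null) pq.add(new Head(first, r)); else r.close();
        }
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            while (!pq.isEmpty()) {
                Head h = pq.poll();
                w.write(h.line);
                w.newLine();
                String next = h.reader.readLine();
                if (next != null) { h.line = next; pq.add(h); } else h.reader.close();
            }
        }
    }
}
```

Once both files are sorted this way, the lockstep comparison from the first answer can produce the difference file.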

The Unix command diff can handle exact matches.

You can also run it with the -b flag to ignore whitespace-only differences.

Use uniVocity-parsers, as it comes with the fastest CSV parser for Java. You can process files as big as 100 GB without any issue, and very quickly.

For comparison of large CSV files, I suggest you use your own implementation of RowProcessor and wrap it in a ConcurrentRowProcessor.

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

I suggest you compare line by line rather than loading the entire file into memory, or try loading just a group of lines at a time.

There is a Java library, OpenCSV, for parsing CSV files. Lazy loading of the file can be built on top of it. Check this article. Hope it helps.

Here is another similar post on Stack Overflow in which I have given the outline of a solution that requires only the smaller of the two files to be stored in memory:

How to compare two large CSV files and get the difference file

This is the general solution, which doesn't require the files to be ordered, since you state in the question that the order of lines may differ.

Anyway, even that can be avoided. I won't repeat the solution here, but the idea is to index one file and then walk through the other. You can avoid storing the entire smaller file in memory by holding only the hash and file offset of each row in the index. That way you will touch the file on disk many times, but you won't have to keep it in memory.

Running time of the algorithm is O(N + M). Memory consumption is O(min(N, M)).
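The index-and-walk idea above can be sketched as follows (a sketch under my own assumptions, not the linked answer's exact code: lines are compared as whole strings, content is ASCII/Latin-1 since RandomAccessFile.readLine decodes bytes as Latin-1, and the class name is mine):

```java
import java.io.*;
import java.util.*;

public class IndexedDiff {
    // Index the smaller file by line hash -> byte offsets, then stream the
    // larger file. Only the index (a few bytes per row) stays in memory;
    // candidate matches are verified by seeking back into the smaller file,
    // which also rules out hash collisions.
    static List<String> diff(File small, File large) throws IOException {
        Map<Integer, List<Long>> index = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(small, "r")) {
            long offset = 0;
            String line;
            while ((line = raf.readLine()) != null) {
                index.computeIfAbsent(line.hashCode(), k -> new ArrayList<>()).add(offset);
                offset = raf.getFilePointer();
            }
        }
        List<String> onlyInLarge = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(small, "r");
             BufferedReader in = new BufferedReader(new FileReader(large))) {
            String line;
            while ((line = in.readLine()) != null) {
                boolean found = false;
                for (long off : index.getOrDefault(line.hashCode(), Collections.emptyList())) {
                    raf.seek(off);               // re-read the candidate row from disk
                    if (line.equals(raf.readLine())) { found = true; break; }
                }
                if (!found) onlyInLarge.add(line);
            }
        }
        return onlyInLarge;
    }
}
```

This finds lines of the larger file that are absent from the smaller one; the reverse direction can be handled symmetrically (e.g. by marking matched offsets during the walk).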
