
How to compare differences in very large CSV files

I have to compare two CSV files, each 2-3 GB in size, on a Windows platform.

I've tried loading the first one into a HashMap to compare it against the second one, but the result (as expected) was very high memory consumption.

The target is to get the differences in another file.

The lines may appear in a different order, and some lines may be missing.

Any suggestions?

Assuming you want to do this programmatically in Java, the answer depends on whether the files are sorted.

Are both of the files ordered? If so, then you don't need to read in the whole files: simply start at the beginning of both files, and

  1. If the entries match, advance the "current" line in both files.
  2. If the entries don't match, determine which file's line sorts first, write that line to the output, and advance the current line in that file.
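The merge-walk described above can be sketched like this (a minimal sketch, assuming lines are compared as raw strings; the class and method names are mine):

```java
import java.io.*;

public class SortedDiff {
    // Walk two sorted files in lockstep. Matching lines are skipped;
    // lines present in only one of the files are written to out,
    // prefixed "< " (only in the first) or "> " (only in the second).
    static void diffSorted(BufferedReader a, BufferedReader b, PrintWriter out) throws IOException {
        String la = a.readLine(), lb = b.readLine();
        while (la != null && lb != null) {
            int cmp = la.compareTo(lb);
            if (cmp == 0) {            // same line in both files: advance both
                la = a.readLine();
                lb = b.readLine();
            } else if (cmp < 0) {      // this line sorts first and is only in file A
                out.println("< " + la);
                la = a.readLine();
            } else {                   // this line sorts first and is only in file B
                out.println("> " + lb);
                lb = b.readLine();
            }
        }
        // One file ended; everything left in the other is a difference.
        for (; la != null; la = a.readLine()) out.println("< " + la);
        for (; lb != null; lb = b.readLine()) out.println("> " + lb);
    }
}
```

Only two lines are ever held in memory at once, which is what makes this work for multi-gigabyte files.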

If you don't have ordered files, then perhaps you could order the files prior to the diff. Again, since you need a low-memory solution, don't read the entire file in to sort it. Chop the file up into manageable chunks, sort each chunk, and then merge the sorted chunks (an external merge sort).
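A minimal external-sort sketch (class names and the line-based chunking are my own assumptions; chunk size would be tuned to available memory):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {
    // Sort a large file in bounded memory: sort fixed-size chunks into
    // temp files, then k-way merge the chunks with a priority queue.
    static Path sortLargeFile(Path input, int linesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buf = new ArrayList<>(linesPerChunk);
            String line;
            while ((line = in.readLine()) != null) {
                buf.add(line);
                if (buf.size() == linesPerChunk) {
                    chunks.add(writeSortedChunk(buf));
                    buf.clear();
                }
            }
            if (!buf.isEmpty()) chunks.add(writeSortedChunk(buf));
        }
        Path out = Files.createTempFile("sorted", ".csv");
        mergeChunks(chunks, out);
        return out;
    }

    private static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);                      // in-memory sort of one chunk only
        Path chunk = Files.createTempFile("chunk", ".csv");
        Files.write(chunk, lines);
        return chunk;
    }

    private static class Head {                       // current line of one chunk reader
        String line; BufferedReader reader;
        Head(String l, BufferedReader r) { line = l; reader = r; }
    }

    private static void mergeChunks(List<Path> chunks, Path out) throws IOException {
        // The queue holds one "head" line per chunk, so memory use is
        // proportional to the number of chunks, not the file size.
        PriorityQueue<Head> pq = new PriorityQueue<>(Comparator.comparing((Head h) -> h.line));
        for (Path c : chunks) {
            BufferedReader r = Files.newBufferedReader(c);
            String first = r.readLine();
            if (first != null) pq.add(new Head(first, r)); else r.close();
        }
        try (BufferedWriter w = Files.newBufferedWriter(out)) {
            while (!pq.isEmpty()) {
                Head h = pq.poll();
                w.write(h.line);
                w.newLine();
                String next = h.reader.readLine();
                if (next != null) { h.line = next; pq.add(h); } else h.reader.close();
            }
        }
    }
}
```

Once both files are sorted this way, the lockstep comparison from the first answer can produce the difference file.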

The Unix command diff can handle exact matches.

You can also run it with the -b flag to ignore whitespace-only differences.

Use uniVocity-parsers, as it comes with the fastest CSV parser for Java. You can process files as big as 100 GB without any issue, and very quickly.

For comparison of large CSV files, I suggest you use your own implementation of RowProcessor and wrap it in a ConcurrentRowProcessor.

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

I suggest you compare line by line rather than loading the entire file into memory, or try loading just a group of lines at a time.

There is a Java library, OpenCSV, for parsing CSV files. Lazy loading of the file can be built on top of it. Check this article. Hope it helps.

Here is another similar post on Stack Overflow in which I have given the outline of a solution that requires only the smaller of the two files to be stored in memory:

How to compare two large CSV files and get the difference file

This is the general solution, which doesn't require the files to be ordered, since you state in the question that the order of lines may differ.

Anyway, even that can be avoided. I won't repeat the solution here, but the idea is to index one file and then walk through the other. You can avoid storing the entire smaller file in memory by holding only the hash and file offset of each row in the index. That way you will touch the file on disk many times, but you won't have to keep it in memory.

Running time of the algorithm is O(N + M). Memory consumption is O(min(N, M)).
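The index-and-walk idea above can be sketched as follows (a sketch under my own assumptions, not the linked answer's exact code: lines are compared as whole strings, content is ASCII/Latin-1 since RandomAccessFile.readLine decodes bytes as Latin-1, and the class name is mine):

```java
import java.io.*;
import java.util.*;

public class IndexedDiff {
    // Index the smaller file by line hash -> byte offsets, then stream the
    // larger file. Only the index (a few bytes per row) stays in memory;
    // candidate matches are verified by seeking back into the smaller file,
    // which also rules out hash collisions.
    static List<String> diff(File small, File large) throws IOException {
        Map<Integer, List<Long>> index = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(small, "r")) {
            long offset = 0;
            String line;
            while ((line = raf.readLine()) != null) {
                index.computeIfAbsent(line.hashCode(), k -> new ArrayList<>()).add(offset);
                offset = raf.getFilePointer();
            }
        }
        List<String> onlyInLarge = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(small, "r");
             BufferedReader in = new BufferedReader(new FileReader(large))) {
            String line;
            while ((line = in.readLine()) != null) {
                boolean found = false;
                for (long off : index.getOrDefault(line.hashCode(), Collections.emptyList())) {
                    raf.seek(off);               // re-read the candidate row from disk
                    if (line.equals(raf.readLine())) { found = true; break; }
                }
                if (!found) onlyInLarge.add(line);
            }
        }
        return onlyInLarge;
    }
}
```

This finds lines of the larger file that are absent from the smaller one; the reverse direction can be handled symmetrically (e.g. by marking matched offsets during the walk).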
