
How to compare two large CSV files in Java

I need to compare two large CSV files and find the differences.

The first CSV file will look like this:

c71f55b6c18248b8915d8a26
64b7d2d4eab74d7999a967c0
ceb792ad21054fe0a27ec410
95319566f9424c57ba2145f9
682a4fe26c154050b8f5c6f1
88e0209e2af74049ad9bf2bd
5c462b42763d41d7bb67029f
0ee74c227fc84e39a9ecc1da
66f7ab6f56374ba08d2fb92d
3ed793e35f9441b58562c9ba
baad81ac8ba54188afe63fb8
...

Each row has just one id, and the total row count is approximately 5 million. The second CSV file looks like the first, with a total row count of 3 million.

I need to remove the ids of the second CSV from the first CSV and put the remaining ids into MongoDB. When I load all lines into memory and compare both CSV files, I get an out-of-memory error. I have 512 MB of memory and will get at least 30 requests a day. The row count of the CSVs varies between 1 million and 10 million, and I can receive two requests at the same time and have to process them simultaneously.

Is there any other way to do this?

Thanks.

If you need to manage the data in Java, you can use a Set as the basic data structure to hold it:

A collection that contains no duplicate elements

In particular, in your case the best choice is a HashSet of strings, because:

This class offers constant time performance for the basic operations (add, remove, contains and size)

This means that adding and removing items from a HashSet does not depend on the number of items already present in it. Holding 10,000,000 strings of 24 characters takes roughly half a gigabyte of RAM, so you can hold everything in memory, but consider 10,000,000 your upper limit if you are limited to half a gigabyte of RAM.

The code can be something like:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

Set<String> items = new HashSet<>();

// For each line (id) of the first file, add it to the set
// (file names are examples)
try (Stream<String> lines = Files.lines(Paths.get("csv1.csv"))) {
    lines.forEach(items::add);
}

// For each line (id) of the second file, remove it from the set
try (Stream<String> lines = Files.lines(Paths.get("csv2.csv"))) {
    lines.forEach(items::remove);
}

// Here the set contains all items of the first csv without the items
// that are also present in the second csv

For performance reasons, you should keep a representation of the second file in memory, so you can loop through the first file, check whether the entry is contained in the second one, and if not, insert the entry into MongoDB (see the sketch after the list below).

The representation of the second file should:第二个文件的表示应该:

  • be compact so as not to consume too much memory,
  • allow for a fast "contains" check.
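
As a rough sketch of that loop using the MongoDB Java sync driver (the connection string, database and collection names are placeholders; secondFile stands for the in-memory representation built below, shown here as a plain Set<String> for simplicity, and loadSecondFile() is a hypothetical helper):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical helper that builds the in-memory set from the second CSV
Set<String> secondFile = loadSecondFile();

try (MongoClient client = MongoClients.create("mongodb://localhost");
     Stream<String> lines = Files.lines(Paths.get("csv1.csv"))) {
    MongoCollection<Document> ids = client.getDatabase("mydb").getCollection("ids");
    // Insert every id of the first file that is absent from the second file
    lines.filter(line -> !secondFile.contains(line))
         .forEach(line -> ids.insertOne(new Document("_id", line)));
}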

Your data entries all seem to consist of exactly 24 hex digits. If that's true, you can represent them as 96-bit numbers instead of Strings. The most straightforward approach is:

String entry = ...
BigInteger value = new BigInteger(entry, 16);

Then, you use a Set<BigInteger> instead of a Set<String>, with considerably lower memory consumption. I'd try both HashSet and TreeSet, but I'm concerned that their memory overhead per entry might still be too much.
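
For illustration, building such a set from the second file might look like this (a sketch; the file name csv2.csv is assumed):

import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

Set<BigInteger> secondFile = new HashSet<>();
// Parse each 24-hex-digit line into a 96-bit BigInteger
try (Stream<String> lines = Files.lines(Paths.get("csv2.csv"))) {
    lines.forEach(line -> secondFile.add(new BigInteger(line, 16)));
}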

So, it might be necessary to create your own data structure, e.g. using the first (highest) 16 bits as an index into a size-65536 array where each element is a List of the file-two BigIntegers starting with that 16-bit value. This should give low memory overhead and decent contains() performance, and it needs at most 50 lines of code.
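
A minimal sketch of such a structure might look like the following (names are illustrative, not a definitive implementation):

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: 65536 buckets indexed by the highest 16 bits
// of each 96-bit id; each bucket holds the ids sharing that prefix.
class BucketedIdSet {
    @SuppressWarnings("unchecked")
    private final List<BigInteger>[] buckets = new List[65536];

    void add(BigInteger value) {
        int index = value.shiftRight(80).intValue(); // top 16 of 96 bits
        if (buckets[index] == null) {
            buckets[index] = new ArrayList<>();
        }
        buckets[index].add(value);
    }

    boolean contains(BigInteger value) {
        int index = value.shiftRight(80).intValue();
        return buckets[index] != null && buckets[index].contains(value);
    }
}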

You need to delete the data from the first CSV file that also exists in the second CSV file. As both CSVs are very large, they cannot be wholly loaded into memory, and plain Java requires a very long piece of code to do this.

It is rather simple to get this done in SPL, the open-source Java package. Only one line of code is sufficient:

 	A
1	=file("result.csv").export([file("csv1.csv").cursor@i().sortx(~),file("csv2.csv").cursor@i().sortx(~)].mergex@d())

SPL offers a JDBC driver to be invoked by Java. Just store the above SPL script as diff.splx and invoke it in Java the way you call a stored procedure:

…
Class.forName("com.esproc.jdbc.InternalDriver");
con = DriverManager.getConnection("jdbc:esproc:local://");
st = con.prepareCall("call diff()");
st.execute();
…

Or execute the SPL string within a Java program the way we execute a SQL statement:

…
st = con.prepareStatement("==file(\"result.csv\")"
        + ".export([file(\"csv1.csv\").cursor@i().sortx(~),"
        + "file(\"csv2.csv\").cursor@i().sortx(~)].mergex@d())");
st.execute();
…
