
Performance of comparing and querying two really big CSV files in Java

I need to compare two CSV files, each with around 500,000 to 900,000 lines (yes, they're big), and I'd like to know the best way to do this.

What I need to do

  • Delete rows in CSV1 that are not in CSV2, using a key value (code)
  • Delete rows on each side at some given hours
  • Show the difference in some fields like "Quantity", filtering by fields like "City" or "date"

I could try to store each CSV file in a Java list and create a database (using SQLite) with the final result (the differences and the deleted rows), then run queries against that database: select only one city, certain dates/hours, or certain codes (or all of them at once; the end user will apply the filters from the interface using checkboxes or comboboxes).

Each CSV file looks something like this:

CITY;       CODE;          DATETIME;       Quantity
city1; city_1_code_1; DD/MM/YYYY hh:mm:ss;   2500
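For illustration, a row of that shape could be parsed into a plain Java record like the sketch below. The `Row` type, the field order, and the concrete date string are assumptions taken from the sample header; real input would need error handling for malformed lines.

```java
// Minimal parsing sketch, assuming the semicolon-separated layout
// CITY;CODE;DATETIME;Quantity shown in the sample above.
public class CsvRowSketch {
    record Row(String city, String code, String dateTime, long quantity) {}

    static Row parse(String line) {
        // Split on ';' and trim the stray padding spaces from the sample.
        String[] f = line.split(";");
        return new Row(f[0].trim(), f[1].trim(), f[2].trim(),
                       Long.parseLong(f[3].trim()));
    }

    public static void main(String[] args) {
        Row r = parse("city1; city_1_code_1; 01/02/2020 10:30:00; 2500");
        System.out.println(r.code() + " -> " + r.quantity());
    }
}
```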

I'm not sure which approach is best performance-wise. Should I keep the data in memory and just use lists for the comparisons? If not, is SQLite good enough for this, or should I use something different? Am I missing a better way to do this operation?

I'm developing this with JavaFX, and the results will be shown in a table (that's not a problem at all, just to give you context).

Thanks in advance, and let me know if you need more information.

You'll never know for sure until you test for performance, but it seems SQLite can handle a million rows easily. Some Stack Overflow users report working with much larger data sets.

From a maintainability perspective, using a database with proper indexing is the way to go, provided it's fast enough. If it's not fast enough for your needs, you can consider other, more complex approaches.
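To give an idea of what the database approach would look like, here is a hypothetical SQLite schema and two of the required operations expressed as SQL. All table and column names are assumptions derived from the sample CSV header; the indexes on `code` are what makes the anti-join and the comparison fast at ~1M rows.

```sql
-- Hypothetical schema; names taken from the sample CSV header.
CREATE TABLE csv1 (city TEXT, code TEXT, dt TEXT, quantity INTEGER);
CREATE TABLE csv2 (city TEXT, code TEXT, dt TEXT, quantity INTEGER);
CREATE INDEX idx_csv1_code ON csv1(code);
CREATE INDEX idx_csv2_code ON csv2(code);

-- Requirement 1: delete CSV1 rows whose code is not in CSV2.
DELETE FROM csv1
WHERE code NOT IN (SELECT code FROM csv2);

-- Requirement 3: quantity difference per code, filtered by city.
SELECT a.code, a.quantity - b.quantity AS diff
FROM csv1 a JOIN csv2 b ON a.code = b.code
WHERE a.city = 'city1';
```

The same statements can be issued from Java through JDBC (e.g. with the sqlite-jdbc driver), and the filter values from the UI checkboxes/comboboxes become bound parameters in the `WHERE` clause.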

If you decide to use in-memory lists, consider one of the high-performance collections libraries available in the Java ecosystem. I can't recommend a specific one, but you can look around to get an idea. Chances are, though, that unless you operate on the entire collection very often, the SQLite approach may still be faster (again, testing is key).
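If you go the in-memory route, the key point is to avoid O(n²) list-against-list scans: keying each file by CODE turns the "delete rows not in the other file" step and the quantity diff into O(n) map operations. A minimal sketch with plain `java.util` collections, using made-up codes and quantities:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: each CSV file reduced to a map of CODE -> Quantity.
public class CsvDiffSketch {
    public static void main(String[] args) {
        Map<String, Long> csv1 = new LinkedHashMap<>();
        csv1.put("code_1", 2500L);
        csv1.put("code_2", 100L);   // not present in CSV2

        Map<String, Long> csv2 = new LinkedHashMap<>();
        csv2.put("code_1", 2400L);

        // Requirement 1: drop CSV1 rows whose key is missing from CSV2.
        csv1.keySet().retainAll(csv2.keySet());

        // Requirement 3: quantity difference per surviving key.
        for (var e : csv1.entrySet()) {
            long diff = e.getValue() - csv2.get(e.getKey());
            System.out.println(e.getKey() + " diff=" + diff);
        }
    }
}
```

With ~1M entries per map this fits comfortably in a default JVM heap, but the real rows carry more fields than this sketch, so measure before committing to it.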

Finally, a middle-of-the-road approach would be to use an in-memory database.
