C# crosscheck slow database with large CSV

I have a database that isn't very fast, and a big CSV of about 65,000 rows. I need to cross-check these for existence and update the database where needed.

  • In the CSV, there is a column that contains the database IDs. It is always a 1:1 relationship.
  • The CSV may hold new input for the database, so it can happen that there are no DB entries for it yet.
  • I cannot loop through the CSV and check each row against the database, because that is too slow.
  • Getting all results from the database up front and storing them to loop through each time won't work either, because that would use a lot of RAM.

How can I do the following:

  • Check whether a row in the CSV has a database entry. If so, write it to another CSV file.
  • If the row has no database entry, write it to a different file.
  • Keep the total running time within 5 minutes, preferably shorter.

The CSV has a lot of columns (70, for example), but I only need column 5 for cross-checking the IDs. I have tried looping through the CSV file and checking each row against the database, but that is too slow; it can take over 10 minutes. I have also tried getting all entries from the database first and looping through those, running through the CSV (using a BufferedStream) inside the loop to check each one. That does decrease the time significantly (5 minutes at most), but it cannot record the entries that do not exist in the database.

Is there any way I can do this while keeping the speed up?

There is not enough information to give you a proper analysis and end up with an ironclad solution to the problem, but I can give some suggestions. For the record, a CSV with 65,000 records is not that huge. I also disagree that walking a file is too slow, as I have personally used a StreamReader to compare files that were gigabytes in size, which is more than likely an order of magnitude larger.

First, you can consider turning the problem on its head. Rather than querying the database as you run through the CSV, consider pulling the entire set into memory (not a great idea if you have a huge database, but an option if it is manageable). If it is a bit larger, you can even write the database out (assuming this is a single table or view, or a query that could be a view) to a different CSV. The core focus here is getting the slow database out of the loop. NOTE: if this is a highly transactional system and you need an "up to the minute (or 5 minute) accurate" snapshot, this may not suffice. I find that an unrealistic expectation anyway (the data you have now still represents the state 5 minutes ago, despite any edits made since then).
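
As a minimal sketch of that idea, assuming a SQL Server back end reachable through Microsoft.Data.SqlClient, the ID column alone can be pulled into a set in memory; the table name MyTable and column Id below are placeholders for the real schema:

```csharp
using System.Collections.Generic;
using Microsoft.Data.SqlClient; // System.Data.SqlClient on older projects

static HashSet<string> LoadDatabaseIds(string connectionString)
{
    var ids = new HashSet<string>();

    using var connection = new SqlConnection(connectionString);
    connection.Open();

    // Select only the ID column so even a large table stays cheap to hold in memory.
    using var command = new SqlCommand("SELECT Id FROM MyTable", connection);
    using var reader = command.ExecuteReader();
    while (reader.Read())
        ids.Add(reader.GetValue(0).ToString());

    return ids;
}
```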

Next, you can consider reducing the set. An easy way, already mentioned in your question, is cutting the working CSV from 70 columns down to the ones you actually need. The same is true if you pull the equivalent data out of the database for comparison. This will only help if load time is the bottleneck, and based on your description I seriously doubt that is the case.
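
As a rough illustration of that reduction, the 70-column file can be streamed once and trimmed down to just the ID column (column 5, index 4). This assumes a plain comma-separated file with no quoted fields containing commas; a real CSV parser is safer for messy data:

```csharp
using System.Collections.Generic;
using System.IO;

// Stream the CSV and keep only column 5, the database ID.
static IEnumerable<string> ReadCsvIds(string csvPath)
{
    using var reader = new StreamReader(csvPath);
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        var columns = line.Split(',');
        if (columns.Length > 4)
            yield return columns[4];
    }
}
```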

You can also consider putting the two pieces of data into memory and doing the comparison there, which is very fast. This won't work if you can't fit both sets into memory due to size, which is why filtering down to the columns you need is a useful exercise.
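
Putting those pieces together, a single in-memory pass could look like the sketch below. The file names existing.csv and missing.csv are illustrative, and dbIds is assumed to be the set of IDs loaded from the database (for example with the earlier LoadDatabaseIds sketch):

```csharp
using System.Collections.Generic;
using System.IO;

// Split the input CSV into rows whose ID exists in the database and rows that do not.
static void SplitCsv(string csvPath, HashSet<string> dbIds)
{
    using var reader   = new StreamReader(csvPath);
    using var existing = new StreamWriter("existing.csv");
    using var missing  = new StreamWriter("missing.csv");

    string line;
    while ((line = reader.ReadLine()) != null)
    {
        var id = line.Split(',')[4];                        // column 5 holds the database ID
        (dbIds.Contains(id) ? existing : missing).WriteLine(line);
    }
}
```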

Since you mention database IDs (plural), it sounds like the CSV is checked against more than one database. Consider ordering the CSV by database ID first. As mentioned, there are sort algorithms that are very fast and should be able to sort 65,000 records in a matter of seconds. The bottleneck with sorting is generally the amount of memory and the speed of I/O (primarily disk speed). You can then attack each database in turn.
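
At this size the sort itself is cheap; a hedged sketch with illustrative file names:

```csharp
using System;
using System.IO;
using System.Linq;

// Order the rows by the database ID in column 5 (index 4) before processing.
var sorted = File.ReadLines("input.csv")
                 .OrderBy(row => row.Split(',')[4], StringComparer.Ordinal)
                 .ToList();

File.WriteAllLines("sorted.csv", sorted);
```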

As I stated at the beginning, I only have enough information to give hints, not actual solutions, but hopefully this spurs some ideas.

Late answer, but I have fixed it this way: I pull the CSV columns that I need into a DataTable. Then I fetch all database rows that I need to check (they have a certain value I can filter on) and run through those rows. Each database row is checked for a corresponding ID in the DataTable; if found, the data is written to a new CSV and the row in the DataTable is deleted. In the end I have a CSV with rows that do exist and will be imported into the system, and a DataTable that will be exported to a CSV with rows that still need to be added.
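
A rough sketch of that flow, with hypothetical column and file names, and with the database fetch left abstract since the filtering criteria are not shown, might look like this:

```csharp
using System.Collections.Generic;
using System.Data;
using System.IO;

// csvRows: the CSV columns of interest loaded into a DataTable (with an "Id" column).
// databaseRows: the filtered database rows fetched for checking.
static void CrossCheck(DataTable csvRows, IEnumerable<DataRow> databaseRows)
{
    using var existing = new StreamWriter("existing.csv");

    foreach (DataRow dbRow in databaseRows)
    {
        var id = dbRow["Id"].ToString();

        // Find the matching CSV row, write it to the "exists" file and drop it from the table.
        // Assumes simple IDs with no quote characters in them.
        foreach (var match in csvRows.Select($"Id = '{id}'"))
        {
            existing.WriteLine(string.Join(",", match.ItemArray));
            csvRows.Rows.Remove(match);
        }
    }

    // Whatever is left in the DataTable has no database entry and still needs to be added.
    using var toAdd = new StreamWriter("to_add.csv");
    foreach (DataRow row in csvRows.Rows)
        toAdd.WriteLine(string.Join(",", row.ItemArray));
}
```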

Thanks to Gregory for helping me get on the right track.
