
C# crosscheck slow database with large CSV

I have a database that isn't very fast, and a big CSV of about 65,000 rows. I need to cross-check these rows against the database for existence and update the database where needed.

  • In the CSV, there is a column that contains the database IDs. It is always a 1:1 relationship.
  • The CSV may hold new input for the database, so some rows may not have a DB entry yet.
  • I cannot loop through the CSV and query the database for each row, because that is too slow.
  • Getting all results from the database up front and storing them to loop through every time won't work, because that would use a lot of RAM.

How can I do the following:

  • Check whether a row in the CSV has a database entry. If so, write it out to one CSV file.
  • If the row has no database entry, write it to a different file.
  • Keep the total runtime within 5 minutes, preferably shorter.

The CSV has a lot of columns (around 70), but I only need column 5 for cross-checking the IDs. I have tried looping through the CSV file and checking each row against the database, but that is too slow: it can take over 10 minutes. I have also tried fetching all entries from the database, looping through those, and within that loop running through the CSV (using a BufferedStream) to check each one. That decreases the time significantly (5 minutes at most), but it cannot record the entries that do not exist in the database.

Is there any way I can do this while keeping the speed up?

There is not enough information to give you a proper analysis and end up with an ironclad solution to the problem, but I can give some suggestions. For the record, a CSV with 65,000 records is not that huge. I also disagree that walking a file is too slow: I have personally used a StreamReader to compare files that were gigabytes in size, more than likely an order of magnitude larger than yours.

First, you can consider turning the problem on its head. Rather than querying the database as you run through the CSV, consider pulling the entire set into memory (not a great idea if you have a huge database, but an option if it is manageable). If it is a bit larger, you can even write the database side (assuming it is a single table or view, or a query that could be a view) out to a different CSV. The core focus here is getting the slow database out of the loop; see the sketch below. NOTE: If this is a highly transactional system and you need an "up to the minute (or 5 minute) accurate snapshot", this may not suffice, but I find that an unrealistic expectation anyway (the data you see now still represents 5 minutes ago, despite numerous edits, that is).
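
A minimal sketch of that first step, assuming SQL Server and a table named Items with a bigint Id column (both names are placeholders for your actual schema); it pulls just the ID column into a HashSet in one pass, so the database is touched exactly once:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

static HashSet<long> LoadDatabaseIds(string connectionString)
{
    var ids = new HashSet<long>();
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand("SELECT Id FROM Items", connection))
    {
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
                ids.Add(reader.GetInt64(0)); // one pass; the DB is out of the loop after this
        }
    }
    return ids;
}
```

At 65,000-plus entries this set costs a few megabytes at most.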

Next, you can consider reducing the set. An easy way, already mentioned in your question, is cutting the working CSV from 70 columns down to the one column (column 5) you actually need for the comparison. The same can be done with the data you pull out of the database. This will only help if the time to load is the bottleneck, and based on your description I seriously doubt that is the case.
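
As a sketch of that filtering step, you can stream the file and keep only column 5. This assumes plain comma-separated fields with no quoted, embedded commas; use a proper CSV parser such as CsvHelper if your data has them.

```csharp
using System.Collections.Generic;
using System.IO;

static IEnumerable<string> ReadCsvIds(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            var fields = line.Split(',');
            if (fields.Length > 4)
                yield return fields[4].Trim(); // column 5 (index 4) holds the database ID
        }
    }
}
```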

You can also consider putting the two sets of data into memory and doing the comparison there, which is very fast. This won't work if the two sets are too large to fit in memory, which is why filtering down to the columns you need is a useful exercise.
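
Combining the two sketches above, a single pass over the CSV with O(1) lookups against the in-memory ID set routes every row to one of the two output files the question asks for (file names are illustrative):

```csharp
using System.Collections.Generic;
using System.IO;

static void SplitCsv(string inputPath, HashSet<long> dbIds)
{
    using (var input = new StreamReader(inputPath))
    using (var existing = new StreamWriter("existing.csv"))
    using (var missing = new StreamWriter("missing.csv"))
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            var fields = line.Split(',');
            if (fields.Length > 4
                && long.TryParse(fields[4], out long id)
                && dbIds.Contains(id))
            {
                existing.WriteLine(line); // row has a matching DB entry
            }
            else
            {
                missing.WriteLine(line); // no DB entry yet
            }
        }
    }
}
```

With both sides reduced to IDs in memory, the whole job is one database query plus one sequential read of the CSV, which should comfortably beat the 5-minute budget.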

Since you mention database IDs, it sounds like the CSV checks against more than one database. In that case, consider ordering the CSV by database ID first. Sort algorithms are very fast and should be able to sort 65,000 records in a matter of seconds; the bottleneck with sorting is generally the amount of memory and the speed of I/O (primarily disk speed). You can then attack each database in turn.
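
If you do want the file ordered first, 65,000 rows sort comfortably in memory. A sketch, assuming numeric IDs in column 5 and no header row:

```csharp
using System.IO;
using System.Linq;

static string[] SortByDatabaseId(string path)
{
    return File.ReadAllLines(path)
               .OrderBy(line => long.Parse(line.Split(',')[4]))
               .ToArray();
}
```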

As I stated at the beginning, I only have enough information to give hints, not actual solutions, but hopefully this spurs some ideas.

Late answer, but I fixed it this way: I pull the CSV columns that I need into a DataTable. Then I fetch all database rows that I need to check (they have a certain number I can filter on) and run through those rows. For each database row, I look up the corresponding ID in the DataTable, write the data to a new CSV, and then delete that row from the DataTable. In the end I have a CSV with rows that do exist and will be imported into the system, and a DataTable with the remaining rows, which is exported to a CSV of entries that need to be added.
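
A minimal sketch of that approach, with illustrative names; it assumes the DataTable's PrimaryKey has been set to the ID column, so that Rows.Find is an indexed lookup rather than a scan:

```csharp
using System.Data;
using System.IO;

static void CrossCheck(DataTable csvTable, IDataReader dbRows,
                       StreamWriter existingCsv, StreamWriter newCsv)
{
    // Prerequisite (done when loading the CSV):
    // csvTable.PrimaryKey = new[] { csvTable.Columns["Id"] };
    while (dbRows.Read())
    {
        DataRow match = csvTable.Rows.Find(dbRows.GetInt64(0));
        if (match != null)
        {
            existingCsv.WriteLine(string.Join(",", match.ItemArray));
            csvTable.Rows.Remove(match); // only unmatched rows remain
        }
    }

    // Everything left has no database entry and needs to be added.
    foreach (DataRow row in csvTable.Rows)
        newCsv.WriteLine(string.Join(",", row.ItemArray));
}
```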

Thanks to Gregory for helping me get on the right track.
