
C# Importing Large Volume of Data from CSV to Database

What's the most efficient method to load a large volume of data (3 million+ rows) from CSV into a database?

  • The data needs to be formatted (e.g. the name column needs to be split into first name and last name, etc.)
  • I need to do this as efficiently as possible, i.e. under time constraints

I am leaning toward reading, transforming, and loading the data row by row with a C# application. Is this ideal? If not, what are my options? Should I use multithreading?

You will be I/O bound, so multithreading will not necessarily make it run any faster.

Last time I did this, it was about a dozen lines of C#. In one thread it ran the hard disk as fast as it could read data from the platters. I read one line at a time from the source file.
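The single-threaded, line-at-a-time approach described above can be sketched as follows. The file path and the transform are placeholders, and the naive `Split(',')` will break on quoted fields containing commas, as a later answer warns:

```csharp
using System.IO;

class StreamingLoad
{
    static void Main()
    {
        // Hypothetical input path; adjust to your environment.
        const string path = @"C:\data\people.csv";

        // StreamReader yields one line at a time, so only the current
        // line is held in memory regardless of the file's size.
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Naive split; does not handle quoted commas or
                // embedded newlines. Use a real CSV parser for those.
                var fields = line.Split(',');
                // ...transform and load the row here...
            }
        }
    }
}
```

Because the loop never buffers more than one line, memory use stays flat even at 3 million rows, and throughput is limited by disk and database speed.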

If you're not keen on writing it yourself, you could try the FileHelpers library. You might also want to have a look at Sébastien Lorion's work. His CSV reader is written specifically to deal with performance issues.

You could use the csvreader to quickly read the CSV.

Assuming you're using SQL Server, you can use csvreader's CachedCsvReader to read the data into a DataTable, which you can then load into SQL Server with SqlBulkCopy.
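A minimal sketch of that pipeline, assuming Sébastien Lorion's LumenWorks CSV reader is referenced; the file path, connection string, and destination table name are all hypothetical. Since the reader implements `IDataReader`, it can also be handed to `SqlBulkCopy` directly:

```csharp
using System.Data.SqlClient;
using System.IO;
using LumenWorks.Framework.IO.Csv; // Sébastien Lorion's CSV reader

class CsvBulkLoad
{
    static void Main()
    {
        // Hypothetical path and connection string.
        const string csvPath = @"C:\data\people.csv";
        const string connString = "Server=.;Database=Staging;Integrated Security=true";

        using (var csv = new CachedCsvReader(new StreamReader(csvPath), true))
        using (var bulk = new SqlBulkCopy(connString))
        {
            bulk.DestinationTableName = "dbo.People"; // assumed target table

            // The reader implements IDataReader, so it can be streamed
            // straight into SqlBulkCopy without an intermediate DataTable.
            bulk.WriteToServer(csv);
        }
    }
}
```

SqlBulkCopy bypasses per-row INSERT overhead entirely, which is why it is usually the fastest managed route into SQL Server.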

I would agree with your solution. Reading the file one line at a time should avoid the overhead of reading the whole file into memory at once, which should make the application run quickly and efficiently, primarily taking time to read from the file (which is relatively quick) and parse the lines. The one note of caution I have for you is to watch out if you have embedded newlines in your CSV. I don't know if the specific CSV format you're using might actually output newlines between quotes in the data, but that could confuse this algorithm, of course.

Also, I would suggest batching the insert statements (including many insert statements in one string) before sending them to the database, provided this doesn't cause problems retrieving generated key values that you need for subsequent foreign keys (hopefully you don't need to retrieve any generated key values). Keep in mind that SQL Server (if that's what you're using) can only handle about 2,100 parameters per batch, so limit your batch size to account for that. And I would recommend using parameterized T-SQL statements to perform the inserts. I suspect more time will be spent inserting records than reading them from the file.
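A sketch of that batching idea, sizing each batch to stay safely under the parameter limit. The table and column names are hypothetical; the caller is assumed to chunk its rows into groups of at most `RowsPerBatch`:

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Text;

class BatchedInserts
{
    const int ParamsPerRow = 2;                      // FirstName, LastName
    const int MaxParams = 2000;                      // stay under SQL Server's 2,100 limit
    public const int RowsPerBatch = MaxParams / ParamsPerRow;

    // Sends one multi-statement, parameterized batch to the database.
    static void InsertBatch(SqlConnection conn, IList<(string First, string Last)> rows)
    {
        var sql = new StringBuilder();
        using (var cmd = conn.CreateCommand())
        {
            for (int i = 0; i < rows.Count; i++)
            {
                sql.AppendLine(
                    $"INSERT INTO dbo.People (FirstName, LastName) VALUES (@f{i}, @l{i});");
                cmd.Parameters.AddWithValue($"@f{i}", rows[i].First);
                cmd.Parameters.AddWithValue($"@l{i}", rows[i].Last);
            }
            cmd.CommandText = sql.ToString();
            cmd.ExecuteNonQuery();
        }
    }
}
```

Each round trip then inserts ~1,000 rows instead of one, amortizing network latency while keeping the statements parameterized.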

You don't state which database you're using, but given the language you mention is C# I'm going to assume SQL Server.

If the data can't be imported using BCP (which it sounds like it can't, since it needs significant processing), then SSIS is likely to be the next fastest option. It's not the nicest development platform in the world, but it is extremely fast. It's certainly faster than any application you could write yourself in a reasonable timeframe.

BCP is pretty quick, so I'd use that for loading the data. For string manipulation I'd go with a CLR function in SQL once the data is there. Multithreading won't help in this scenario except to add complexity and hurt performance.

Read the contents of the CSV file line by line into an in-memory DataTable. You can manipulate the data (i.e. split the first name and last name, etc.) as the DataTable is being populated.

Once the CSV data has been loaded in memory, use SqlBulkCopy to send the data to the database.

See http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.writetoserver.aspx for the documentation.
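Putting the two steps above together, a minimal sketch might look like this. The file path, connection string, table name, and the single-space name split are all assumptions for illustration:

```csharp
using System.Data;
using System.Data.SqlClient;
using System.IO;

class DataTableBulkLoad
{
    static void Main()
    {
        var table = new DataTable();
        table.Columns.Add("FirstName", typeof(string));
        table.Columns.Add("LastName", typeof(string));

        // Hypothetical input; naive split that ignores quoted fields.
        using (var reader = new StreamReader(@"C:\data\people.csv"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var fields = line.Split(',');

                // Split the name column into first and last name
                // while the DataTable is being populated.
                var parts = fields[0].Split(new[] { ' ' }, 2);
                table.Rows.Add(parts[0], parts.Length > 1 ? parts[1] : "");
            }
        }

        using (var bulk = new SqlBulkCopy(
            "Server=.;Database=Staging;Integrated Security=true"))
        {
            bulk.DestinationTableName = "dbo.People"; // assumed target table
            bulk.WriteToServer(table);
        }
    }
}
```

Note that the whole DataTable is held in memory here; for 3 million rows, you may prefer the streaming `IDataReader` approach mentioned in an earlier answer.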

If you really want to do this in C#, create and populate a DataTable, truncate the target db table, then use System.Data.SqlClient.SqlBulkCopy.WriteToServer(DataTable dt).
