
What is the fastest way to load an XML file into MySQL using C#?

Question

What is the fastest way to dump a large (> 1GB) XML file into a MySQL database?

Data

The data in question is the StackOverflow Creative Commons Data Dump.

Purpose

This will be used in an offline StackOverflow viewer I am building, since I am looking to do some studying/coding in places where I will not have access to the internet.

I would like to release this to the rest of the StackOverflow membership for their own use when the project is finished.

Problem

Originally, I was reading from XML and writing to the DB one record at a time. This took about 10 hours to run on my machine. The hacktastic code I'm using now throws 500 records into an array, then creates an insertion query to load all 500 at once (e.g. "INSERT INTO posts VALUES (...), (...), (...) ... ;"). While this is faster, it still takes hours to run. Clearly this is not the best way to go about it, so I'm hoping the big brains on this site will know of a better way.
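For illustration, here is a minimal sketch of that batched-INSERT approach using MySQL Connector/Net. The posts table and its id/body columns are placeholders (a real dump row has many more fields), and the combined statement must still fit within max_allowed_packet:

    using System.Collections.Generic;
    using System.Text;
    using MySql.Data.MySqlClient;

    static class BatchInserter
    {
        // Inserts all rows in a single multi-row INSERT statement.
        public static void InsertBatch(MySqlConnection conn, IList<KeyValuePair<int, string>> rows)
        {
            // Builds "INSERT INTO posts (id, body) VALUES (@id0, @body0), (@id1, @body1), ..."
            var sql = new StringBuilder("INSERT INTO posts (id, body) VALUES ");
            using (var cmd = new MySqlCommand())
            {
                cmd.Connection = conn;
                for (int i = 0; i < rows.Count; i++)
                {
                    if (i > 0) sql.Append(", ");
                    sql.AppendFormat("(@id{0}, @body{0})", i);
                    cmd.Parameters.AddWithValue("@id" + i, rows[i].Key);
                    cmd.Parameters.AddWithValue("@body" + i, rows[i].Value);
                }
                cmd.CommandText = sql.ToString();
                cmd.ExecuteNonQuery(); // one round-trip for the whole batch
            }
        }
    }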

Constraints

  • I am building the application using C# as a desktop application (i.e. WinForms).
  • I am using MySQL 5.1 as my database. This means that features such as "LOAD XML INFILE filename.xml" are not usable in this project, as this feature is only available in MySQL 5.4 and above. This constraint is largely due to my hope that the project would be useful to people other than myself, and I'd rather not force people to use Beta versions of MySQL.
  • I'd like the data load to be built into my application (i.e. no instructions to "Load the dump into MySQL using 'foo' before running this application.").
  • I'm using MySQL Connector/Net, so anything in the MySql.Data namespace is acceptable.

Thanks for any pointers you can provide!


Ideas so far

A stored procedure that loads an entire XML file into a column, then parses it using XPath:

  • This didn't work since the file size is subject to the limitations of the max_allowed_packet variable, which is set to 1 MB by default. This is far below the size of the data dump files.

There are two parts to this:

  • reading the XML file
  • writing to the database

For reading the XML file, this link http://csharptutorial.blogspot.com/2006/10/reading-xml-fast.html shows that 1 MB can be read in 2.4 seconds using a stream reader; at that rate, a 1 GB file would take about 2,400 seconds, or 40 minutes (if my maths is working this late).

From what I have read, the fastest way to get data into MySQL is to use LOAD DATA.

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

Therefore, if you can read the XML data, write it to files that can be used by LOAD DATA, then run LOAD DATA. The total time may be less than the hours you are currently experiencing.
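A rough sketch of that two-step pipeline, assuming the Stack Overflow dump format (each record is a <row> element with attributes) and, again, a placeholder posts(id, body) table. Connector/Net exposes LOAD DATA through the MySqlBulkLoader class; the LOCAL variant requires local-infile to be enabled on both client and server:

    using System.IO;
    using System.Xml;
    using MySql.Data.MySqlClient;

    static class LoadDataPipeline
    {
        // Step 1: stream the XML and write a tab-delimited file LOAD DATA can consume.
        public static void XmlToTsv(string xmlPath, string tsvPath)
        {
            using (var reader = XmlReader.Create(xmlPath))
            using (var writer = new StreamWriter(tsvPath))
            {
                while (reader.ReadToFollowing("row"))
                {
                    string id = reader.GetAttribute("Id");
                    string body = reader.GetAttribute("Body") ?? @"\N"; // \N means SQL NULL
                    // Escape the characters LOAD DATA treats specially (backslash first).
                    body = body.Replace(@"\", @"\\")
                               .Replace("\t", @"\t")
                               .Replace("\n", @"\n")
                               .Replace("\r", @"\r");
                    writer.WriteLine(id + "\t" + body);
                }
            }
        }

        // Step 2: hand the whole file to the server in a single LOAD DATA statement.
        public static void BulkLoad(MySqlConnection conn, string tsvPath)
        {
            var loader = new MySqlBulkLoader(conn)
            {
                TableName = "posts",
                FileName = tsvPath,
                FieldTerminator = "\t",
                LineTerminator = "\n",
                Local = true // client sends the file; no FILE privilege needed on the server
            };
            loader.Columns.Add("id");
            loader.Columns.Add("body");
            loader.Load();
        }
    }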

Ok, I'm going to be an idiot here and answer your question with a question.

Why put it in a database?

What if... just a what-if... you wrote the XML to files on the local drive and, if needed, wrote some indexing information to the database? This should perform significantly faster than trying to load a database, and would be much more portable. All you would need on top of it is a way to search and a way to index relational references. There should be plenty of help available for searching, and the relational aspect should be easy enough to build. You might even consider rewriting the information so that each file contains a single post with all the answers and comments right there.
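A hypothetical sketch of that idea: stream through the dump once, write each post's XML fragment to its own file, and record each path in a small lookup table (the post_index table name is an assumption for illustration):

    using System.IO;
    using System.Xml;
    using MySql.Data.MySqlClient;

    static class FileSplitter
    {
        public static void SplitToFiles(string xmlPath, string outDir, MySqlConnection conn)
        {
            using (var reader = XmlReader.Create(xmlPath))
            using (var cmd = new MySqlCommand(
                "INSERT INTO post_index (post_id, path) VALUES (@id, @path)", conn))
            {
                cmd.Parameters.Add("@id", MySqlDbType.Int32);
                cmd.Parameters.Add("@path", MySqlDbType.VarChar);
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "row")
                    {
                        int id = int.Parse(reader.GetAttribute("Id"));
                        string path = Path.Combine(outDir, id + ".xml");
                        // ReadOuterXml returns this post's fragment and advances the reader.
                        File.WriteAllText(path, reader.ReadOuterXml());
                        cmd.Parameters["@id"].Value = id;
                        cmd.Parameters["@path"].Value = path;
                        cmd.ExecuteNonQuery();
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
        }
    }

One practical caveat: millions of small files in a single directory can strain the filesystem, so a real implementation would likely shard the output across subdirectories.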

Anyway, just my two cents (and that is not worth a dime).

I have a few thoughts to help speed this up...

  1. The size of the query may need to be tweaked; there is often a point where a big statement costs more in parsing time and so becomes slower. 500 may be optimal, but perhaps it is not, and you could tweak it a little (it could be more, it could be less).

  2. Go multithreaded. Assuming your system isn't already flatlined on processing, you could make some gains by breaking the data up into chunks and having threads process them. Again, finding the optimal number of threads is a matter of experimentation, but a lot of people are using multicore machines and have CPU cycles to spare.

  3. On the database front, make sure that the table is as bare as it can be. Turn off any indexes and load the data before indexing it (see the sketch after this list).
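On point 3, here is one way that could look in C#, assuming a MyISAM posts table (ALTER TABLE ... DISABLE KEYS only defers non-unique index maintenance on MyISAM, the default engine in MySQL 5.1):

    using System;
    using MySql.Data.MySqlClient;

    static class IndexToggler
    {
        static void Exec(MySqlConnection conn, string sql)
        {
            using (var cmd = new MySqlCommand(sql, conn))
                cmd.ExecuteNonQuery();
        }

        public static void LoadWithKeysDisabled(MySqlConnection conn, Action doBulkInsert)
        {
            Exec(conn, "ALTER TABLE posts DISABLE KEYS"); // defer non-unique index updates
            Exec(conn, "SET unique_checks = 0");          // skip uniqueness verification
            Exec(conn, "SET foreign_key_checks = 0");     // skip FK verification during load
            try
            {
                doBulkInsert(); // the batched INSERTs or LOAD DATA from earlier
            }
            finally
            {
                Exec(conn, "SET foreign_key_checks = 1");
                Exec(conn, "SET unique_checks = 1");
                Exec(conn, "ALTER TABLE posts ENABLE KEYS"); // rebuild indexes in one pass
            }
        }
    }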

SqlBulkCopy ROCKS. I used it to turn a 30-minute function into 4 seconds. However, this is applicable only to MS SQL Server.

Might I suggest you look at the constraints on the table you've created? If you drop all keys, constraints, etc. on the database, it will do less work on your insertions and less recursive work.

Secondly, set up the tables with big initial sizes to prevent resizes if you are inserting into a blank database.

Finally, see if there is a bulk-copy-style API for MySQL. SQL Server basically formats the data as it would go down to disk, links the stream up to the disk, and you pump in data. It then performs one consistency check for all the data instead of one per insert, dramatically improving your performance. Good luck ;)

Do you need MySQL? SQL Server makes your life easier if you are using Visual Studio and your database is low performance/size.

Does this help at all? It's a stored procedure that loads an entire XML file into a column, then parses it using XPath and creates a table / inserts the data from there. Seems kind of crazy, but it might work.

Not the answer you want, but the MySQL C API has the mysql_stmt_send_long_data function.

I noticed in one of your comments above that you are considering MSSQL, so I thought I'd post this. SQL Server has a utility called SQLXMLBulkLoad which is designed to import large amounts of XML data into a SQL Server database. Here is the documentation for the SQL Server 2008 version:

http://msdn.microsoft.com/en-us/library/ms171993.aspx

Earlier versions of SQL Server also have this utility.

In PostgreSQL, the absolute fastest way to get bulk data in is to drop all indexes and triggers, use the equivalent of MySQL's LOAD DATA, and then recreate your indexes/triggers. I use this technique to pull 5 GB of forum data into a PostgreSQL database in roughly 10 minutes.

Granted, this may not apply to MySQL, but it's worth a shot. Also, this SO question's answer suggests that this is in fact a viable strategy for MySQL.

A quick Google turned up some tips on increasing the performance of MySQL's LOAD DATA.
