
SSIS - Best way to insert a large number of rows (hundreds of millions of rows)

So here's the scenario: I have an XML file that is 500GB in size and contains around 600 million rows (once in a database table). I'm using SSIS for the operation, and since it consumes a really large amount of memory if I use an SSIS component (i.e. the XML Source), it might cause a timeout (correct me if I'm wrong, but as far as I know, using the XML Source component in SSIS loads the content of the XML into memory - with a file that big it will surely cause errors). My approach then is:

  • Use a Script Task to parse the XML data with an XmlReader (by far the best approach, since it parses the XML in a forward-only, non-cached manner)
  • Insert the data into a DataTable
  • Every 500,000 rows in the DataTable, insert its contents into the database using SqlBulkCopy, then clear the DataTable (a simplified sketch of this loop follows below)
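Roughly, the loop looks like this - a minimal sketch only, where the stage table name dbo.MFADISDCP_STAGE and the hard-coded column list are placeholders and every column is loaded as a string:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Xml;
using System.Xml.Linq;

static void LoadXml(string xmlPath, string connectionString)
{
    const int batchSize = 500000;

    // Buffer table; its columns must match the destination table.
    var table = new DataTable();
    table.Columns.Add("INVESTMENT_CODE", typeof(string));
    table.Columns.Add("DATE_OF_RECORD", typeof(string));
    // ... add the remaining columns of <ROW> here ...

    using (var reader = XmlReader.Create(xmlPath))          // forward-only, non-cached
    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.MFADISDCP_STAGE";  // placeholder stage table
        bulk.BulkCopyTimeout = 0;                           // disable the load timeout

        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "ROW")
            {
                // Materialise one <ROW> at a time; the rest of the file stays on disk.
                var rowElement = (XElement)XNode.ReadFrom(reader);
                var row = table.NewRow();
                foreach (var child in rowElement.Elements())
                {
                    if (table.Columns.Contains(child.Name.LocalName))
                        row[child.Name.LocalName] = child.Value;
                }
                table.Rows.Add(row);

                if (table.Rows.Count >= batchSize)
                {
                    bulk.WriteToServer(table);
                    table.Clear();                          // release the batch before reading on
                }
            }
            else
            {
                reader.Read();
            }
        }

        if (table.Rows.Count > 0)
            bulk.WriteToServer(table);                      // flush the final partial batch
    }
}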

My problem is that, currently, I tried it on another file around 200GB in size, and it's running at around 13.5M rows per hour - and I don't know if that run time is acceptable. It certainly solves my problem, but it's not very elegant; I mean, there should be other ways.

I'm looking at other approaches, like:

  • Dividing the large XML file into smaller CSV files (around 20GB each), then using an SSIS Data Flow task
  • Using an INSERT statement for every new row

Can you help me decide which is best? Or suggest any other solutions?

Every answer will be very much appreciated.

EDIT

I forgot to mention - my approach needs to be dynamic. I mean, there are many tables that will be populated from large XML files. So using a Script Component as the source might not be so useful, since I would still need to define the output columns. But I will still give it a try.
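One way to keep the DataTable part dynamic is to build its columns from the child elements of the first row element instead of hard-coding them - a small sketch, assuming the repeating element name is known per file and that loading everything as strings (and converting later in SQL) is acceptable:

using System.Data;
using System.Xml;
using System.Xml.Linq;

// Build the DataTable schema from the first repeating element of the file.
// Assumes every row element has the same set of children.
static DataTable BuildTableFromFirstRow(string xmlPath, string rowElementName)
{
    var table = new DataTable();
    using (var reader = XmlReader.Create(xmlPath))
    {
        if (reader.ReadToFollowing(rowElementName))
        {
            var firstRow = (XElement)XNode.ReadFrom(reader);
            foreach (var child in firstRow.Elements())
                table.Columns.Add(child.Name.LocalName, typeof(string));
        }
    }
    return table;
}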

EDIT 2015-07-28

The file comes from our client, and we can't do anything about what format they choose to send us. XML, that's it. Here is a sample of the XML I am consuming:

<?xml version="1.0" encoding="UTF-8"?>
<MFADISDCP>
  <ROW>
    <INVESTMENT_CODE>DATA</INVESTMENT_CODE>
    <DATE_OF_RECORD>DATA</DATE_OF_RECORD>
    <CAPITAL_GAIN_DISTR_RATE>DATA</CAPITAL_GAIN_DISTR_RATE>
    <INCOME_DISTR_RATE>DATA</INCOME_DISTR_RATE>
    <DISTR_PAYMENT_DATE>DATA</DISTR_PAYMENT_DATE>
    <CURRENCY>DATA</CURRENCY>
    <CONFIRM>DATA</CONFIRM>
    <EXPECTED_DISTRIBUTION_AMOUNT>DATA</EXPECTED_DISTRIBUTION_AMOUNT>
    <KEYING_STATUS>DATA</KEYING_STATUS>
    <DAF_RATE>DATA</DAF_RATE>
    <INCOME_START_DATE>DATA</INCOME_START_DATE>
    <ALLOCABLE_END_DATE>DATA</ALLOCABLE_END_DATE>
    <TRADE_DATE>DATA</TRADE_DATE>
    <OVR_CAPITAL_GAIN_DISTR_OPTION>DATA</OVR_CAPITAL_GAIN_DISTR_OPTION>
    <OVR_INCOME_DISTR_OPTION>DATA</OVR_INCOME_DISTR_OPTION>
    <BACKDATED_DISTRIBUTION>DATA</BACKDATED_DISTRIBUTION>
    <DATE_MODIFIED>DATA</DATE_MODIFIED>
  </ROW>
<!--AROUND 49M+ OF THESE ROWS-->
</MFADISDCP>

If I were to do this then I would break it down into the following tasks:

  1. Convert the XML file into a (tab or comma) delimited file. If your server has fast disks (SSD) then this should be very quick. Be careful of strings in your data that may contain special characters that could break the delimiter format. Don't use the DataTable object, as it is slow. You can stream this so that you don't need to hold the whole file in memory at once (unless your server has several hundred gigs of memory). A sketch of such a converter is shown after this list.
  2. Truncate the stage table in your database that you will use to load the data into.
  3. Use SQL Server's bcp.exe to push the delimited file into the stage table in your database. This is probably the fastest way to get a large amount of data into a database. A problem with this is that if it fails, it is very hard to find which row of data caused the failure.
  4. Delete the delimited files, as you don't need them lying around taking up lots of space.
  5. Create a SQL stored procedure to move the data from the stage table to wherever you will be using it.
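For step 1, a rough sketch of a streaming XML-to-TSV converter - the column list, element name and file paths are placeholders, and a real version would need more thorough escaping than shown here:

using System.IO;
using System.Xml;
using System.Xml.Linq;

// Stream <ROW> elements out of the XML and write one tab-delimited line per row.
static void XmlToTsv(string xmlPath, string tsvPath, string[] columns)
{
    using (var reader = XmlReader.Create(xmlPath))
    using (var writer = new StreamWriter(tsvPath))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "ROW")
            {
                var row = (XElement)XNode.ReadFrom(reader);   // one row in memory at a time
                var values = new string[columns.Length];
                for (int i = 0; i < columns.Length; i++)
                {
                    // Strip embedded tabs/newlines so they don't break the delimited format.
                    var value = (string)row.Element(columns[i]) ?? "";
                    values[i] = value.Replace("\t", " ").Replace("\r", " ").Replace("\n", " ");
                }
                writer.WriteLine(string.Join("\t", values));
            }
            else
            {
                reader.Read();
            }
        }
    }
}

// Step 3 would then be something along the lines of (server/database names are placeholders):
//   bcp StageDb.dbo.MFADISDCP_STAGE in rows.tsv -S myserver -T -c -b 500000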

You could use SSIS Script Tasks for this, or you could write your own stand-alone service.

Note, this is all theoretical; there may be better ways of doing it, but this may be a good starting point for finding out where your bottlenecks are.
