ETL process with Python and SQL Server taking a really long time to load

I'm looking for a technique to speed up loading a CSV file into a SQL Server database. I've attempted various approaches, but nothing I do seems to be able to break the 5.5 hour barrier, and that's just from testing a load of one year of data, which is about 2 million records. I eventually have 20 years of data to load, so loading data for 4 days straight isn't going to work.

The challenge is that the data has to be enriched on load. I have to add some columns, because that information isn't native to the file. So far I've tried:

  1. Using petl to append columns to the data and then flush that to the database.
  2. Using pandas to append columns to the data and then flushing the data frame to the database (a rough sketch of this appears after the list).
  3. Using bulk load to load an intermediary staging table, then using T-SQL to populate the extra columns, and then pushing that on to a final staging table.
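
For reference, approach 2 looked roughly like the sketch below. This is only a minimal illustration, assuming pandas, SQLAlchemy and pyodbc are installed; the file path, DSN and exchange value are placeholders, and fast_executemany=True requires the pyodbc driver:

import pandas as pd
from sqlalchemy import create_engine

# fast_executemany batches the per-row INSERTs that to_sql would otherwise issue
# (hostless DSN form; replace [dsn name] with the ODBC DSN used elsewhere here)
engine = create_engine("mssql+pyodbc://@[dsn name]", fast_executemany=True)

df = pd.read_csv("[file path]")
df["Exchange"] = "NYSE"              # placeholder; the real exchange is a parameter
df["SourceSystem"] = "EODData"
df["RunDate"] = pd.Timestamp.now()

# chunksize keeps memory bounded while appending to the staging table
df.to_sql("stage_table", engine, if_exists="append", index=False, chunksize=10000)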

Bulk load works REALLY fast, but then I have to add the data for the extra columns, and we're back to row-level operations, which I think is the bottleneck here. I'm getting ready to try the following (a rough sketch appears after the list):

  1. Appending the data with Pandas.
  2. Writing the data back out to a CSV.
  3. Bulk loading the CSV.
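
A minimal sketch of that plan, assuming pandas and the same DSN as the bulk-insert snippet further down; the file paths and the exchange value are placeholders:

import pandas as pd
import pypyodbc

df = pd.read_csv("source.csv")
df["Exchange"] = "NYSE"              # placeholder; the real exchange is a parameter
df["SourceSystem"] = "EODData"
df["RunDate"] = pd.Timestamp.now()

# write without header or index so BULK INSERT maps the columns positionally
df.to_csv("enriched.csv", index=False, header=False)

# note: BULK INSERT resolves the file path on the SQL Server machine, not the client
cnxn = pypyodbc.connect("DSN=[dsn name];")
crsr = cnxn.cursor()
crsr.execute("BULK INSERT stage_table FROM 'enriched.csv' "
             "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')")
cnxn.commit()
crsr.close()
cnxn.close()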

This bothers me because I now have two I/O operations: reading the file into pandas and writing the file back out again.

I've read that pandas is largely written in C, so it should be really fast, but flushing a dataframe to the database wasn't that fast. At this point I'm asking if anybody has a faster approach that they use in the real world. So far, what I have is below:

import pypyodbc

# DSN-based ODBC connection (the DSN is configured on the machine running this script)
conn_str = "DSN=[dsn name];"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()

# BULK INSERT runs server-side, so '[file path]' must be visible to the SQL Server host.
# The backslash is escaped so SQL Server receives the literal '\n' row terminator.
sql = ("BULK INSERT pre_stage_view FROM '[file path]' "
       "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')")
crsr.execute(sql)
cnxn.commit()

crsr.close()
cnxn.close()

This is the stored procedure that gets rid of the header row and then populates the extra columns:

-- the CSV header row is bulk-loaded as data; remove it before the enrichment insert
DELETE FROM pre_stage_table WHERE Symbol = 'Symbol'


-- set-based enrichment: copy the raw rows and fill in the columns the file doesn't have
INSERT INTO stage_table(
[Symbol],
[Exchange],
[Date],
[Open],
[High],
[Low],
[Close],
[Volume],
[SourceSystem],
[RunDate]
)
SELECT
[Symbol],
@exchange, --passed in proc parameter
[Date],
[Open],
[High],
[Low],
[Close],
[Volume],
'EODData',
CURRENT_TIMESTAMP
FROM pre_stage_table


-- empty the pre-stage table so the next file starts from a clean slate
TRUNCATE TABLE pre_stage_table

Bulk load works REALLY fast, but then I have to add the data for the extra columns, and we're back to row-level operations, which I think is the bottleneck here.

Sorry, but I do not understand why you have row-level operations. Try:

1) bulk load to stage table

2) MERGE stage table with target table (a rough sketch of this follows)

You will still get a set-based approach with presumably decent performance. Remember to disable triggers on the target (if possible), and you can also drop indexes, load the data, and rebuild them afterwards.
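
Driven from Python with the same pypyodbc connection as in the question, step 2 could look roughly like this. The join keys (Symbol, Date), the table names, and the exchange value are assumptions based on the question's schema, not something prescribed here:

import pypyodbc

merge_sql = """
MERGE stage_table AS tgt
USING pre_stage_table AS src
    ON  tgt.[Symbol] = src.[Symbol]
    AND tgt.[Date] = src.[Date]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Symbol], [Exchange], [Date], [Open], [High], [Low], [Close],
            [Volume], [SourceSystem], [RunDate])
    VALUES (src.[Symbol], ?, src.[Date], src.[Open], src.[High], src.[Low],
            src.[Close], src.[Volume], 'EODData', CURRENT_TIMESTAMP);
"""

cnxn = pypyodbc.connect("DSN=[dsn name];")
crsr = cnxn.cursor()
crsr.execute(merge_sql, ["NYSE"])    # the exchange is a parameter, as in the stored procedure
cnxn.commit()
crsr.close()
cnxn.close()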
