ETL process with Python and SQL Server taking a really long time to load
I'm looking for a technique that will increase the performance of a CSV file to SQL Server database load process. I've attempted various approaches but nothing I do seems to be able to break the 5.5 hour barrier. That's just testing the load of one year of data, which is about 2 million records. I eventually have 20 years of data to load, so loading data for 4 days straight isn't going to work.
The challenge is that the data has to be enriched on load. I have to add some columns because that information isn't native to the file. So far I've tried:
Bulk load works REALLY fast, but then I have to add the data for the extra columns and we're back to row-level operations, which I think is the bottleneck here. I'm getting ready to try:
This bothers me because I now have two I/O operations: read the file into pandas and write the file back out again.
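For reference, a minimal sketch of that pandas pass is below. The paths and the exchange value are placeholders, and it assumes the table being bulk loaded has columns for the enriched fields; the added columns mirror the ones the stored procedure fills in.

import pandas as pd

# Read the raw EOD file into memory (placeholder path).
df = pd.read_csv("[file path]")

# Enrich in memory instead of row by row in the database;
# the exchange value here is a placeholder.
df["Exchange"] = "[exchange]"
df["SourceSystem"] = "EODData"
df["RunDate"] = pd.Timestamp.now()

# Second I/O pass: write the enriched file back out so BULK INSERT
# can load it straight into a table that already has all the columns.
df.to_csv("[file path]_enriched.csv", index=False)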
I read somewhere that pandas was written in C or something, so it should be really fast. Flushing a dataframe to the database wasn't that fast.
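(For reference, the kind of dataframe flush I mean is sketched below. The fast_executemany flag is something I have not actually tried yet; it needs pyodbc and SQLAlchemy rather than pypyodbc, and the DSN, path, and exchange value are placeholders.)

import pandas as pd
from sqlalchemy import create_engine

# Placeholder DSN; fast_executemany requires pyodbc (not pypyodbc),
# and SQLAlchemy 1.3+ exposes it directly on the engine.
engine = create_engine("mssql+pyodbc://@[dsn name]", fast_executemany=True)

df = pd.read_csv("[file path]")
df["Exchange"] = "[exchange]"
df["SourceSystem"] = "EODData"
df["RunDate"] = pd.Timestamp.now()

# Append in chunks so one huge batch doesn't exhaust memory or the transaction log.
df.to_sql("pre_stage_table", engine, if_exists="append", index=False, chunksize=10000)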
At this point, I'm asking if anybody has a faster approach that they use in the real world. So far what I have is below:
import pypyodbc

# Connect through the existing DSN (placeholder name).
conn_str = "DSN=[dsn name];"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()

# BULK INSERT the raw file into the pre-stage view/table.
# Note: in a regular Python string '\n' is sent to SQL Server as a literal
# newline character, which works as the row terminator for LF-delimited files.
sql = "BULK INSERT pre_stage_view FROM '[file path]' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')"
crsr.execute(sql)
cnxn.commit()

crsr.close()
cnxn.close()
This is the stored procedure that gets rid of the header row, enriches the data, and copies it into the stage table:
-- remove the header row that BULK INSERT picked up from the file
DELETE FROM pre_stage_table WHERE Symbol = 'Symbol';
INSERT INTO stage_table(
[Symbol],
[Exchange],
[Date],
[Open],
[High],
[Low],
[Close],
[Volume],
[SourceSystem],
[RunDate]
)
SELECT
[Symbol],
@exchange, --passed in proc parameter
[Date],
[Open],
[High],
[Low],
[Close],
[Volume],
'EODData',
CURRENT_TIMESTAMP
FROM pre_stage_table;

-- clear the pre-stage table ready for the next file
TRUNCATE TABLE pre_stage_table;
Bulk load works REALLY fast but then I have to add the data for the extra columns and we're back to row level operations which I think is the bottleneck here.
Sorry, but I do not understand why you have row-level operations. Try:
1) bulk load to stage table
2) MERGE stage table with target table
You will still get a set-based approach with presumably decent performance; a rough sketch of the MERGE is below. Remember to disable triggers (if possible on the target), and you may also drop indexes, load the data, and rebuild them afterwards.
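To make step 2 concrete, here is a minimal sketch of how the MERGE could be run from the same pypyodbc connection used in the question. The target table name, the join key, and the column list are assumptions based on the stage table shown above, not exact code:

import pypyodbc

cnxn = pypyodbc.connect("DSN=[dsn name];")
crsr = cnxn.cursor()

# Hypothetical target table and join key; adjust to the real schema.
merge_sql = """
MERGE target_table AS tgt
USING stage_table AS src
    ON  tgt.Symbol = src.Symbol
    AND tgt.[Date] = src.[Date]
WHEN MATCHED THEN
    UPDATE SET tgt.[Open]  = src.[Open],
               tgt.[High]  = src.[High],
               tgt.[Low]   = src.[Low],
               tgt.[Close] = src.[Close],
               tgt.[Volume] = src.[Volume]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([Symbol], [Exchange], [Date], [Open], [High], [Low], [Close], [Volume], [SourceSystem], [RunDate])
    VALUES (src.[Symbol], src.[Exchange], src.[Date], src.[Open], src.[High], src.[Low], src.[Close],
            src.[Volume], src.[SourceSystem], src.[RunDate]);
"""

# Optional: disable triggers on the target while loading, then re-enable afterwards.
# crsr.execute("ALTER TABLE target_table DISABLE TRIGGER ALL")
crsr.execute(merge_sql)
cnxn.commit()
crsr.close()
cnxn.close()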