Insert data into Postgresql from Python
Is ≈105 seconds per 1 million rows slow or fast for inserting into a local PostgreSQL database, on a table with 2 indexes and 4 columns?
Python code:
import os
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import create_engine

num = 32473068
batch = 1000000

def main(data):
    engine = create_engine('postgresql://***:****' + host + ':5432/kaggle')
    data.to_sql(con=engine, name=tbl_name, if_exists='append', index=False)

for i in range(0, num, batch):
    data = pd.read_csv(data_path + 'app_events.csv', skiprows=i, nrows=batch)
    data.columns = ['event_id', 'app_id', 'is_installed', 'is_active']
    data = data.reset_index(drop=True)
    batchSize = 10000
    batchList = [data.iloc[x:x + batchSize].reset_index(drop=True)
                 for x in range(0, len(data), batchSize)]
    with ThreadPoolExecutor(max_workers=30) as executor:
        future_to_url = {executor.submit(main, d): d for d in batchList}
        for k, future in enumerate(as_completed(future_to_url)):
            url = future_to_url[future]
It depends on your hardware too. As a reference, my old i5 laptop with an HDD takes ~300 s to insert 0.1M rows (roughly 200-300 megabytes).
I learned from other similar questions that splitting big values into batches when using the insert() command can speed things up. Since you're using Pandas, I assume it already has some optimization, but I suggest you run a quick test to see whether batching helps here too.
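The batching idea above can be sketched with a small helper; `chunked` and the batch size of 3 are illustrative names and values, not from the original post:

```python
def chunked(rows, size):
    """Yield successive batches of at most `size` rows.

    Each batch would then be sent in a single multi-row INSERT
    (or executemany call) instead of one statement per row.
    """
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Ten sample rows split into batches of three: sizes 3, 3, 3, 1.
rows = [(i, i * 2) for i in range(10)]
batches = list(chunked(rows, 3))
```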
Pandas actually uses a non-optimized insert command. See ( to_sql + sqlalchemy + copy from + postgresql engine? ). So bulk insert or other methods should be used to improve performance.
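One common alternative is to stream the DataFrame through PostgreSQL's COPY command. This is a sketch assuming a live psycopg2 connection; `dataframe_to_buffer` and `copy_dataframe` are hypothetical helper names, not part of pandas or psycopg2:

```python
import io

import pandas as pd

def dataframe_to_buffer(df):
    """Serialize a DataFrame to an in-memory CSV buffer suitable for COPY."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    return buf

def copy_dataframe(df, conn, table):
    """Load df into `table` via COPY (needs a psycopg2 connection).

    COPY bypasses per-row INSERT parsing and is typically much faster
    for bulk loads.
    """
    buf = dataframe_to_buffer(df)
    with conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()

# Demo of the serialization step only (no database required):
df = pd.DataFrame({'event_id': [1, 2], 'app_id': [10, 20]})
buf = dataframe_to_buffer(df)
```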
SQLAlchemy 1.2 uses bulk insert when you initialize your engine with the use_batch_mode=True parameter. I saw a 100x speedup on my i5 + HDD laptop! That is, with 0.1M records it originally took me 300 s and now takes 3 s! If your computer is better than mine, I bet you'll see this tremendous speedup with your 1M records.
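A minimal sketch of that engine initialization, assuming SQLAlchemy 1.2/1.3 with the psycopg2 driver (in SQLAlchemy 1.4+ the equivalent option is the dialect's executemany_mode); the DSN is a placeholder:

```python
from sqlalchemy import create_engine

# use_batch_mode=True makes the psycopg2 dialect batch its executemany
# calls (SQLAlchemy 1.2/1.3; newer versions use executemany_mode instead).
engine = create_engine(
    'postgresql://user:password@localhost:5432/kaggle',  # placeholder DSN
    use_batch_mode=True,
)

# Then, as in the question:
# data.to_sql(con=engine, name='app_events', if_exists='append', index=False)
```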