
Insert data into Postgresql from Python

Is ≈105 seconds per 1 million rows slow or fast for inserting into a local Postgresql database, on a table with 2 indexes and 4 columns?

Python code:

import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import create_engine

num = 32473068      # total number of rows in the CSV
batch = 1000000     # rows read from the CSV per iteration

def main(data):
    # each worker opens its own engine and appends its chunk to the table
    engine = create_engine('postgresql://***:***@' + host + ':5432/kaggle')
    data.to_sql(tbl_name, con=engine, if_exists='append', index=False)

for i in range(0, num, batch):
    # note: with the default header handling, every chunk after the first
    # consumes its first data row as a header line
    data = pd.read_csv(data_path + 'app_events.csv', skiprows=i, nrows=batch)
    data.columns = ['event_id', 'app_id', 'is_installed', 'is_active']
    data = data.reset_index(drop=True)
    batchSize = 10000   # rows per thread-level sub-batch
    batchList = [data.iloc[x:x + batchSize].reset_index(drop=True)
                 for x in range(0, len(data), batchSize)]
    with ThreadPoolExecutor(max_workers=30) as executor:
        futures = {executor.submit(main, d): d for d in batchList}
        for future in as_completed(futures):
            future.result()   # re-raise any exception from the worker threads

It depends on your hardware too. As a reference, my old i5 laptop with an HDD takes ~300 s to insert 0.1M rows (roughly 200-300 megabytes).

I learned from other similar questions that splitting large inserts into smaller bulks when using the insert() command can speed things up. Since you're using Pandas, I assume it already has some optimization, but I suggest you run a quick test to see whether it helps too; a sketch of such a test follows below.
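A minimal sketch of what that test could look like. The connection string and the table name app_events are placeholders, not values from the question; chunksize and method='multi' are standard Pandas to_sql options that split the DataFrame into batches and pack many rows into each INSERT statement:

from sqlalchemy import create_engine
import pandas as pd

# placeholder connection string; replace user/password/host with real values
engine = create_engine('postgresql://user:password@localhost:5432/kaggle')

data = pd.read_csv('app_events.csv', nrows=1000000)
data.columns = ['event_id', 'app_id', 'is_installed', 'is_active']

# chunksize splits the DataFrame into 10k-row batches; method='multi'
# packs many rows into each INSERT instead of one row per statement
data.to_sql('app_events', con=engine, if_exists='append', index=False,
            chunksize=10000, method='multi')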

  • Pandas actually uses a non-optimized INSERT command by default. See ( to_sql + sqlalchemy + copy from + postgresql engine? ). So bulk insert or other methods such as COPY should be used to improve performance; a COPY-based sketch follows after this list.

  • SQLAlchemy 1.2 uses bulk inserts when you initialize your engine with the "use_batch_mode=True" parameter. I saw a 100x speedup on my i5 + HDD laptop! Meaning, with 0.1M records it originally took me 300 s, and now it takes 3 s. If your computer is better than mine, I bet you'd see a similarly large speedup with your 1M records. A sketch of that engine setup also follows below.
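For the COPY route mentioned in the first bullet, one option (a sketch of the pattern discussed in the linked question and in the Pandas documentation, assuming Pandas >= 0.24 and the psycopg2 driver) is to pass a custom insertion method to to_sql:

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # custom to_sql insertion method: stream each chunk through
    # Postgres COPY ... FROM STDIN instead of row-by-row INSERTs
    dbapi_conn = conn.connection              # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)  # serialize the chunk as CSV in memory
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(name, columns), buf)

# usage: data.to_sql(tbl_name, con=engine, if_exists='append', index=False,
#                    method=psql_insert_copy)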

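And a minimal sketch of the engine setup from the second bullet. use_batch_mode=True is the SQLAlchemy 1.2 flag for the psycopg2 dialect (later SQLAlchemy versions replaced it with the executemany_mode parameter); the credentials are placeholders:

from sqlalchemy import create_engine

# SQLAlchemy 1.2: turn on psycopg2's batched executemany for inserts
engine = create_engine(
    'postgresql+psycopg2://user:password@localhost:5432/kaggle',  # placeholder credentials
    use_batch_mode=True,
)
# then: data.to_sql(tbl_name, con=engine, if_exists='append', index=False)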