
Insert data into Postgresql from Python

Is ≈105 seconds per 1 million rows slow or fast for inserting into a local Postgresql database, on a table with 2 indexes and 4 columns?

Python code:

import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import create_engine

num = 32473068      # total number of rows in the CSV
batch = 1000000     # rows read from the CSV per iteration

def main(data):
    # each worker opens its own engine and appends its chunk to the table
    engine = create_engine('postgresql://***:***@' + host + ':5432/kaggle')
    data.to_sql(tbl_name, con=engine, if_exists='append', index=False)

for i in range(0, num, batch):
    # note: with the default header handling, every chunk after the first
    # consumes its first data row as a header line
    data = pd.read_csv(data_path + 'app_events.csv', skiprows=i, nrows=batch)
    data.columns = ['event_id', 'app_id', 'is_installed', 'is_active']
    data = data.reset_index(drop=True)
    batchSize = 10000   # rows per thread-level sub-batch
    batchList = [data.iloc[x:x + batchSize].reset_index(drop=True)
                 for x in range(0, len(data), batchSize)]
    with ThreadPoolExecutor(max_workers=30) as executor:
        futures = {executor.submit(main, d): d for d in batchList}
        for future in as_completed(futures):
            future.result()   # re-raise any exception from the worker threads

It depends on your hardware too. As a reference, my old i5 laptop with an HDD takes ~300 s to insert 0.1M rows (roughly 200-300 megabytes).

I learned from other similar questions that splitting large inserts into smaller bulks when using the insert() command can speed things up. Since you're using Pandas, I assume it already has some optimization, but I suggest you run a quick test to see whether it helps too; a sketch of such a test follows below.
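A minimal sketch of what that test could look like. The connection string and the table name app_events are placeholders, not values from the question; chunksize and method='multi' are standard Pandas to_sql options that split the DataFrame into batches and pack many rows into each INSERT statement:

from sqlalchemy import create_engine
import pandas as pd

# placeholder connection string; replace user/password/host with real values
engine = create_engine('postgresql://user:password@localhost:5432/kaggle')

data = pd.read_csv('app_events.csv', nrows=1000000)
data.columns = ['event_id', 'app_id', 'is_installed', 'is_active']

# chunksize splits the DataFrame into 10k-row batches; method='multi'
# packs many rows into each INSERT instead of one row per statement
data.to_sql('app_events', con=engine, if_exists='append', index=False,
            chunksize=10000, method='multi')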

  • Pandas actually uses a non-optimized INSERT command by default. See ( to_sql + sqlalchemy + copy from + postgresql engine? ). So bulk insert or other methods such as COPY should be used to improve performance; a COPY-based sketch follows after this list.

  • SQLAlchemy 1.2 uses bulk inserts when you initialize your engine with the "use_batch_mode=True" parameter. I saw a 100x speedup on my i5 + HDD laptop! Meaning, with 0.1M records it originally took me 300 s, and now it takes 3 s. If your computer is better than mine, I bet you'd see a similarly large speedup with your 1M records. A sketch of that engine setup also follows below.
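For the COPY route mentioned in the first bullet, one option (a sketch of the pattern discussed in the linked question and in the Pandas documentation, assuming Pandas >= 0.24 and the psycopg2 driver) is to pass a custom insertion method to to_sql:

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # custom to_sql insertion method: stream each chunk through
    # Postgres COPY ... FROM STDIN instead of row-by-row INSERTs
    dbapi_conn = conn.connection              # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)  # serialize the chunk as CSV in memory
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(name, columns), buf)

# usage: data.to_sql(tbl_name, con=engine, if_exists='append', index=False,
#                    method=psql_insert_copy)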

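And a minimal sketch of the engine setup from the second bullet. use_batch_mode=True is the SQLAlchemy 1.2 flag for the psycopg2 dialect (later SQLAlchemy versions replaced it with the executemany_mode parameter); the credentials are placeholders:

from sqlalchemy import create_engine

# SQLAlchemy 1.2: turn on psycopg2's batched executemany for inserts
engine = create_engine(
    'postgresql+psycopg2://user:password@localhost:5432/kaggle',  # placeholder credentials
    use_batch_mode=True,
)
# then: data.to_sql(tbl_name, con=engine, if_exists='append', index=False)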