Python for loop over a CSV with concurrency

Question

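I'm struggling with a large file of 80,000+ lines that I have to store in my database. Pushing all of it to my MySQL database takes 20-30 minutes. I have a simple for loop that just iterates over the whole CSV.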

import csv
import MySQLdb

# open the connection to the MySQL server.
# using MySQLdb
mydb = MySQLdb.connect(host='hst', user='usr', passwd='pwd', db='db')
cursor = mydb.cursor()
with open('product_de.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    # execute and insert the csv into the database.
    for row in csv_reader:
        if "PVP_BIG" and "DATE_ADD" in row:
            print "First line removed"
        else:
            print "Not found!"
            sql = "INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
            val = (row[0], row[1],row[3],row[4], row[5],row[6], row[8], row[10], row[11], row[12], row[15], row[16], row[17], row[18], row[19], row[20])
            cursor.execute(sql, val)
            print row
#close the connection to the database.
#mydb.commit()
cursor.close()
print "CSV has been imported into the database"

Is there any way I can split this up to run concurrently, so that it takes maybe 3-5 minutes, depending on the computer hardware?

Answer 1

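First, you can get a big speedup by removing print(row) from the inner loop. Everything else in the program waits on that call, and it is an I/O operation that can take much longer than you would think. Second, you will probably see a significant speedup from batching the INSERT statements, i.e. inserting more than one row at a time, say 100 or so. Third, the best approach probably involves asyncio, but I don't have much experience with it. You are likely I/O bound both when talking to the DB and when reading data from the csv file, and you never do both at the same time, so I would go with a simple two-thread solution like this: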

import csv
import MySQLdb
from threading import Thread
from queue import Queue


def row_insert_thread(q: Queue, cursor, mydb):
    # consume (sql, values) tuples until the None sentinel arrives
    while True:
        command = q.get()
        if command is None:
            # commit so the inserted rows are persisted (needed unless autocommit is on)
            mydb.commit()
            cursor.close()
            break
        cursor.execute(*command)


mydb = MySQLdb.connect(host='hst', user='usr', passwd='pwd', db='db')
cursor = mydb.cursor()

insert_q = Queue()

row_thread = Thread(target=row_insert_thread, args=(insert_q, cursor, mydb))
row_thread.start()

# the statement never changes, so build it once outside the loop
sql = "INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"

with open('product_de.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    next(csv_reader)  # skip the header row; I'm assuming there is only one
    for row in csv_reader:
        val = (row[0], row[1],row[3],row[4], row[5],row[6], row[8], row[10], row[11], row[12], row[15], row[16], row[17], row[18], row[19], row[20])
        # hand the row to the insert thread instead of executing it here
        insert_q.put((sql, val))

# tell the insert thread there are no more rows, then wait for it to finish
insert_q.put(None)
row_thread.join()

print("CSV has been imported into the database")

As for the batched insert statement, I'm not used to MySQL, but going by my SQLite experience, I think this would work:

def insert_multiple_rows(cursor, rows: list):
    # repeat the 16-placeholder group once per row, then drop the trailing comma
    sql = f"INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES {'(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s),'*len(rows)}"[:-1]
    # flatten the list of rows into one flat list of values, in row order
    args = [col for row in rows for col in row]
    cursor.execute(sql, args)
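To check what a multi-row statement of this kind looks like without touching a database, here is a small sketch that only builds the SQL string and the flattened argument list. The helper name build_multi_insert and the two-column example are hypothetical, chosen just to keep the output short.

```python
def build_multi_insert(table, columns, rows):
    """Return (sql, args) for one multi-row INSERT using %s placeholders.

    `table` and `columns` are interpolated directly into the SQL text,
    so they must come from trusted code, never from user input.
    """
    # one "(%s, %s, ...)" group per row
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    placeholders = ", ".join([one_row] * len(rows))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES {placeholders}"
    # flatten the rows so the values line up with the placeholders
    args = [value for row in rows for value in row]
    return sql, args


sql, args = build_multi_insert("big", ["SKU", "Price"], [("A1", 9.99), ("B2", 4.5)])
print(sql)   # INSERT INTO big (SKU, Price) VALUES (%s, %s), (%s, %s)
print(args)  # ['A1', 9.99, 'B2', 4.5]
```

The number of placeholders must equal the length of the flattened list, which is why the rows are flattened in row order rather than passed as a list of tuples.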

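I hope you can integrate this into your code. If you want to use it, just change the thread to accept a list; then, in the main loop, append the values to a list until it reaches whatever size you want or you run out of rows, and put that list on insert_q.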
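A minimal sketch of that batching idea, under a few assumptions: the names chunked, batch_insert_worker, import_rows and RecordingCursor are hypothetical, and the sketch uses the DB-API executemany() call for each batch rather than a hand-built multi-row statement. RecordingCursor is only a stand-in so the example runs without a database; with MySQLdb you would pass the real cursor instead.

```python
import threading
from itertools import islice
from queue import Queue


def chunked(rows, size):
    """Yield lists of at most `size` rows until `rows` is exhausted."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def batch_insert_worker(q, cursor, sql):
    """Run one executemany() per batch until the None sentinel arrives."""
    while True:
        batch = q.get()
        if batch is None:
            break
        cursor.executemany(sql, batch)


def import_rows(rows, cursor, sql, batch_size=100):
    """Feed `rows` to a worker thread in batches of `batch_size`."""
    q = Queue()
    worker = threading.Thread(target=batch_insert_worker, args=(q, cursor, sql))
    worker.start()
    for batch in chunked(rows, batch_size):
        q.put(batch)
    q.put(None)  # tell the worker there is nothing left
    worker.join()


class RecordingCursor:
    """Hypothetical stand-in for a real cursor; it only records batch sizes."""

    def __init__(self):
        self.batch_sizes = []

    def executemany(self, sql, seq_of_params):
        self.batch_sizes.append(len(seq_of_params))


demo_cursor = RecordingCursor()
import_rows([(i, i * 2) for i in range(250)], demo_cursor,
            "INSERT INTO demo (a, b) VALUES (%s, %s)")
print(demo_cursor.batch_sizes)  # [100, 100, 50]
```

With a real connection, the main thread would call mydb.commit() after import_rows returns. Whether this beats a hand-built multi-row INSERT depends on how the driver implements executemany(), so it is worth timing both on a sample of the file.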

Solution 1
0 · Accepted · 2020-11-17 17:04:35
