
Python for loop csv concurrency

I keep running into large files (80,000+ lines) that I have to store in my database. Pushing one of them to my MySQL database takes 20-30 minutes. I have a simple for loop that just iterates over the whole CSV.

import csv
import MySQLdb

# open the connection to the MySQL server.
# using MySQLdb
mydb = MySQLdb.connect(host='hst', user='usr', passwd='pwd', db='db')
cursor = mydb.cursor()
with open('product_de.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    # execute and insert the csv into the database.
    for row in csv_reader:
        if "PVP_BIG" in row and "DATE_ADD" in row:
            print("First line removed")
        else:
            print("Not found!")
            sql = "INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
            val = (row[0], row[1], row[3], row[4], row[5], row[6], row[8], row[10], row[11], row[12], row[15], row[16], row[17], row[18], row[19], row[20])
            cursor.execute(sql, val)
            print(row)

# commit the inserts and close the connection to the database.
mydb.commit()
cursor.close()
print("CSV has been imported into the database")

Is there any method to divide the work and make it concurrent, so that it takes maybe 3-5 minutes depending on the hardware?

First, you may get a big speedup by removing the `print(row)` from your inner loop. Everything else in the program waits on this action, and it is an I/O action that can take much longer than you might think. Secondly, you might find a significant speedup by batching your INSERT statements, i.e. inserting more than one row at a time, say 100 or so. Thirdly, the best way to do this probably involves asyncio, but I don't have much experience with it. You're likely I/O bound both talking to the DB and reading the CSV file, and you never do both at once, so I'd go with a simple two-thread solution like the one below:

import csv
import MySQLdb
from threading import Thread
from queue import Queue

def row_insert_thread(q: Queue, cursor, mydb):
    while True:
        command = q.get()
        if command is None:
            mydb.commit()
            cursor.close()
            break
        cursor.execute(*command)

mydb = MySQLdb.connect(host='hst', user='usr', passwd='pwd', db='db')
cursor = mydb.cursor()
        
insert_q = Queue()

row_thread = Thread(target=row_insert_thread, args=(insert_q, cursor, mydb))
row_thread.start()


with open('product_de.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    # queue each row for insertion by the worker thread.
    next(csv_reader)  # skip the header row; I'm assuming there is only one
    for row in csv_reader:
        sql = "INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
        val = (row[0], row[1],row[3],row[4], row[5],row[6], row[8], row[10], row[11], row[12], row[15], row[16], row[17], row[18], row[19], row[20])
        insert_q.put((sql, val))

# signal the worker to finish, then wait for it.

insert_q.put(None)
row_thread.join()

print("CSV has been imported into the database")

    

For the insert statement: I'm not used to MySQL (I'm going from sqlite experience here), but I think this will work:

def insert_multiple_rows(cursor, rows: list):
    # build one INSERT with a "(%s, ...)" group per row, dropping the trailing comma
    sql = f"INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES {'(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s),' * len(rows)}"[:-1]
    # flatten the list of row tuples into one flat parameter list
    args = [col for row in rows for col in row]
    cursor.execute(sql, args)
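As an alternative to building one long statement, MySQLdb's `cursor.executemany` runs a single parameterized INSERT over many value tuples. A sketch, untested against a live server; `chunks` and `pick_columns` are my own helper names, not part of the question's code:

```python
# Sketch: batching rows before insertion. cursor.executemany() is the
# DB-API call for running one statement with many parameter tuples.
from itertools import islice

def chunks(iterable, size=100):
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# column list taken from the question; 16 placeholders, one per column
SQL = ("INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,"
       "Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) "
       "VALUES (" + ", ".join(["%s"] * 16) + ")")

# usage against the question's cursor (pick_columns would select the same
# row indices as the question's `val` tuple):
# for batch in chunks(csv_reader, 100):
#     cursor.executemany(SQL, [pick_columns(row) for row in batch])
```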

I expect you can integrate this into your code. If you want to use it, just change the thread to take a list, then in the main loop append values to a list until it reaches whatever batch size you want (or you run out of rows), and then put the list on `insert_q`.
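To sketch that main-loop change (the name `enqueue_in_batches` is my own, and the queue/sentinel wiring is assumed to match the two-thread code above):

```python
# Hypothetical helper: accumulate value tuples and hand them to the worker
# queue in batches, ending with a None sentinel so the thread shuts down.
from queue import Queue

def enqueue_in_batches(rows, q, batch_size=100):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            q.put(batch)   # worker would call insert_multiple_rows(cursor, batch)
            batch = []
    if batch:              # flush the final partial batch
        q.put(batch)
    q.put(None)            # sentinel: tells the worker to finish and exit
```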
