
How can I insert rows into a MySQL table faster using Python?

I am trying to find a faster method to insert data into my table. The table should end up with over 100 million rows; I have been running my code for nearly 24 hours, and the table currently has only 9 million rows and is still in progress.

My code currently reads 300 CSV files at a time and stores the data in a list, which gets filtered for duplicate rows; then I use a for loop to take each entry in the list as a tuple and update the table one tuple at a time. This method just takes too long. Is there a way for me to bulk insert all rows? I have tried looking online, but the methods I am reading about do not seem to help in my situation.

Many thanks,

David

import glob
import os
import csv
import mysql.connector

# MYSQL logon
mydb = mysql.connector.connect(
    host="localhost",
    user="David",
    passwd="Sword",
    database="twitch"
)
mycursor = mydb.cursor()

# list for stream data file names
streamData=[]

# This function obtains the file name list from a folder; this is to open files in other functions
def getFileNames():
    global streamData
    global topGames

    # the folders to be scanned
    #os.chdir("D://UG_Project_Data")
    os.chdir("E://UG_Project_Data")
    # obtains stream data file names
    for file in glob.glob("*streamD*"):
        streamData.append(file)
    return

# List to store stream data from csv files
sData = []
# Function to read all streamData csv files and store data in a list
def streamsToList():
    global streamData
    global sData

    # Same as gamesToList
    index = len(streamData)
    num = 0
    theFile = streamData[0]
    for x in range(index):
        if (num == 301):
            filterStreams(sData)
            num = 0
            sData.clear()
        try:
            theFile = streamData[x]
            timestamp = theFile[0:15]
            dateTime = timestamp[4:8]+"-"+timestamp[2:4]+"-"+timestamp[0:2]+"T"+timestamp[9:11]+":"+timestamp[11:13]+":"+timestamp[13:15]+"Z"
            with open (theFile, encoding="utf-8-sig") as f:
                reader = csv.reader(f)
                next(reader) # skip header
                for row in reader:
                    if (row != []):
                        col1 = row[0]
                        col2 = row[1]
                        col3 = row[2]
                        col4 = row[3]
                        col5 = row[4]
                        col6 = row[5]
                        col7 = row[6]
                        col8 = row[7]
                        col9 = row[8]
                        col10 = row[9]
                        col11 = row[10]
                        col12 = row[11]
                        col13 = dateTime
                        temp = col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13
                        sData.append(temp)
        except:
            print("Problem file:")
            print(theFile)
        print(num)
        num +=1
    return

def filterStreams(self):
    sData = self
    dataSet = set(tuple(x) for x in sData)
    sData = [ list (x) for x in dataSet ]
    return createStreamDB(sData)

# Function to create a table of stream data
def createStreamDB(self):
    global mydb
    global mycursor
    sData = self
    tupleList = ()
    for x in sData:
        tupleList = tuple(x)
        sql = "INSERT INTO streams (id, user_id, user_name, game_id, community_ids, type, title, viewer_count, started_at, language, thumbnail_url, tag_ids, time_stamp) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
        val = tupleList
        try:
            mycursor.execute(sql, val)
            mydb.commit()
        except:
            test = 1
    return

if __name__== '__main__':
    getFileNames()
    streamsToList()
    filterStreams(sData)

If some of your rows succeed but some fail, do you want your database to be left in a corrupt state? If not, try to commit outside the loop, like this:

for x in sData:
    tupleList = tuple(x)
    sql = "INSERT INTO streams (id, user_id, user_name, game_id, community_ids, type, title, viewer_count, started_at, language, thumbnail_url, tag_ids, time_stamp) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    val = tupleList
    try:
        mycursor.execute(sql, val)
    except:
        # do some thing
        pass
try:
    mydb.commit()
except:
    test = 1

And if you don't, try to load your CSV file into MySQL directly:

LOAD DATA INFILE "/home/your_data.csv"
INTO TABLE CSVImport
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
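
If you prefer to drive this from Python, the same statement can be issued through mysql.connector, provided the server has local_infile enabled and the connection is allowed to send local files. A minimal sketch, assuming the streams table from the question and the example file path above:

import mysql.connector

# Sketch: run LOAD DATA LOCAL INFILE from Python.
# Assumes the MySQL server has local_infile enabled and the client
# is allowed to send local files (allow_local_infile=True).
mydb = mysql.connector.connect(
    host="localhost",
    user="David",
    passwd="Sword",
    database="twitch",
    allow_local_infile=True,
)
mycursor = mydb.cursor()

load_sql = (
    "LOAD DATA LOCAL INFILE '/home/your_data.csv' "
    "INTO TABLE streams "
    "COLUMNS TERMINATED BY ',' "
    "OPTIONALLY ENCLOSED BY '\"' "
    "LINES TERMINATED BY '\\n' "
    "IGNORE 1 LINES"
)
mycursor.execute(load_sql)
mydb.commit()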

Also, to make things clearer: I've outlined three ways to insert this data, in case you insist on using Python because you have some processing to do on your data.

Bad way

In [18]: def inside_loop(): 
    ...:     start = time.time() 
    ...:     for i in range(10000): 
    ...:         mycursor = mydb.cursor() 
    ...:         sql = "insert into t1(name, age)values(%s, %s)" 
    ...:         try: 
    ...:             mycursor.execute(sql, ("frank", 27)) 
    ...:             mydb.commit() 
    ...:         except: 
    ...:             print("Failure..") 
    ...:     print("cost :{}".format(time.time() - start)) 
    ...: 

Time cost:

In [19]: inside_loop()                                                                                                                                                                                                                        
cost :5.92155909538269 

Okay way

In [9]: def outside_loop(): 
   ...:     start = time.time() 
   ...:     for i in range(10000): 
   ...:         mycursor = mydb.cursor() 
   ...:         sql = "insert into t1(name, age)values(%s, %s)" 
   ...:         try: 
   ...:             mycursor.execute(sql, ["frank", 27]) 
   ...:         except: 
   ...:             print("do something ..") 
   ...:              
   ...:     try: 
   ...:         mydb.commit() 
   ...:     except: 
   ...:         print("Failure..") 
   ...:     print("cost :{}".format(time.time() - start))

Time cost:

In [10]: outside_loop()                                                                                                                                                                                                                       
cost :0.9959311485290527

Maybe there is still a better way, even a best one (i.e., use pandas to process your data, and try redesigning your table ...).
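
For instance, the duplicate-filtering step could be done with pandas instead of a Python set; a rough sketch (the glob pattern and CSV layout are taken from the question, everything else is illustrative):

import glob
import pandas as pd

# Sketch: read all stream CSVs, drop duplicate rows, and return a
# list of tuples ready to be passed to executemany().
def load_and_filter(pattern="*streamD*"):
    frames = [pd.read_csv(f, encoding="utf-8-sig") for f in glob.glob(pattern)]
    df = pd.concat(frames, ignore_index=True).drop_duplicates()
    return list(df.itertuples(index=False, name=None))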

You might like my presentation Load Data Fast! in which I compared different methods of inserting bulk data and ran benchmarks to see which was the fastest method.

Inserting one row at a time, committing a transaction for each row, is about the worst way you can do it.

Using LOAD DATA INFILE is fastest by a wide margin, although there are some configuration changes you need to make on a default MySQL instance to allow it to work. Read the MySQL documentation about the secure_file_priv and local_infile options.

Even without using LOAD DATA INFILE, you can do much better. You can insert multiple rows per INSERT, and you can execute multiple INSERT statements per transaction.

I wouldn't try to INSERT the whole 100 million rows in a single transaction, though. My habit is to commit about once every 10,000 rows.
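
A sketch of that pattern with mysql.connector, reusing the INSERT statement from the question: collect the rows first, then send them with executemany() in chunks and commit once per chunk (the 10,000-row batch size is just the habit mentioned above):

import mysql.connector

mydb = mysql.connector.connect(
    host="localhost", user="David", passwd="Sword", database="twitch"
)
mycursor = mydb.cursor()

sql = ("INSERT INTO streams (id, user_id, user_name, game_id, community_ids, "
       "type, title, viewer_count, started_at, language, thumbnail_url, "
       "tag_ids, time_stamp) "
       "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")

BATCH = 10000  # commit roughly every 10,000 rows

def bulk_insert(rows):
    # rows: a list of 13-tuples, e.g. the de-duplicated list built in the question's code
    for start in range(0, len(rows), BATCH):
        chunk = rows[start:start + BATCH]
        # executemany() lets the connector rewrite this into a single
        # multi-row INSERT per call, so each round trip carries many rows.
        mycursor.executemany(sql, chunk)
        mydb.commit()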
