
How can I insert rows into mysql table faster using python?

I am trying to find a faster way to insert data into my table. The table should end up with over 100 million rows, but my code has been running for nearly 24 hours and the table currently holds only 9 million rows, with the job still in progress.

My code currently reads 300 CSV files at a time and stores the data in a list, which is then filtered for duplicate rows. I then use a for loop to convert each entry in the list to a tuple and insert it into the table, one tuple at a time. This takes too long. Is there a way to bulk insert all the rows? I have looked online, but the methods I found do not seem to fit my situation.

Many thanks,

David

import glob
import os
import csv
import mysql.connector

# MYSQL logon
mydb = mysql.connector.connect(
    host="localhost",
    user="David",
    passwd="Sword",
    database="twitch"
)
mycursor = mydb.cursor()

# list of stream data file names
streamData=[]

# This function obtains the list of file names from a folder; the files are
# opened in other functions
def getFileNames():
    global streamData
    global topGames

    # the folders to be scanned
    #os.chdir("D://UG_Project_Data")
    os.chdir("E://UG_Project_Data")
    # obtains stream data file names
    for file in glob.glob("*streamD*"):
        streamData.append(file)
    return

# List to store stream data from csv files
sData = []
# Function to read all streamData csv files and store data in a list
def streamsToList():
    global streamData
    global sData

    # Same as gamesToList
    index = len(streamData)
    num = 0
    theFile = streamData[0]
    for x in range(index):
        if (num == 301):
            filterStreams(sData)
            num = 0
            sData.clear()
        try:
            theFile = streamData[x]
            timestamp = theFile[0:15]
            dateTime = timestamp[4:8]+"-"+timestamp[2:4]+"-"+timestamp[0:2]+"T"+timestamp[9:11]+":"+timestamp[11:13]+":"+timestamp[13:15]+"Z"
            with open (theFile, encoding="utf-8-sig") as f:
                reader = csv.reader(f)
                next(reader) # skip header
                for row in reader:
                    if (row != []):
                        col1 = row[0]
                        col2 = row[1]
                        col3 = row[2]
                        col4 = row[3]
                        col5 = row[4]
                        col6 = row[5]
                        col7 = row[6]
                        col8 = row[7]
                        col9 = row[8]
                        col10 = row[9]
                        col11 = row[10]
                        col12 = row[11]
                        col13 = dateTime
                        temp = col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13
                        sData.append(temp)
        except:
            print("Problem file:")
            print(theFile)
        print(num)
        num +=1
    return

def filterStreams(self):
    sData = self
    dataSet = set(tuple(x) for x in sData)
    sData = [ list (x) for x in dataSet ]
    return createStreamDB(sData)

# Function to create a table of stream data
def createStreamDB(self):
    global mydb
    global mycursor
    sData = self
    tupleList = ()
    for x in sData:
        tupleList = tuple(x)
        sql = "INSERT INTO streams (id, user_id, user_name, game_id, community_ids, type, title, viewer_count, started_at, language, thumbnail_url, tag_ids, time_stamp) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
        val = tupleList
        try:
            mycursor.execute(sql, val)
            mydb.commit()
        except:
            test = 1
    return

if __name__== '__main__':
    getFileNames()
    streamsToList()
    filterStreams(sData)

If some of your rows succeed but others fail, do you want your database to be left in a corrupt state? If not, commit outside the loop, like this:

for x in sData:
    tupleList = tuple(x)
    sql = "INSERT INTO streams (id, user_id, user_name, game_id, community_ids, type, title, viewer_count, started_at, language, thumbnail_url, tag_ids, time_stamp) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    val = tupleList
    try:
        mycursor.execute(sql, val)
    except:
        # do some thing
        pass
try:
    mydb.commit()
except:
    test = 1

Otherwise, try loading your CSV file into MySQL directly:

LOAD DATA INFILE "/home/your_data.csv"
INTO TABLE CSVImport
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

Also, to make this clearer: I've defined three ways to insert the data, in case you insist on using Python because you need to process your data first.

Bad way

In [18]: def inside_loop(): 
    ...:     start = time.time() 
    ...:     for i in range(10000): 
    ...:         mycursor = mydb.cursor() 
    ...:         sql = "insert into t1(name, age)values(%s, %s)" 
    ...:         try: 
    ...:             mycursor.execute(sql, ("frank", 27)) 
    ...:             mydb.commit() 
    ...:         except: 
    ...:             print("Failure..") 
    ...:     print("cost :{}".format(time.time() - start)) 
    ...: 

Time cost:

In [19]: inside_loop()
cost :5.92155909538269

Okay way

In [9]: def outside_loop(): 
   ...:     start = time.time() 
   ...:     for i in range(10000): 
   ...:         mycursor = mydb.cursor() 
   ...:         sql = "insert into t1(name, age)values(%s, %s)" 
   ...:         try: 
   ...:             mycursor.execute(sql, ["frank", 27]) 
   ...:         except: 
   ...:             print("do something ..") 
   ...:              
   ...:     try: 
   ...:         mydb.commit() 
   ...:     except: 
   ...:         print("Failure..") 
   ...:     print("cost :{}".format(time.time() - start))

Time cost:

In [10]: outside_loop()
cost :0.9959311485290527

There may well be better ways still (for example, using pandas to process your data, or redesigning your table).

You might like my presentation Load Data Fast! in which I compared different methods of inserting bulk data, and did benchmarks to see which was the fastest method.

Inserting one row at a time, committing a transaction for each row, is about the worst way you can do it.

Using LOAD DATA INFILE is fastest by a wide margin, although there are some configuration changes you need to make on a default MySQL instance to allow it to work. Read the MySQL documentation about the options secure_file_priv and local_infile.
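As a rough sketch of how this could be driven from Python, the snippet below issues LOAD DATA LOCAL INFILE through an existing connection such as the question's mydb. It rests on several assumptions: the connection was opened with allow_local_infile=True, local_infile is enabled on the server, and every CSV has exactly the 12 columns the question reads. LOAD_SQL, timestamp_from_name and load_file are hypothetical names; the file-name slicing simply mirrors the question's code.

```python
import os

# Column list and SET clause follow the question's streams table (assumed layout).
LOAD_SQL = (
    "LOAD DATA LOCAL INFILE %s INTO TABLE streams "
    "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
    "LINES TERMINATED BY '\\n' "
    "IGNORE 1 LINES "
    "(id, user_id, user_name, game_id, community_ids, type, title, "
    "viewer_count, started_at, language, thumbnail_url, tag_ids) "
    "SET time_stamp = %s"
)


def timestamp_from_name(file_name):
    """Rebuild the question's timestamp from a 'DDMMYYYY_HHMMSS...' file name."""
    ts = file_name[0:15]
    return f"{ts[4:8]}-{ts[2:4]}-{ts[0:2]}T{ts[9:11]}:{ts[11:13]}:{ts[13:15]}Z"


def load_file(mydb, path):
    """Load one CSV file with a single statement and return the affected row count."""
    cursor = mydb.cursor()
    try:
        # The driver quotes and escapes both parameters on the client side.
        cursor.execute(LOAD_SQL, (path, timestamp_from_name(os.path.basename(path))))
        mydb.commit()
        return cursor.rowcount
    finally:
        cursor.close()
```

If the files were written on Windows, the line terminator may need to be '\r\n' instead. Note also that this path skips the Python-side duplicate filtering from the question, so duplicates would have to be handled some other way, for instance by a unique key on the table.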

Even without using LOAD DATA INFILE, you can do much better. You can insert multiple rows per INSERT, and you can execute multiple INSERT statements per transaction.
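A minimal sketch of the multi-row idea, assuming the mydb connection and streams table from the question. insert_many is a hypothetical helper. It relies on cursor.executemany(), which the MySQL Connector/Python documentation describes as batching the parameter sets of an INSERT into multiple-row syntax, so the whole list goes to the server in far fewer statements than a per-row loop.

```python
INSERT_SQL = (
    "INSERT INTO streams (id, user_id, user_name, game_id, community_ids, type, "
    "title, viewer_count, started_at, language, thumbnail_url, tag_ids, time_stamp) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
)


def insert_many(mydb, rows):
    """Insert all rows with one executemany() call and one commit."""
    rows = [tuple(r) for r in rows]
    if not rows:
        return 0
    cursor = mydb.cursor()
    try:
        cursor.executemany(INSERT_SQL, rows)
        mydb.commit()
    except Exception:
        # Undo the partial batch so the table is not left half-loaded.
        mydb.rollback()
        raise
    finally:
        cursor.close()
    return len(rows)
```

In the question's code this would replace the loop in createStreamDB, for example insert_many(mydb, sData). Very large lists may run into the server's max_allowed_packet limit, so it is safer to pass a bounded number of rows per call.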

I wouldn't try to INSERT the whole 100 million rows in a single transaction, though. My habit is to commit about once every 10,000 rows.
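The commit cadence could be sketched roughly as follows. batched and insert_in_batches are hypothetical helpers, and the default of 10,000 is just the habit mentioned above, not a tuned value. The code assumes a DB-API connection such as the question's mydb and an INSERT statement with %s placeholders.

```python
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_in_batches(mydb, sql, rows, batch_size=10_000):
    """Send one executemany() and one commit per batch; return the rows sent."""
    cursor = mydb.cursor()
    total = 0
    try:
        for batch in batched(rows, batch_size):
            cursor.executemany(sql, batch)
            mydb.commit()  # one transaction per batch, not per row
            total += len(batch)
    finally:
        cursor.close()
    return total
```

Because batched accepts any iterable, the rows can come from a generator that reads the CSV files lazily, so the full 100 million rows never have to sit in memory at once.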

