Read a csv and insert to database performance

Question

I have a mission to read a csv file line by line and insert them to database.

And the csv file contains about 1.7 million lines.

I use python with sqlalchemy orm(merge function) to do this. But it spend over five hours.

Is it caused by python slow performance or sqlalchemy or sqlalchemy?

or what if i use golang to do it to make a obvious better performance?(but i have no experience on go. Besides, this job need to be scheduled every month)

Hope you guy giving any suggestion, thanks!

Update: database - mysql

Answer 1

For such a mission you don't want to insert data line by line :) Basically, you have 2 ways:

Ensure that sqlalchemy does not run queries one by one. Use BATCH INSERT query ( How to do a batch insert in MySQL ) instead.
Massage your data in a way you need, then output it into some temporary CSV file and then run LOAD DATA [LOCAL] INFILE as suggested above. If you don't need to preprocess you data, just feed the CSV to the database (I assume it's MySQL)

Answer 2

Follow below three steps

Save the CSV file with the name of table what you want to save it to.
Execute below python script to create a table dynamically (Update CSV filename, db parameters)
Execute "mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -u dbuser -p db_name dbtable_name.csv"

PYTHON CODE :

import numpy as np
import pandas as pd
from mysql.connector import connect

csv_file = 'dbtable_name.csv'
df = pd.read_csv(csv_file)
table_name = csv_file.split('.')

query = "CREATE TABLE " + table_name[0] + "( \n" 
for count in np.arange(df.columns.values.size):
    query += df.columns.values[count]
    if df.dtypes[count] == 'int64':
        query += "\t\t int(11) NOT NULL"
    elif df.dtypes[count] == 'object':
        query += "\t\t varchar(64) NOT NULL"
    elif df.dtypes[count] == 'float64':
        query += "\t\t float(10,2) NOT NULL"


    if count == 0:
        query += " PRIMARY KEY"

    if count < df.columns.values.size - 1:
        query += ",\n"

query += " );"
#print(query)

database = connect(host='localhost',  # your host
                     user='username', # username
                     passwd='password',     # password
                     db='dbname') #dbname
curs = database.cursor(dictionary=True)
curs.execute(query)
# print(query)

Read a csv and insert to database performance

Question

2 answers

solution1
2 2016-03-22 06:35:18

solution2
0 2018-06-06 07:14:17

Read a csv and insert to database performance

Question

2 answers

solution1 2 2016-03-22 06:35:18

solution2 0 2018-06-06 07:14:17

solution1
2 2016-03-22 06:35:18

solution2
0 2018-06-06 07:14:17