
Read a CSV and insert into a database: performance

I have a task to read a CSV file line by line and insert the rows into a database.

The CSV file contains about 1.7 million lines.

I use Python with the SQLAlchemy ORM (its merge function) to do this, but it takes over five hours.

Is it caused by Python's slow performance, or by the SQLAlchemy ORM?

Or what if I use Golang to do it, would that give obviously better performance? (But I have no experience with Go. Besides, this job needs to be scheduled every month.)

Hope you guys can give some suggestions, thanks!

Update: the database is MySQL.

For such a mission you don't want to insert data line by line :) Basically, you have two ways:

  1. Ensure that SQLAlchemy does not run queries one by one; use a batch INSERT query instead (see How to do a batch insert in MySQL). A sketch follows this list.
  2. Massage your data the way you need, then output it to a temporary CSV file and run LOAD DATA [LOCAL] INFILE as suggested above. If you don't need to preprocess the data, just feed the CSV straight to the database (I assume it's MySQL).
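A minimal sketch of the batched-insert approach (option 1), assuming SQLAlchemy 1.4+ with the pymysql driver; the connection URL, the table name dbtable_name, and the CSV file name are placeholders:

import csv
from sqlalchemy import MetaData, Table, create_engine

# Placeholder connection URL; adjust driver and credentials to your setup.
engine = create_engine("mysql+pymysql://username:password@localhost/dbname")
metadata = MetaData()
table = Table("dbtable_name", metadata, autoload_with=engine)  # reflect the existing table

BATCH_SIZE = 10000  # rows per batched INSERT; tune for your row width

with engine.begin() as conn, open("dbtable_name.csv", newline="") as f:
    reader = csv.DictReader(f)  # header names must match the table's column names
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            conn.execute(table.insert(), batch)  # one round trip per batch, not per row
            batch = []
    if batch:
        conn.execute(table.insert(), batch)  # flush the remainder

With 1.7 million rows this turns ~1.7M single-row round trips into a few hundred batched statements, which is usually the difference between hours and minutes.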

Follow the three steps below:

  1. Save the CSV file with the name of the table you want to load it into.
  2. Execute the Python script below to create the table dynamically (update the CSV filename and the DB parameters).
  3. Execute "mysqlimport --ignore-lines=1 --fields-terminated-by=, --local -u dbuser -p db_name dbtable_name.csv" 执行“ mysqlimport --ignore-lines = 1 --fields-terminated-by =,--local -u dbuser -p db_name dbtable_name.csv”

PYTHON CODE:

import pandas as pd
from mysql.connector import connect

csv_file = 'dbtable_name.csv'
df = pd.read_csv(csv_file)
table_name = csv_file.split('.')[0]  # table name = CSV file name without extension

# Build a CREATE TABLE statement by mapping pandas dtypes to MySQL column types.
query = "CREATE TABLE " + table_name + " (\n"
for count, column in enumerate(df.columns):
    query += column
    dtype = df[column].dtype
    if dtype == 'int64':
        query += "\t\t int(11) NOT NULL"
    elif dtype == 'float64':
        query += "\t\t float(10,2) NOT NULL"
    else:
        # 'object' and any unmapped dtype falls back to varchar
        query += "\t\t varchar(64) NOT NULL"

    if count == 0:
        query += " PRIMARY KEY"  # first CSV column becomes the primary key

    if count < len(df.columns) - 1:
        query += ",\n"

query += " );"
# print(query)

database = connect(host='localhost',    # your host
                   user='username',     # username
                   passwd='password',   # password
                   db='dbname')         # database name
curs = database.cursor(dictionary=True)
curs.execute(query)
database.close()
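As an alternative to the mysqlimport command in step 3, the same import can be run from Python once the table exists. A sketch assuming local_infile is enabled on the MySQL server, and reusing the placeholder connection details from above:

from mysql.connector import connect

# allow_local_infile must be enabled on both client and server for LOCAL INFILE.
database = connect(host='localhost', user='username', passwd='password',
                   db='dbname', allow_local_infile=True)
curs = database.cursor()
curs.execute(
    "LOAD DATA LOCAL INFILE 'dbtable_name.csv' "
    "INTO TABLE dbtable_name "
    "FIELDS TERMINATED BY ',' "
    "IGNORE 1 LINES"  # skip the header row, like --ignore-lines=1
)
database.commit()
database.close()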
