
PC hangs while importing data from pandas dataframe into SQL database using to_sql

I am currently dealing with a large amount of data (62 files, ~150 MB each, with ~3.5 million rows each) that I need to upload to a MySQL database, so I used the pandas library to read through the .csv files and build one dataframe containing all the data.

About the data:

The data is split across multiple files, each containing the values for a set of locations (coordinates) on a different day. For example, one file holds data for about 3.5 million coordinates (these are fixed and common to all the files) on a particular day, and I have 62 such files, i.e. data for 62 different dates.
I want all of this data in one table: the date-specific columns should be added as extra columns, while the common columns (locations/coordinates) should appear only once. An illustration of how I want the data laid out in the table is present here.

So far my code looks like this:

#relevant imports:
import os
import MySQLdb as sql
import numpy as np
from functools import reduce
from sqlalchemy import create_engine
import glob
import pandas as pd

#Accessing database:
engine = create_engine("mysql://root:PASSWORD@localhost/DBNAME")
con = engine.connect()

#Directory change:

path="C:\\Users\\DELL\\Desktop\\DataAnalysis-Internship 18'\\extracted data\\"
os.chdir(path)

#listing file names:
files=glob.glob("IN.201*.csv")
num_files=len(files)

#DATAFRAMES:
#initialising the first dataframe:
df=pd.read_csv(path+files[0])
df.rename(columns={"avg":"avg"+str(0)}, inplace=True)

#merging the remaining files one by one, giving each "avg" column a unique name:
i=1
for file in files[1:]:
    df_next=pd.read_csv(path+file)
    df_next.rename(columns={"avg":"avg"+str(i)}, inplace=True)
    df=df.merge(df_next, on=['lon','lat','country'])
    i=i+1

df.to_sql(con=con, name='final_table', if_exists='replace', chunksize=10000)

con.close() 

When I run this code, though, my PC's performance slowly degrades and it eventually starts hanging. My guess is that too much memory is being used, perhaps because I am building a single dataframe out of many and then transferring it to the database (but I am not sure).

What should I do to upload all these files into my table the way I want? Is there a more optimized approach? I thought of merging 5 files at a time and then using if_exists='append' in the to_sql call to combine these smaller dataframes into one table, instead of building one huge dataframe and transferring it all at once, but I am not sure whether 'append' will join the tables the way I want it to (see the rough sketch below).
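Roughly, this is the kind of batching I have in mind (an untested sketch that reuses the path, files, num_files and con variables from the code above; batch_size = 5 is just an example value):

#Rough sketch of the batching idea (untested).
#NOTE: as far as I understand, if_exists='append' stacks new *rows* under the
#existing columns -- it does not add new columns -- which is why I doubt this
#gives the wide layout described above.
batch_size = 5
for start in range(0, num_files, batch_size):
    batch = files[start:start + batch_size]
    #merge the files of this batch into one intermediate dataframe:
    df_batch = pd.read_csv(path + batch[0])
    df_batch.rename(columns={"avg": "avg" + str(start)}, inplace=True)
    for offset, file in enumerate(batch[1:], start=start + 1):
        df_next = pd.read_csv(path + file)
        df_next.rename(columns={"avg": "avg" + str(offset)}, inplace=True)
        df_batch = df_batch.merge(df_next, on=['lon', 'lat', 'country'])
    #write each batch to the same table:
    df_batch.to_sql(con=con, name='final_table', if_exists='append', chunksize=10000)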

My guess is that the problem arises from the df = df.merge call occurring inside the loop.

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying. Source: Why does concatenation of DataFrames get exponentially slower?

Try adding all the dataframes to a list df_list and then concatenating them once at the end:

import pandas as pd

df_list = []
df_list.append(pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6])))
df_list.append(pd.DataFrame(dict(c=[1, 2, 3], d=[4, 5, 6])))

pd.concat(df_list, axis=1)
Out[]:
   a  b  c  d
0  1  4  1  4
1  2  5  2  5
2  3  6  3  6
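
Applied to your files, one way to use that pattern (an untested sketch, assuming each CSV only has the columns lon, lat, country and avg, that every file has exactly the same set of coordinates, and that con is the connection you already created) is to set the common key columns as the index on each frame and concatenate once on axis=1:

import glob
import pandas as pd

files = glob.glob("IN.201*.csv")

df_list = []
for i, file in enumerate(files):
    df_i = pd.read_csv(file)
    df_i.rename(columns={"avg": "avg" + str(i)}, inplace=True)
    # put the shared key columns in the index so axis=1 alignment happens on them
    df_list.append(df_i.set_index(['lon', 'lat', 'country']))

# a single concat at the end instead of merging inside the loop
df = pd.concat(df_list, axis=1)

# write out with the keys restored as ordinary columns
df.reset_index().to_sql(con=con, name='final_table', if_exists='replace',
                        index=False, chunksize=10000)

Since every avg column gets a unique name, the alignment is done once through the index rather than through repeated merges.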
