
Concatenating and sorting thousands of CSV files

I have thousands of CSV files on disk, each approximately 10 MB in size (~10K columns). Most of these columns hold real (float) values.

I would like to create a dataframe by concatenating these files. Once I have this dataframe, I would like to sort its entries by the first two columns.

I currently have the following:

import numpy as np
import pandas as pd

my_dfs = []
for file in p_files:
    my_dfs.append(
        pd.read_csv(file, sep=':', dtype={'c1': np.object_, 'c2': np.object_}))

print("Concatenating files ...")
df_merged = pd.concat(my_dfs)

print("Sorting the result by the first two columns...")
df_merged = df_merged.sort_values(['videoID', 'frameID'], ascending=[True, True])

print("Saving it to disk ..")
df_merged.to_csv(p_output, sep=':', index=False)

But this requires so much memory that my process is killed before producing the result (in the logs I see that the process is killed when it's using around 10 GB of memory).

I am trying to figure out where exactly it fails, but I haven't managed to yet (though I hope to capture the stdout soon).

Is there a better way to do this in Pandas?

Loading them into a database is easy, flexible for making changes later on and takes advantage of all the optimization work that goes into databases. Once you've loaded it, if you wanted to get an iterable of the data, you could run the following query and be done:

SELECT * FROM my_table ORDER BY column1, column2
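The table itself has to exist before that query can run. A minimal sketch of what the setup might look like, using Python's built-in sqlite3 module (the table and column names here are placeholders matching the query above, not taken from the original files):

```python
import sqlite3

# hypothetical schema for the query above: the table and column
# names are placeholders, not taken from the original files
conn = sqlite3.connect(":memory:")  # a file path would persist the data
c = conn.cursor()
c.execute("CREATE TABLE my_table (column1 TEXT, column2 TEXT, value REAL)")
c.executemany("INSERT INTO my_table VALUES (?, ?, ?)",
              [("v2", "10", 0.5), ("v1", "3", 0.1)])
conn.commit()

# rows come back already sorted by the first two columns
for row in c.execute("SELECT * FROM my_table ORDER BY column1, column2"):
    print(row)
```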

I'm pretty sure there are more direct ways to load CSV data from within sqlite3 itself, but if you don't want to do it directly in sqlite, you can use Python, taking advantage of the fact that csv.reader is an iterator, so only a minimal amount of data is held in memory at a time:

import csv
import sqlite3

conn = sqlite3.connect(dbpath)
c = conn.cursor()

# assumes mytable has already been created with a matching schema
for path in paths:
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter=':')
        next(reader)  # skip the header row
        # executemany consumes the reader row by row, so the whole
        # file is never held in memory at once
        c.executemany("INSERT INTO mytable VALUES (?,?,?)", reader)
conn.commit()

That way, you never have to load too much into memory at once, and you can take advantage of sqlite.
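One way to take advantage of sqlite here (an addition, not part of the original answer) is to build an index on the two sort columns after loading: the ORDER BY can then read rows in index order instead of sorting the whole table at query time. A sketch, assuming the same hypothetical three-column schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use the real dbpath in practice
c = conn.cursor()
# hypothetical three-column schema; col1 and col2 are the sort columns
c.execute("CREATE TABLE IF NOT EXISTS mytable (col1 TEXT, col2 TEXT, val REAL)")
# the index lets ORDER BY col1, col2 walk rows in index order
# rather than sorting the full table for every query
c.execute("CREATE INDEX IF NOT EXISTS idx_sort ON mytable (col1, col2)")
conn.commit()
```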

After that (if you wanted to again do it in Python) you could do:

import csv
import sqlite3

conn = sqlite3.connect(dbpath)
c = conn.cursor()

with open(outpath, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=':')
    # the cursor streams the sorted rows, so writerows never needs
    # the full result set in memory
    writer.writerows(c.execute("SELECT * FROM mytable ORDER BY col1, col2"))
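Putting both halves together, here is a self-contained sketch of the whole pipeline on two tiny ':'-separated sample files (the paths, the three-column schema, and the sample data are all placeholders, not the asker's real inputs):

```python
import csv
import os
import sqlite3
import tempfile

# build two tiny ':'-separated sample files standing in for the real inputs
tmpdir = tempfile.mkdtemp()
samples = {
    "a.csv": [["v2", "7", "0.5"], ["v1", "9", "0.2"]],
    "b.csv": [["v1", "3", "0.9"]],
}
paths = []
for name, rows in samples.items():
    p = os.path.join(tmpdir, name)
    with open(p, "w", newline="") as f:
        csv.writer(f, delimiter=":").writerows(rows)
    paths.append(p)

conn = sqlite3.connect(":memory:")  # a file path would persist the data
c = conn.cursor()
c.execute("CREATE TABLE mytable (col1 TEXT, col2 TEXT, val REAL)")

# load: stream each file into sqlite row by row
for path in paths:
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=":")
        c.executemany("INSERT INTO mytable VALUES (?,?,?)", reader)
conn.commit()

# export: stream the sorted rows straight back out to disk
outpath = os.path.join(tmpdir, "merged.csv")
with open(outpath, "w", newline="") as f:
    writer = csv.writer(f, delimiter=":")
    writer.writerows(c.execute("SELECT * FROM mytable ORDER BY col1, col2"))
```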
