I've got 7000 data frames with columns
Date, X_1
Date, X_2
...
Each dataframe has around 2500 rows.
The dates sometimes overlap, but are not guaranteed to do so.
I'd like to combine them into a dataframe of the form
Date X_1 X_2 etc.
I tried applying combine_first 7000 times, but it was really slow: each call had to create a new object, each slightly bigger than the last.
Is there a more efficient way to combine multiple dataframes?
Assuming that Date is the index rather than a column, you can do an "outer" join:
df1.join([df2, df3, ..., df7000], how='outer')
Note: it may be more efficient to pass in a generator of DataFrames rather than a list.
For example:
import pandas as pd

df1 = pd.DataFrame([[1, 2]], columns=['a', 'b'])
df2 = pd.DataFrame([[3, 4]], index=[1], columns=['c', 'd'])
df3 = pd.DataFrame([[5, 6], [7, 8]], columns=['e', 'f'])
In [4]: df1.join([df2, df3], how='outer')
Out[4]:
     a    b    c    d  e  f
0  1.0  2.0  NaN  NaN  5  6
1  NaN  NaN  3.0  4.0  7  8
If 'Date' is a column, you can use set_index first:
df1.set_index('Date', inplace=True)
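With thousands of frames, a single pd.concat along the columns is usually faster still than a chained join, since it builds the combined result in one pass. A minimal sketch with three small frames (the frame contents here are made up for illustration):

```python
import pandas as pd

# Three small frames indexed by Date, with partially overlapping dates.
dfs = [
    pd.DataFrame({'X_1': [1.0, 2.0]},
                 index=pd.Index(['2020-01-01', '2020-01-02'], name='Date')),
    pd.DataFrame({'X_2': [3.0, 4.0]},
                 index=pd.Index(['2020-01-02', '2020-01-03'], name='Date')),
    pd.DataFrame({'X_3': [5.0]},
                 index=pd.Index(['2020-01-01'], name='Date')),
]

# One outer join across all frames at once: the result index is the
# union of all dates, with NaN where a frame has no value for a date.
combined = pd.concat(dfs, axis=1, join='outer').sort_index()
print(combined)
```

This assumes each frame's Date index has no duplicates; if it does, concat along axis=1 will refuse to align the rows.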
How about this:

import os
import pandas as pd

# Read every file in the directory into a dataframe,
# then stack them and index the result by Date.
paths = [os.path.join(dir_with_data, name) for name in os.listdir(dir_with_data)]
list_of_dfs = [pd.read_csv(path) for path in paths]
df = pd.concat(list_of_dfs)
df = df.set_index('Date')
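A self-contained version of that read-and-combine workflow might look like the sketch below; the directory, file names, and column values are invented for illustration, and the frames are combined on Date with an outer-join concat along the columns:

```python
import os
import tempfile
import pandas as pd

# Set up a throwaway directory with two small CSVs standing in
# for the real data files.
dir_with_data = tempfile.mkdtemp()
pd.DataFrame({'Date': ['2020-01-01', '2020-01-02'], 'X_1': [1, 2]}).to_csv(
    os.path.join(dir_with_data, 'x1.csv'), index=False)
pd.DataFrame({'Date': ['2020-01-02', '2020-01-03'], 'X_2': [3, 4]}).to_csv(
    os.path.join(dir_with_data, 'x2.csv'), index=False)

# Read each file with Date as the index, then outer-join them
# column-wise so each file contributes one X_i column.
frames = [
    pd.read_csv(os.path.join(dir_with_data, name), index_col='Date')
    for name in sorted(os.listdir(dir_with_data))
]
combined = pd.concat(frames, axis=1).sort_index()
print(combined)
```

Dates present in only some files come out as NaN in the other columns.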