I've got 7000 data frames with columns
Date, X_1
Date, X_2
...
Each dataframe has around 2500 rows.
The dates sometimes overlap, but are not guaranteed to do so.
I'd like to combine them into a dataframe of the form
Date X_1 X_2 etc.
I tried applying combine_first 7000 times, but it was really slow: each call had to create a new object, each slightly bigger than the last.
Is there a more efficient way to combine multiple dataframes?
Assuming that Date is the index rather than a column, you can do an "outer" join:
df1.join([df2, df3, ..., df7000], how='outer')
Note: it may be more efficient to pass in a generator of DataFrames rather than a list.
For example:
import pandas as pd

df1 = pd.DataFrame([[1, 2]], columns=['a', 'b'])
df2 = pd.DataFrame([[3, 4]], index=[1], columns=['c', 'd'])
df3 = pd.DataFrame([[5, 6], [7, 8]], columns=['e', 'f'])
In [4]: df1.join([df2, df3], how='outer')
Out[4]:
     a    b    c    d  e  f
0  1.0  2.0  NaN  NaN  5  6
1  NaN  NaN  3.0  4.0  7  8
If 'Date' is a column, you can use set_index first:
df1.set_index('Date', inplace=True)
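With thousands of frames, a single pd.concat along the columns is usually faster still than a chained join, since it builds the combined result in one pass. A minimal sketch with three small frames (the frame contents here are made up for illustration):

```python
import pandas as pd

# Three small frames indexed by Date, with partially overlapping dates.
dfs = [
    pd.DataFrame({'X_1': [1.0, 2.0]},
                 index=pd.Index(['2020-01-01', '2020-01-02'], name='Date')),
    pd.DataFrame({'X_2': [3.0, 4.0]},
                 index=pd.Index(['2020-01-02', '2020-01-03'], name='Date')),
    pd.DataFrame({'X_3': [5.0]},
                 index=pd.Index(['2020-01-01'], name='Date')),
]

# One outer join across all frames at once: the result index is the
# union of all dates, with NaN where a frame has no value for a date.
combined = pd.concat(dfs, axis=1, join='outer').sort_index()
print(combined)
```

This assumes each frame's Date index has no duplicates; if it does, concat along axis=1 will refuse to align the rows.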
How about this:

import os
import pandas as pd

# Read every file in the directory into a dataframe,
# then stack them and index the result by Date.
paths = [os.path.join(dir_with_data, name) for name in os.listdir(dir_with_data)]
list_of_dfs = [pd.read_csv(path) for path in paths]
df = pd.concat(list_of_dfs)
df = df.set_index('Date')
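A self-contained version of that read-and-combine workflow might look like the sketch below; the directory, file names, and column values are invented for illustration, and the frames are combined on Date with an outer-join concat along the columns:

```python
import os
import tempfile
import pandas as pd

# Set up a throwaway directory with two small CSVs standing in
# for the real data files.
dir_with_data = tempfile.mkdtemp()
pd.DataFrame({'Date': ['2020-01-01', '2020-01-02'], 'X_1': [1, 2]}).to_csv(
    os.path.join(dir_with_data, 'x1.csv'), index=False)
pd.DataFrame({'Date': ['2020-01-02', '2020-01-03'], 'X_2': [3, 4]}).to_csv(
    os.path.join(dir_with_data, 'x2.csv'), index=False)

# Read each file with Date as the index, then outer-join them
# column-wise so each file contributes one X_i column.
frames = [
    pd.read_csv(os.path.join(dir_with_data, name), index_col='Date')
    for name in sorted(os.listdir(dir_with_data))
]
combined = pd.concat(frames, axis=1).sort_index()
print(combined)
```

Dates present in only some files come out as NaN in the other columns.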