Combine multiple pandas DataFrames
I have multiple DataFrames that share the same format. I want to create a DataFrame that combines them: each row of the result is the row, taken from one of the original DataFrames at the same position, in which a certain column has the maximum value.
Example
data1 :
Name Age
0 michael 18
1 lincoln 20
2 theodore 84
3 alexandre 95
data2 :
Name Age
0 sayed 17
1 hurley 29
2 sawyer 44
3 John 15
data3 :
Name Age
0 walter 50
1 jesse 15
2 fring 20
3 saul 34
The expected result would be:
Results :
Name Age
0 walter 50
1 hurley 29
2 theodore 84
3 alexandre 95
I have more than 500,000 rows and 51 columns, so I'm looking for something faster than just iterating over all the data (O(n²) complexity is too expensive).
Thank you.
You can use np.where to choose, row by row, the value from whichever of two DataFrames has the larger Age, and apply this to all columns of the DataFrame. Finally, use reduce() (from functools) to apply it across all the DataFrames.
import functools

import numpy as np
import pandas as pd

columns = data1.columns

def choose_larger(df1, df2):
    # Boolean mask: True where df1 has the larger Age in that row
    m = df1['Age'] > df2['Age']
    # Build a fresh frame on each call instead of mutating a shared one
    return pd.DataFrame({col: np.where(m, df1[col], df2[col]) for col in columns})

# Another possible function
def choose_larger2(df1, df2):
    m = df1['Age'] > df2['Age']
    # Repeat the mask once per column so it matches the frames' shape
    m = pd.concat([m]*len(columns), axis=1)
    return pd.DataFrame(np.where(m, df1, df2), columns=columns)

df_max = functools.reduce(choose_larger, [data1, data2, data3])
print(df_max)
Name Age
0 walter 50
1 hurley 29
2 theodore 84
3 alexandre 95
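For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch on the sample frames from the question. The helper name `pick_rows_with_larger` and its `key` parameter are illustrative and not part of the original answer, and the sketch assumes every frame has the same index and columns.

```python
import functools

import numpy as np
import pandas as pd

# Sample frames copied from the question.
data1 = pd.DataFrame({"Name": ["michael", "lincoln", "theodore", "alexandre"],
                      "Age": [18, 20, 84, 95]})
data2 = pd.DataFrame({"Name": ["sayed", "hurley", "sawyer", "John"],
                      "Age": [17, 29, 44, 15]})
data3 = pd.DataFrame({"Name": ["walter", "jesse", "fring", "saul"],
                      "Age": [50, 15, 20, 34]})


def pick_rows_with_larger(df1, df2, key="Age"):
    """Row by row, keep the row from whichever frame has the larger `key`.

    Hypothetical helper: assumes df1 and df2 share index and columns.
    On ties the row from df1 is kept.
    """
    mask = (df1[key] >= df2[key]).to_numpy()
    # Build a new frame on every call rather than mutating a shared one.
    return pd.DataFrame(
        {col: np.where(mask, df1[col].to_numpy(), df2[col].to_numpy())
         for col in df1.columns},
        index=df1.index,
    )


df_max = functools.reduce(pick_rows_with_larger, [data1, data2, data3])
print(df_max)
```

Each pairwise step is vectorized, so the total work grows with rows × columns × number of frames rather than quadratically in the number of rows.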
If you stack the DataFrames horizontally:
dfs = [df.add_suffix(index) for index, df in enumerate([data1, data2, data3])]
df = pd.concat(dfs, axis=1)
You can use idxmax() to find, for each row, the label of the column holding the max Age:
indexes = df.filter(like='Age').idxmax(axis=1)
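To make the intermediate result concrete, the sketch below rebuilds the question's sample frames, stacks them side by side, and shows which Age column wins for each row. The names `wide` and `winners` are illustrative, and the suffix is passed as `str(i)` purely as a precaution.

```python
import pandas as pd

# Sample frames copied from the question.
data1 = pd.DataFrame({"Name": ["michael", "lincoln", "theodore", "alexandre"],
                      "Age": [18, 20, 84, 95]})
data2 = pd.DataFrame({"Name": ["sayed", "hurley", "sawyer", "John"],
                      "Age": [17, 29, 44, 15]})
data3 = pd.DataFrame({"Name": ["walter", "jesse", "fring", "saul"],
                      "Age": [50, 15, 20, 34]})

# Suffix each frame's columns with its position, then place them side by side.
dfs = [df.add_suffix(str(i)) for i, df in enumerate([data1, data2, data3])]
wide = pd.concat(dfs, axis=1)
print(list(wide.columns))
# ['Name0', 'Age0', 'Name1', 'Age1', 'Name2', 'Age2']

# For each row, the label of the Age column with the largest value.
winners = wide.filter(like="Age").idxmax(axis=1)
print(winners.tolist())
# ['Age2', 'Age1', 'Age0', 'Age0']
```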
Then selecting columns with indexes gives every max Age, and shift(axis=1) gives each corresponding Name (shifting the columns one position to the right puts each Name under the label of its Age column):
pd.DataFrame({'Name': np.diag(df.shift(axis=1)[indexes]), 'Age': np.diag(df[indexes])})
# Name Age
# 0 walter 50
# 1 hurley 29
# 2 theodore 84
# 3 alexandre 95
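One caveat with the np.diag lookup above: df[indexes] selects one column per row, so it appears to build an n × n intermediate, which is unlikely to fit in memory at 500,000 rows. A possible alternative, sketched here and not taken from the original answer, uses NumPy's argmax with integer indexing to pick one value per row directly. The variable names are illustrative, and the sketch assumes all frames share the same row order and columns.

```python
import numpy as np
import pandas as pd

# Sample frames copied from the question.
data1 = pd.DataFrame({"Name": ["michael", "lincoln", "theodore", "alexandre"],
                      "Age": [18, 20, 84, 95]})
data2 = pd.DataFrame({"Name": ["sayed", "hurley", "sawyer", "John"],
                      "Age": [17, 29, 44, 15]})
data3 = pd.DataFrame({"Name": ["walter", "jesse", "fring", "saul"],
                      "Age": [50, 15, 20, 34]})

frames = [data1, data2, data3]

# ages[i, j] is the Age of row j in frame i, shape (n_frames, n_rows).
ages = np.stack([f["Age"].to_numpy() for f in frames])
# For each row, the position of the frame with the largest Age
# (the earliest frame wins ties).
winner = ages.argmax(axis=0)
rows = np.arange(ages.shape[1])

# For every column, stack the frames and pick one value per row.
result = pd.DataFrame({
    col: np.stack([f[col].to_numpy() for f in frames])[winner, rows]
    for col in frames[0].columns
})
print(result)
```

This stacks one column at a time, so the largest temporary array has n_frames × n_rows elements and the same indexing works unchanged for all 51 columns.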