
Combine multiple pandas DataFrames

I have multiple DataFrames that share the same format. I want to create a DataFrame that combines them: each row of the result is the corresponding row from whichever of the original DataFrames has the maximum value in a certain column.

Example

data1 :   
Name            Age
0   michael     18
1   lincoln     20
2   theodore    84
3   alexandre   95

data2 :   
Name            Age
0   sayed       17
1   hurley      29
2   sawyer      44
3   John        15

data3 :   
Name            Age
0   walter      50
1   jesse       15
2   fring       20
3   saul        34

The expected result would be:

Results :   
Name            Age
0   walter      50
1   hurley      29
2   theodore    84
3   alexandre   95

I have more than 500,000 rows and 51 columns, so I'm looking for something faster than looping over all the data (O(n^2) complexity is too slow at this size).

Thank you.

You can use np.where to choose, row by row, between two DataFrames according to which one has the larger Age. Apply that choice to every column, then use reduce() to fold the comparison over all the DataFrames.

import functools

import numpy as np
import pandas as pd

# All frames share the same columns, so take them from the first one
columns = data1.columns

def choose_larger(df1, df2):
    # True where df1 has the larger Age in that row
    m = df1['Age'] > df2['Age']
    # Build a fresh frame, picking each column's value from df1 or df2
    return pd.DataFrame({col: np.where(m, df1[col], df2[col]) for col in columns})

# Another possible function
def choose_larger2(df1, df2):
    m = df1['Age'] > df2['Age']
    # Repeat the mask once per column so it matches the frames' shape
    m = pd.concat([m]*len(columns), axis=1)
    return pd.DataFrame(np.where(m, df1, df2), columns=columns)

df_max = functools.reduce(choose_larger, [data1, data2, data3])
print(df_max)

        Name  Age
0     walter   50
1     hurley   29
2   theodore   84
3  alexandre   95
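
The same fold can be written for any key column without a per-column loop. The following is only a sketch: the name rowwise_max_by and its key parameter are made up for illustration and are not part of pandas, and it assumes every frame has the same index, row order, and columns. It relies on DataFrame.where, which keeps the caller's values where the condition is true and takes the other frame's values elsewhere.

```python
import functools

import pandas as pd

data1 = pd.DataFrame({'Name': ['michael', 'lincoln', 'theodore', 'alexandre'],
                      'Age': [18, 20, 84, 95]})
data2 = pd.DataFrame({'Name': ['sayed', 'hurley', 'sawyer', 'John'],
                      'Age': [17, 29, 44, 15]})
data3 = pd.DataFrame({'Name': ['walter', 'jesse', 'fring', 'saul'],
                      'Age': [50, 15, 20, 34]})


def rowwise_max_by(frames, key):
    """Hypothetical helper: keep, per row, the row whose `key` is largest."""
    def pick(best, other):
        # True where the running result already holds the larger key;
        # >= means ties keep the row from the earlier frame
        keep = best[key] >= other[key]
        # axis=0 aligns the boolean Series with the rows of the frame
        return best.where(keep, other, axis=0)
    return functools.reduce(pick, frames)


result = rowwise_max_by([data1, data2, data3], key='Age')
print(result)
```

With the strict > used in choose_larger, a tie takes the row from the later frame; the >= here keeps the earlier one instead. Which behaviour is wanted depends on the data.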

If you stack the DataFrames horizontally, giving each frame's columns a numeric suffix:

dfs = [df.add_suffix(index) for index, df in enumerate([data1, data2, data3])]
df = pd.concat(dfs, axis=1)

You can use idxmax() to find, for each row, the label of the Age column that holds the maximum:

indexes = df.filter(like='Age').idxmax(axis=1)
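
To make the intermediate result concrete, here is a small self-contained sketch using the question's sample data. The suffix is passed as str(i) here, which is a cautious choice on my part and not something the answer requires.

```python
import pandas as pd

data1 = pd.DataFrame({'Name': ['michael', 'lincoln', 'theodore', 'alexandre'],
                      'Age': [18, 20, 84, 95]})
data2 = pd.DataFrame({'Name': ['sayed', 'hurley', 'sawyer', 'John'],
                      'Age': [17, 29, 44, 15]})
data3 = pd.DataFrame({'Name': ['walter', 'jesse', 'fring', 'saul'],
                      'Age': [50, 15, 20, 34]})

# Columns become Name0, Age0, Name1, Age1, Name2, Age2
dfs = [d.add_suffix(str(i)) for i, d in enumerate([data1, data2, data3])]
df = pd.concat(dfs, axis=1)

# Keep only the Age columns, then take the label of the row-wise maximum
indexes = df.filter(like='Age').idxmax(axis=1)
print(indexes.tolist())  # ['Age2', 'Age1', 'Age0', 'Age0']
```

So indexes holds one column label per row, naming the frame that wins that row.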

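Then df[indexes] selects, for each row, the column holding that row's maximum Age, and np.diag reads the i-th row's value from the i-th selected column. Applying shift(axis=1) first moves each Name under the neighbouring Age label, so the same lookup returns the corresponding Name: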

pd.DataFrame({'Name': np.diag(df.shift(axis=1)[indexes]), 'Age': np.diag(df[indexes])})

#         Name  Age
# 0     walter   50
# 1     hurley   29
# 2   theodore   84
# 3  alexandre   95
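
Note that df[indexes] selects one column per row, so the intermediate frame has n rows and n columns, which becomes very large at 500,000 rows. The following is a sketch of an alternative that avoids that intermediate by using NumPy's argmax and integer indexing. It is a different technique from the np.diag lookup above, and it assumes all frames have the same row order and columns.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame({'Name': ['michael', 'lincoln', 'theodore', 'alexandre'],
                      'Age': [18, 20, 84, 95]})
data2 = pd.DataFrame({'Name': ['sayed', 'hurley', 'sawyer', 'John'],
                      'Age': [17, 29, 44, 15]})
data3 = pd.DataFrame({'Name': ['walter', 'jesse', 'fring', 'saul'],
                      'Age': [50, 15, 20, 34]})
frames = [data1, data2, data3]

# Ages side by side: shape (n_rows, n_frames)
ages = np.column_stack([f['Age'].to_numpy() for f in frames])
# Per row, the position of the frame with the largest Age
# (argmax returns the first maximum, so ties go to the earliest frame)
winner = ages.argmax(axis=1)

# All frames stacked: shape (n_frames, n_rows, n_cols)
stacked = np.stack([f.to_numpy() for f in frames])
rows = np.arange(len(data1))

# For row i, take row i of frame winner[i]; result shape (n_rows, n_cols)
result = pd.DataFrame(stacked[winner, rows], columns=data1.columns)
# The stacked array has object dtype, so let pandas re-infer column types
result = result.infer_objects()
print(result)
```

The work here grows with rows times frames times columns, so there is no quadratic term in the number of rows.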
