
Combine multiple pandas DataFrames

I have multiple DataFrames with the same format. I want to create a DataFrame that combines them: each row of the result should be the row, taken from whichever input DataFrame has the maximum value in a certain column at that row position.

Example

data1:
    Name        Age
0   michael     18
1   lincoln     20
2   theodore    84
3   alexandre   95

data2:
    Name        Age
0   sayed       17
1   hurley      29
2   sawyer      44
3   John        15

data3:
    Name        Age
0   walter      50
1   jesse       15
2   fring       20
3   saul        34

The expected result would be:

Result:
    Name        Age
0   walter      50
1   hurley      29
2   theodore    84
3   alexandre   95

I have more than 500,000 rows and 51 columns, so I'm looking for something faster than looping over all the data (O(n^2) complexity is too slow).

Thank you.

You can use np.where to pick, row by row, between two DataFrames according to which one has the larger Age, and apply that choice to every column. Then use functools.reduce() to fold the pairwise choice over all the DataFrames.

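import functools

import numpy as np
import pandas as pd

# All frames share the same columns, so take them from any one of them.
columns = data1.columns

def choose_larger(df1, df2):
    # True where df1 has the strictly larger Age; ties fall through to df2.
    m = df1['Age'] > df2['Age']
    # Build a fresh frame on every call instead of mutating a shared global one.
    return pd.DataFrame({col: np.where(m, df1[col], df2[col]) for col in columns})

# Another possible function
def choose_larger2(df1, df2):
    # Reshape the mask to a column vector so it broadcasts across all columns.
    m = (df1['Age'] > df2['Age']).to_numpy()[:, None]
    # Note: mixing strings and numbers here yields object-dtype columns.
    return pd.DataFrame(np.where(m, df1, df2), columns=columns)

df_max = functools.reduce(choose_larger, [data1, data2, data3])
print(df_max)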

        Name  Age
0     walter   50
1     hurley   29
2   theodore   84
3  alexandre   95
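
The pairwise reduce above makes one pass per DataFrame. As an alternative sketch (not part of the original answer), the frames can be stacked vertically with pd.concat and the winning row for each position picked with groupby(...).idxmax(). This assumes every frame has the same index labels and column names; the sample data is copied from the question, and the names frames, stacked, winners and result are only illustrative.

```python
import pandas as pd

# Sample data from the question.
data1 = pd.DataFrame({"Name": ["michael", "lincoln", "theodore", "alexandre"],
                      "Age": [18, 20, 84, 95]})
data2 = pd.DataFrame({"Name": ["sayed", "hurley", "sawyer", "John"],
                      "Age": [17, 29, 44, 15]})
data3 = pd.DataFrame({"Name": ["walter", "jesse", "fring", "saul"],
                      "Age": [50, 15, 20, 34]})
frames = [data1, data2, data3]

# Stack vertically; `keys` adds an outer index level recording the source frame.
stacked = pd.concat(frames, keys=range(len(frames)))

# Group by the original row label (inner index level) and get the
# (frame, row) label of the largest Age in each group.
winners = stacked.groupby(level=1)["Age"].idxmax()

# Pull those rows out and drop the source-frame level from the index.
result = stacked.loc[winners.tolist()].droplevel(0)
print(result)
```

On ties, idxmax keeps the first occurrence, so the earliest frame in the list wins.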

If you stack the dataframes horizontally:

# Suffix each frame's columns with its position: Name0, Age0, Name1, Age1, ...
dfs = [df.add_suffix(str(index)) for index, df in enumerate([data1, data2, data3])]
df = pd.concat(dfs, axis=1)

You can use idxmax() to find the column indexes of the max Age per row:

indexes = df.filter(like='Age').idxmax(axis=1)

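indexes now holds, for each row, the label of the column containing the largest Age (for example Age2 for row 0). Selecting df[indexes] yields one column per row, so the wanted values sit on the diagonal; shifting the frame one column to the right first places each Name under the label of its Age column. Note that df[indexes] is an n x n frame, so this is only practical for small inputs: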

pd.DataFrame({'Name': np.diag(df.shift(axis=1)[indexes]), 'Age': np.diag(df[indexes])})

#         Name  Age
# 0     walter   50
# 1     hurley   29
# 2   theodore   84
# 3  alexandre   95
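
Because df[indexes] has as many columns as rows, the diagonal trick above would need an n x n intermediate for the 500,000 rows mentioned in the question. A sketch that keeps memory proportional to the input (an alternative, not the answerer's method) finds the winning frame per row with NumPy's argmax and then gathers each column from that frame. It assumes all frames have identical columns and are aligned by row position; pick_rows_with_max and its key parameter are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

def pick_rows_with_max(frames, key="Age"):
    """For each row position, keep the row from the frame whose `key` is largest."""
    # Shape (k, n): one row of key values per input frame.
    keys = np.stack([f[key].to_numpy() for f in frames])
    # For each of the n positions, the index of the frame holding the maximum.
    winner = keys.argmax(axis=0)
    positions = np.arange(keys.shape[1])
    out = {}
    for col in frames[0].columns:
        stacked = np.stack([f[col].to_numpy() for f in frames])  # shape (k, n)
        out[col] = stacked[winner, positions]  # one value per row position
    return pd.DataFrame(out, index=frames[0].index)

# Sample data from the question.
data1 = pd.DataFrame({"Name": ["michael", "lincoln", "theodore", "alexandre"],
                      "Age": [18, 20, 84, 95]})
data2 = pd.DataFrame({"Name": ["sayed", "hurley", "sawyer", "John"],
                      "Age": [17, 29, 44, 15]})
data3 = pd.DataFrame({"Name": ["walter", "jesse", "fring", "saul"],
                      "Age": [50, 15, 20, 34]})

result = pick_rows_with_max([data1, data2, data3])
print(result)
```

The work here is one pass over each column of each frame, so it grows linearly with the number of rows rather than quadratically.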

