How to aggregate, combining dataframes, with pandas groupby

I have a dataframe df and a column df['table'] such that each item in df['table'] is another dataframe with the same headers/number of columns. I was wondering if there's a way to do a groupby like this:

Original dataframe:

name    table
Bob     Pandas df1
Joe     Pandas df2
Bob     Pandas df3
Bob     Pandas df4
Emily   Pandas df5

After groupby:

name    table
Bob     Pandas df containing the appended df1, df3, and df4
Joe     Pandas df2
Emily   Pandas df5

I found this code snippet, which uses a groupby with a lambda to join strings in a dataframe, but I haven't been able to figure out how to append entire dataframes in a groupby.

df['table'] = df.groupby(['name'])['table'].transform(lambda x : ' '.join(x)) 

I've also tried df['table'] = df.groupby(['name'])['HTML'].apply(list), but that gives me a df['table'] of all NaN.

Thanks for your help!!

  • Given 3 dataframes
import pandas as pd

dfa = pd.DataFrame({'a': [1, 2, 3]})
dfb = pd.DataFrame({'a': ['a', 'b', 'c']})
dfc = pd.DataFrame({'a': ['pie', 'steak', 'milk']})
  • Given another dataframe, with dataframes as the values of one of its columns
df = pd.DataFrame({'name': ['Bob', 'Joe', 'Bob', 'Bob', 'Emily'], 'table': [dfa, dfa, dfb, dfc, dfb]})

# print the type for the first value in the table column, to confirm it's a dataframe
print(type(df.loc[0, 'table']))
[out]:
<class 'pandas.core.frame.DataFrame'>
  • Each group of dataframes can be combined into a single dataframe by using .groupby, collecting the dataframes of each group into a list, and combining that list with pd.concat
# aggregates every non-key column, whether there is one column of dataframes or several
dfg = df.groupby('name').agg(lambda x: pd.concat(list(x)).reset_index(drop=True))

# display(dfg.loc['Bob', 'table'])
       a
0      1
1      2
2      3
3      a
4      b
5      c
6    pie
7  steak
8   milk

# to aggregate only selected columns (one or several) out of many
dfg = df.groupby('name')[['table']].agg(lambda x: pd.concat(list(x)).reset_index(drop=True))

Not a duplicate of the usual list-aggregation approaches, such as:

df.groupby('name')['table'].apply(list)
df.groupby('name').agg(list)
df.groupby('name')['table'].agg(list)
df.groupby('name').agg({'table': list})
df.groupby('name').agg(lambda x: list(x))
  • However, these all result in a StopIteration error when the values being aggregated are dataframes.
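If the aggregation functions misbehave on a column of dataframes in your pandas version, one option is to skip agg entirely and loop over the groups. The following is only a sketch under that assumption: combine_tables is a hypothetical helper name, not a pandas function, and the sample data mirrors the dataframes defined earlier in this answer.

```python
import pandas as pd

# sample data: three small dataframes stored as values in the 'table' column
dfa = pd.DataFrame({'a': [1, 2, 3]})
dfb = pd.DataFrame({'a': ['a', 'b', 'c']})
dfc = pd.DataFrame({'a': ['pie', 'steak', 'milk']})
df = pd.DataFrame({'name': ['Bob', 'Joe', 'Bob', 'Bob', 'Emily'],
                   'table': [dfa, dfa, dfb, dfc, dfb]})


def combine_tables(frame, key='name', col='table'):
    """Hypothetical helper: concatenate the dataframes in `col` for each value of `key`."""
    combined = {}
    # sort=False keeps the groups in order of first appearance
    for name, group in frame.groupby(key, sort=False):
        # ignore_index=True renumbers the rows of the combined dataframe from 0
        combined[name] = pd.concat(group[col].tolist(), ignore_index=True)
    # rebuild a two-column dataframe: one row per key, one combined dataframe per row
    return pd.DataFrame({key: list(combined), col: list(combined.values())})


result = combine_tables(df)
```

Here result has one row per name, and result.loc[0, 'table'] is Bob's combined dataframe. The explicit loop is more verbose than agg, but it only relies on iterating over the groups and on pd.concat.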

Here, let's create a dataframe that has dataframes as the values of a column:

First, I start with three dataframes:

import pandas as pd

# create the dataframes that we will assign to Bob and Joe; note the b's and j's in the 'letter' column:

df1 = pd.DataFrame({'var1':[12, 34, -4, None], 'letter':['b1', 'b2', 'b3', 'b4']})
df2 = pd.DataFrame({'var1':[1, 23, 44, 0], 'letter':['j1', 'j2', 'j3', 'j4']})
df3 = pd.DataFrame({'var1':[22, -3, 7, 78], 'letter':['b5', 'b6', 'b7', 'b8']})

# let's make a list of dictionaries:
list_of_dfs = [
    {'name':'Bob' ,'table':df1},
    {'name':'Joe' ,'table':df2},
    {'name':'Bob' ,'table':df3}
]

# construct the main dataframe:
original_df = pd.DataFrame(list_of_dfs)
print(original_df)

original_df.shape #shows (3, 2)

Now that we have the original dataframe as the input, we produce the new dataframe using groupby(), agg(), and pd.concat(). We also reset the index of the grouped result, so that name becomes a regular column again.

new_df = original_df.groupby('name')['table'].agg(lambda series: pd.concat(series.tolist())).reset_index()
print(new_df)

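# check that Bob's table is now a concatenated table of df1 and df3
# (.iloc[0] takes the first match by position instead of relying on Bob having index label 0):
new_df.loc[new_df['name'] == 'Bob', 'table'].iloc[0]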

The output of the last line of code is:

    var1    letter
0   12.0    b1
1   34.0    b2
2   -4.0    b3
3    NaN    b4
0   22.0    b5
1   -3.0    b6
2    7.0    b7
3   78.0    b8
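Note that the index in this output repeats (0 to 3, then 0 to 3 again), because pd.concat keeps each input's own index by default. If a single running index is preferred, passing ignore_index=True to pd.concat is one way to get it. A minimal sketch, reusing the df1 and df3 definitions from above:

```python
import pandas as pd

df1 = pd.DataFrame({'var1': [12, 34, -4, None], 'letter': ['b1', 'b2', 'b3', 'b4']})
df3 = pd.DataFrame({'var1': [22, -3, 7, 78], 'letter': ['b5', 'b6', 'b7', 'b8']})

# default behaviour: each input keeps its own index labels
stacked = pd.concat([df1, df3])
print(stacked.index.tolist())     # [0, 1, 2, 3, 0, 1, 2, 3]
print(len(stacked.loc[0]))        # 2, because label 0 now matches two rows

# ignore_index=True discards the original labels and renumbers from 0
renumbered = pd.concat([df1, df3], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Inside the groupby this would mean using pd.concat(series.tolist(), ignore_index=True) in the lambda. Whether to renumber depends on whether the original row labels carry meaning.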
