How to aggregate between two DataFrames in pandas

Question

Good evening.

I have two DataFrames, each with several columns and a couple of million rows, but right now we're interested in only two columns from each: FOOs and IDs in DataFrame1, BARs and IDs in DataFrame2. IDs is basically a many-to-many relationship, and it goes like this:

import pandas as pd

data1=['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3','foo4',
      'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1=[1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]

data2=['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
      'bar3', 'bar2', 'bar1' ]
id2=[1, 1, 1, 1, 2, 3, 4, 5, 6, 7]

df_foo=pd.DataFrame(data=zip(data1, id1), columns=['FOOs', 'IDs'],
                    dtype='object')
df_bar=pd.DataFrame(data=zip(data2, id2), columns=['BARs', 'IDs'],
                    dtype='object')

What I need to do is aggregate all the FOOs with the BARs. I have a solution that works, but it looks messy:

def my_agg(series):
    return df_bar[df_bar.IDs.isin(series)].groupby('BARs').agg({'BARs': pd.unique})

print(df_foo.groupby('FOOs').agg({'FOOs': pd.unique, 'IDs': my_agg}).values)

And the output is:

[[array(['foo1'], dtype=object)
  array([[array(['bar1'], dtype=object)],
         [array(['bar2'], dtype=object)],
         [array(['bar3'], dtype=object)]], dtype=object)]
 [array(['foo2'], dtype=object)
  array([[array(['bar1'], dtype=object)],
         [array(['bar2'], dtype=object)]], dtype=object)]
 [array(['foo3'], dtype=object)
  array([[array(['bar1'], dtype=object)],
         [array(['bar3'], dtype=object)]], dtype=object)]
 [array(['foo4'], dtype=object)
  array([[array(['bar1'], dtype=object)]], dtype=object)]]

The question is: is there a clean solution that produces nicely readable output, like this?

FOOs    BARs
foo1    bar1
        bar2
        bar3
foo2    bar1
        bar2
foo3    bar1
        bar3
foo4    bar1

Thanks in advance.

Answer 1

How about:

df_foo.merge(df_bar, on='IDs')[['FOOs', 'BARs']].drop_duplicates()
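For anyone who wants to try this end to end, here is a minimal sketch of the same merge-then-deduplicate idea on the sample data from the question. The sorting step and the variable name `pairs` are my additions, and the frames are built from dicts with default dtypes rather than `dtype='object'`, so treat it as an illustration rather than the exact code from the answer.

```python
import pandas as pd

# Sample data copied from the question.
data1 = ['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3', 'foo4',
         'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1 = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]
data2 = ['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
         'bar3', 'bar2', 'bar1']
id2 = [1, 1, 1, 1, 2, 3, 4, 5, 6, 7]

# Assumption: default dtypes instead of dtype='object' as in the question.
df_foo = pd.DataFrame({'FOOs': data1, 'IDs': id1})
df_bar = pd.DataFrame({'BARs': data2, 'IDs': id2})

# Inner join on IDs, keep only the two label columns, then drop repeated pairs.
pairs = (
    df_foo.merge(df_bar, on='IDs')[['FOOs', 'BARs']]
    .drop_duplicates()
    .sort_values(['FOOs', 'BARs'])  # added here only to get a stable, readable order
    .reset_index(drop=True)
)

print(pairs.to_string(index=False))
```

This gives one row per distinct (FOOs, BARs) pair; calling `pairs.set_index(['FOOs', 'BARs'])` afterwards would turn the pairs into a MultiIndex if the grouped layout from the question is preferred.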

Answer 2

merge and pivot_table may work; not sure about speed.

df_out = pd.merge(df_foo, df_bar, on='IDs').pivot_table(index=['FOOs', 'BARs'])

Output:

               IDs
FOOs BARs          
foo1 bar1  2.800000
     bar2  1.000000
     bar3  3.000000
foo2 bar1  2.600000
     bar2  3.500000
foo3 bar1  4.666667
     bar3  2.000000
foo4 bar1  3.000000
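Note that the IDs column in the output above is just the mean of the matching IDs for each pair, so only the index carries the answer. On the speed question, one possible variation (my own sketch, not something benchmarked on the real data) is to drop repeated rows from each frame before merging, so the many-to-many join has fewer rows to expand, and then use `groupby(...).size()` to get the same two-level layout. The names `foo_unique`, `bar_unique`, `merged`, `pair_counts` and `n_shared_ids` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
data1 = ['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3', 'foo4',
         'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1 = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]
data2 = ['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
         'bar3', 'bar2', 'bar1']
id2 = [1, 1, 1, 1, 2, 3, 4, 5, 6, 7]

# Assumption: default dtypes instead of dtype='object' as in the question.
df_foo = pd.DataFrame({'FOOs': data1, 'IDs': id1})
df_bar = pd.DataFrame({'BARs': data2, 'IDs': id2})

# Remove repeated (label, ID) rows first so the join produces fewer rows.
foo_unique = df_foo.drop_duplicates()
bar_unique = df_bar.drop_duplicates()

merged = foo_unique.merge(bar_unique, on='IDs')

# One entry per (FOOs, BARs) pair; the value is the number of distinct shared IDs.
pair_counts = merged.groupby(['FOOs', 'BARs']).size().rename('n_shared_ids')

print(pair_counts)
```

Whether the pre-deduplication actually helps depends on how many repeated rows the real frames contain, so it is worth timing both variants on a sample of the actual data.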
