
How to aggregate between two dataframes in pandas

Good evening.

I have two dataframes, each with several columns and a couple of million rows, but right now we're interested in only two columns of each: FOOs and IDs for DataFrame1, BARs and IDs for DataFrame2. IDs is basically a many-to-many relationship, and it goes like this:

import pandas as pd

data1=['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3','foo4',
      'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1=[1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]

data2=['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
      'bar3', 'bar2', 'bar1' ]
id2=[1, 1, 1, 1, 2, 3, 4, 5, 6, 7]

df_foo=pd.DataFrame(data=zip(data1, id1), columns=['FOOs', 'IDs'],
                    dtype='object')
df_bar=pd.DataFrame(data=zip(data2, id2), columns=['BARs', 'IDs'],
                    dtype='object')

What I need to do is aggregate all the FOOs with the BARs. I have a solution that works, but it looks messy:

def my_agg(series):
    return df_bar[df_bar.IDs.isin(series)].groupby('BARs').agg({'BARs': pd.unique})

print(df_foo.groupby('FOOs').agg({'FOOs': pd.unique, 'IDs': my_agg}).values)

And the output is:

[[array(['foo1'], dtype=object)
  array([[array(['bar1'], dtype=object)],
         [array(['bar2'], dtype=object)],
         [array(['bar3'], dtype=object)]], dtype=object)]
 [array(['foo2'], dtype=object)
  array([[array(['bar1'], dtype=object)],
         [array(['bar2'], dtype=object)]], dtype=object)]
 [array(['foo3'], dtype=object)
  array([[array(['bar1'], dtype=object)],
         [array(['bar3'], dtype=object)]], dtype=object)]
 [array(['foo4'], dtype=object)
  array([[array(['bar1'], dtype=object)]], dtype=object)]]

The question is: is there a clean solution that produces nicely readable output, such as:

FOOs    BARs
foo1    bar1
        bar2
        bar3
foo2    bar1
        bar2
foo3    bar1
        bar3
foo4    bar1

Thanks in advance.

How about:

df_foo.merge(df_bar, on='IDs')[['FOOs', 'BARs']].drop_duplicates()
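To see the one-liner above end to end, here is a minimal self-contained sketch using the sample data from the question. Dropping duplicate (value, ID) pairs before the merge is my own addition rather than part of the answer: it is an assumption that this helps on inputs with millions of rows, since it shrinks the many-to-many intermediate result, and it has not been benchmarked. The name `pairs` is just illustrative.

```python
import pandas as pd

data1 = ['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3', 'foo4',
         'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1 = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]

data2 = ['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
         'bar3', 'bar2', 'bar1']
id2 = [1, 1, 1, 1, 2, 3, 4, 5, 6, 7]

df_foo = pd.DataFrame({'FOOs': data1, 'IDs': id1})
df_bar = pd.DataFrame({'BARs': data2, 'IDs': id2})

# Assumption: de-duplicating each side first keeps the merged frame small.
foo_ids = df_foo[['FOOs', 'IDs']].drop_duplicates()
bar_ids = df_bar[['BARs', 'IDs']].drop_duplicates()

# Join on the shared IDs, keep only the FOO/BAR pairs, and drop repeats.
pairs = (
    foo_ids.merge(bar_ids, on='IDs')[['FOOs', 'BARs']]
    .drop_duplicates()
    .sort_values(['FOOs', 'BARs'])
    .reset_index(drop=True)
)

print(pairs.to_string(index=False))
```

If the grouped layout from the question is preferred, `pairs.set_index(['FOOs', 'BARs'])` gives a MultiIndex whose repeated FOOs labels are hidden when printed.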

A merge followed by pivot_table may also work, though I'm not sure about the speed.

df_out = pd.merge(df_foo, df_bar, on='IDs').pivot_table(index=['FOOs', 'BARs'])

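Output (with pivot_table's default mean aggregation, the IDs column is the average ID per pair and can be ignored):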

               IDs
FOOs BARs          
foo1 bar1  2.800000
     bar2  1.000000
     bar3  3.000000
foo2 bar1  2.600000
     bar2  3.500000
foo3 bar1  4.666667
     bar3  2.000000
foo4 bar1  3.000000
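Since an averaged ID carries no meaning here, one possible variation is to group the merged frame by both columns and count rows instead. This is a sketch of a swapped-in technique (`groupby(...).size()`), not what the answer above does; the names `merged`, `counts` and `n_rows` are illustrative.

```python
import pandas as pd

data1 = ['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3', 'foo4',
         'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1 = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]

data2 = ['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
         'bar3', 'bar2', 'bar1']
id2 = [1, 1, 1, 1, 2, 3, 4, 5, 6, 7]

df_foo = pd.DataFrame({'FOOs': data1, 'IDs': id1})
df_bar = pd.DataFrame({'BARs': data2, 'IDs': id2})

# Many-to-many join on IDs: one row per matching (foo row, bar row) pair.
merged = df_foo.merge(df_bar, on='IDs')

# Count matched rows per (FOOs, BARs) pair; the result is a Series
# indexed by a sorted (FOOs, BARs) MultiIndex.
counts = merged.groupby(['FOOs', 'BARs']).size().rename('n_rows')

print(counts)
```

The index of `counts` has the same FOOs/BARs layout as the pivot_table result, and `counts.index.to_frame(index=False)` recovers the plain two-column list of pairs.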
