How to aggregate between two dataframes in pandas
Good evening.
I have two dataframes, each with several columns and a couple of million rows, but right now we're interested in only two columns of each: FOOs and IDs for DataFrame1, BARs and IDs for DataFrame2. IDs is basically a many-to-many relationship, and it goes like this:
import pandas as pd
data1=['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3','foo4',
'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1=[1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]
data2=['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
'bar3', 'bar2', 'bar1' ]
id2=[1, 1, 1, 1, 2, 3, 4, 5, 6, 7]
df_foo=pd.DataFrame(data=zip(data1, id1), columns=['FOOs', 'IDs'],
dtype='object')
df_bar=pd.DataFrame(data=zip(data2, id2), columns=['BARs', 'IDs'],
dtype='object')
What I need to do is aggregate all the FOOs with the BARs. I have a solution that works, but it looks messy:
def my_agg(series):
    return df_bar[df_bar.IDs.isin(series)].groupby('BARs').agg({'BARs': pd.unique})

print(df_foo.groupby('FOOs').agg({'FOOs': pd.unique, 'IDs': my_agg}).values)
And the output is:
[[array(['foo1'], dtype=object)
array([[array(['bar1'], dtype=object)],
[array(['bar2'], dtype=object)],
[array(['bar3'], dtype=object)]], dtype=object)]
[array(['foo2'], dtype=object)
array([[array(['bar1'], dtype=object)],
[array(['bar2'], dtype=object)]], dtype=object)]
[array(['foo3'], dtype=object)
array([[array(['bar1'], dtype=object)],
[array(['bar3'], dtype=object)]], dtype=object)]
[array(['foo4'], dtype=object)
array([[array(['bar1'], dtype=object)]], dtype=object)]]
The question is: is there a clean solution that produces nicely readable output like this?
FOOs  BARs
foo1  bar1
      bar2
      bar3
foo2  bar1
      bar2
foo3  bar1
      bar3
foo4  bar1
Thanks in advance.
How about:
df_foo.merge(df_bar, on='IDs')[['FOOs', 'BARs']].drop_duplicates()
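For reference, here is a self-contained sketch of that merge-then-deduplicate idea on the sample data from the question. The sorting, the index reset, and the names `pairs` and `rows` are my own additions for illustration, assumed only to make the result easier to read; they are not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
data1 = ['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3', 'foo4',
         'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1 = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]
data2 = ['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
         'bar3', 'bar2', 'bar1']
id2 = [1, 1, 1, 1, 2, 3, 4, 5, 6, 7]

df_foo = pd.DataFrame({'FOOs': data1, 'IDs': id1})
df_bar = pd.DataFrame({'BARs': data2, 'IDs': id2})

# Join on the shared key, keep only the two columns of interest,
# then drop repeated (FOO, BAR) pairs.
pairs = (
    df_foo.merge(df_bar, on='IDs')[['FOOs', 'BARs']]
    .drop_duplicates()
    .sort_values(['FOOs', 'BARs'])  # assumption: sorted output is wanted
    .reset_index(drop=True)
)

rows = pairs.values.tolist()
print(pairs.to_string(index=False))
```

With a couple of million rows, a many-to-many merge can produce a much larger intermediate frame, so it may be worth calling `drop_duplicates()` on the `['FOOs', 'IDs']` and `['BARs', 'IDs']` columns of each frame before merging.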
A merge followed by a pivot may work, though I'm not sure about the speed.
df_out = pd.merge(df_foo, df_bar, on='IDs').pivot_table(index=['FOOs', 'BARs'])
Output:
                IDs
FOOs BARs
foo1 bar1  2.800000
     bar2  1.000000
     bar3  3.000000
foo2 bar1  2.600000
     bar2  3.500000
foo3 bar1  4.666667
     bar3  2.000000
foo4 bar1  3.000000
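Note that the IDs column above is the mean of the IDs for each (FOO, BAR) pair, because `pivot_table` averages values by default; that number has no real meaning here. As a possible variation (my own sketch, not something proposed in the thread), grouping the merged frame and taking the group size gives the same two-level index with a row count instead. The names `merged` and `counts` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
data1 = ['foo1', 'foo2', 'foo1', 'foo1', 'foo3', 'foo2', 'foo3', 'foo4',
         'foo1', 'foo3', 'foo1', 'foo2', 'foo1', 'foo2', 'foo3']
id1 = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 7]
data2 = ['bar1', 'bar1', 'bar1', 'bar2', 'bar3', 'bar1', 'bar1',
         'bar3', 'bar2', 'bar1']
id2 = [1, 1, 1, 1, 2, 3, 4, 5, 6, 7]

df_foo = pd.DataFrame({'FOOs': data1, 'IDs': id1})
df_bar = pd.DataFrame({'BARs': data2, 'IDs': id2})

merged = df_foo.merge(df_bar, on='IDs')

# One row per (FOO, BAR) pair; the value is how many merged rows
# produced that pair, rather than an average of the IDs.
counts = merged.groupby(['FOOs', 'BARs']).size()
print(counts)
```

The printed Series shows each FOO once with its BARs listed beneath it, which is close to the layout asked for in the question.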