
Find common values between 3 DataFrames?

I have 3 dataframes: df1, df2, and df3.

df1 = 'num' 'type' 
       23     a 
       34     b 
       89     a 
       90     c

df2 = 'num' 'type' 
       23     a 
       34     b 
       56     a 
       90     c

df3 = 'num' 'type' 
       56     a 
       34     s 
       71     a 
       90     c

What I want is an output of all the 'num' values that appear in 2 or more of the dfs, flagged with how many dfs each 'num' value appeared in. So I want something like this:

df = 'num' 'type' 'count' 
       23     a       2 
       34     s       3 
       90     c       3 
       56     a       2

I tried doing an inner merge, but that only accounts for 'num' values that appear in all 3 dfs, ignoring the ones that appear in only 2 of the 3. What's the best way to go about this?

Et voilà, my friend:

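import pandas as pd
# Stack the three frames, then count the rows for each 'num'.
df_full = pd.concat([df1, df2, df3], axis=0)
df_agg = df_full.groupby('num').agg({'type': 'count'})
# Keep only the 'num' values that occur at least twice.
df_agg = df_agg.loc[df_agg['type'] >= 2]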
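
Note that the row count above equals the number of dataframes only if each 'num' occurs at most once per dataframe. If that cannot be guaranteed, one option is to tag each row with its source frame and count distinct sources. This is only a sketch: the helper column name 'src' is made up for illustration, and the sample data is copied from the question.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({'num': [23, 34, 89, 90], 'type': ['a', 'b', 'a', 'c']})
df2 = pd.DataFrame({'num': [23, 34, 56, 90], 'type': ['a', 'b', 'a', 'c']})
df3 = pd.DataFrame({'num': [56, 34, 71, 90], 'type': ['a', 's', 'a', 'c']})

# Tag every row with the position of the frame it came from.
# 'src' is a hypothetical column name, not part of the original data.
frames = [df.assign(src=i) for i, df in enumerate([df1, df2, df3])]
df_full = pd.concat(frames, ignore_index=True)

# Count how many distinct frames each 'num' appears in.
n_frames = df_full.groupby('num')['src'].nunique()

# Keep the values present in at least two frames.
common = n_frames[n_frames >= 2]
print(common)
```

With the question's data, where no 'num' repeats inside a frame, this should agree with the plain row count.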

Here is a collections.Counter solution, which has O(n) complexity.

The results of the count can easily be brought back into pandas, if required.

from collections import Counter
import pandas as pd

c = sum((Counter(df['num']) for df in [df1, df2, df3]), Counter())

c_masked = {k: v for k, v in c.items() if v>=2}

# {23: 2, 34: 3, 90: 3, 56: 2}

df = pd.DataFrame.from_dict(c_masked, orient='index')

#     0
# 23  2
# 34  3
# 90  3
# 56  2
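
As shown, from_dict with orient='index' leaves the count in a column named 0 and the 'num' values in the index. If a regular two-column frame is preferred, something like the following might do. It is a sketch only: the column labels 'num' and 'count' are chosen to match the question's desired output, and the Counter is updated in place rather than summed.

```python
from collections import Counter

import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({'num': [23, 34, 89, 90], 'type': ['a', 'b', 'a', 'c']})
df2 = pd.DataFrame({'num': [23, 34, 56, 90], 'type': ['a', 'b', 'a', 'c']})
df3 = pd.DataFrame({'num': [56, 34, 71, 90], 'type': ['a', 's', 'a', 'c']})

# Accumulate the counts in a single Counter, one frame at a time.
c = Counter()
for df in (df1, df2, df3):
    c.update(df['num'].tolist())

# Keep only the values seen at least twice.
c_masked = {k: v for k, v in c.items() if v >= 2}

# Build a frame with explicit column names instead of the default 0.
out = pd.DataFrame(list(c_masked.items()), columns=['num', 'count'])
print(out)
```

The rows come out in first-seen order, because dictionaries and Counters preserve insertion order.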

Here is another way to get the desired result, using groupby and size:

import pandas as pd

d1 = {'num': [23,34,89,90], 'type': ['a', 'b', 'a', 'c']}
d2 = {'num': [23,34,56,90], 'type': ['a', 'b', 'a', 'c']}
d3 = {'num': [56,34,71,90], 'type': ['a', 's', 'a', 'c']}

df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df3 = pd.DataFrame(data=d3)

df10 = pd.concat([df1,df2,df3], axis=0)
# Group by 'num' and 'type', then use size() to count the rows in each group.
# reset_index(name='count') names the size column 'count'.
df20 = df10.groupby(['num','type']).size().reset_index(name='count')

# Keep the rows with 'count' >= 2 and store them in df_out.
df_out = df20[df20['count'] >= 2].reset_index(drop=True)
print(df_out)

The output looks like:

   num type  count
0   23    a      2
1   34    b      2
2   56    a      2
3   90    c      3

For reference, the intermediate df20:

print(df20)
   num type  count
0   23    a      2
1   34    b      2
2   34    s      1
3   56    a      2
4   71    a      1
5   89    a      1
6   90    c      3
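
One thing to watch in the result above: because the grouping key includes 'type', 34 is split into (34, b) and (34, s), so it is reported with a count of 2 rather than the 3 the question asks for. A possible variation is to group on 'num' alone and pick one 'type' per group. The sketch below takes the last 'type' seen, which is an assumption inferred from the question's example output (34 paired with s); the question does not say which 'type' should win.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({'num': [23, 34, 89, 90], 'type': ['a', 'b', 'a', 'c']})
df2 = pd.DataFrame({'num': [23, 34, 56, 90], 'type': ['a', 'b', 'a', 'c']})
df3 = pd.DataFrame({'num': [56, 34, 71, 90], 'type': ['a', 's', 'a', 'c']})

df_all = pd.concat([df1, df2, df3], ignore_index=True)

# Group on 'num' only. For each group keep the last 'type' seen
# (an assumption, see above) and count the rows.
per_num = (
    df_all.groupby('num')
    .agg(type=('type', 'last'), count=('type', 'count'))
    .reset_index()
)

# Keep the values that occur at least twice.
out = per_num[per_num['count'] >= 2].reset_index(drop=True)
print(out)
```

Swapping 'last' for 'first' would keep the 'type' from the earliest frame instead.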
