
How to compare two dataframes in pandas?

I have a dataframe like this:

import pandas as pd

df = pd.DataFrame([[1,'aaa',50],[0,'aaa',1000],[0,'aba',30],[1,'aaa',50],[1,'aba',10]],
                  columns=['A','B','C'])
df



    A   B   C
0   1   aaa 50
1   0   aaa 1000
2   0   aba 30
3   1   aaa 50
4   1   aba 10

For each item in 'B' (items can repeat), I want to check its value in 'A'. If it is 1, the values in 'C' for that item should be summed. If it is 0, the rows for that item whose 'A' value is zero should be counted. The final result is then sum / count.

In the end, I want to show the result like this:

    ID  Value
0   aaa 100
1   aba 10

For example, 'aaa' has two rows with A = 1, whose C values sum to 50 + 50 = 100, and one row with A = 0, so its count is 1. The result is therefore 100 / 1 = 100.

How can I do this efficiently? I tried using groupby() to get the sum and the count in separate dataframes, but I don't know how to compare them to get this result.

Try a groupby aggregate on columns A and B, computing both the sum and the size of column C. Then divide the A == 1 'sum' by the A == 0 'count':

new_df = df.groupby(['A', 'B']).aggregate(sum=('C', 'sum'), count=('C', 'size'))
new_df = (new_df.loc[1, 'sum'] / new_df.loc[0, 'count']).reset_index()
new_df.columns = ['ID', 'Value']  # Rename Columns

new_df:

    ID  Value
0  aaa  100.0
1  aba   10.0

*Beware of division by 0: it is possible that some group has 0 entries for a given B value.
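A minimal sketch of how that edge case might be handled. The data below is made up so that 'abc' has no A == 0 rows, and dropping such groups is only one possible choice, not something the original answer prescribes. With a plain groupby, the missing group simply does not appear among the A == 0 counts, so the division yields NaN for it instead of raising an error.

```python
import pandas as pd

# Hypothetical data: 'abc' has a row with A == 1 but none with A == 0.
df = pd.DataFrame(
    [[1, 'aaa', 50], [0, 'aaa', 1000], [1, 'abc', 10]],
    columns=['A', 'B', 'C'],
)

sums = df[df['A'] == 1].groupby('B')['C'].sum()   # sum of C where A == 1
counts = df[df['A'] == 0].groupby('B').size()     # number of rows where A == 0

# Division aligns on the 'B' labels; 'abc' is missing from counts, so it becomes NaN.
ratio = sums / counts

# One option (an assumption about what is wanted): drop groups without a valid count.
result = ratio.dropna().rename_axis('ID').reset_index(name='Value')
print(result)
```

If a default value is preferable to dropping the group, `ratio.fillna(...)` could be used in place of `dropna()`.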

In [90]: df[df['A'] == 1].groupby('B')['C'].sum() /  df[df['A'] == 0].groupby('B').size()
Out[90]:
B
aaa    100.0
aba     10.0
dtype: float64

This takes care of dividing correctly because, as a result of the grouping, both series are indexed by column 'B'.
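The label alignment this relies on can be seen in isolation with a small sketch. The two Series here are built by hand (they are illustrative stand-ins for the grouped results, with the labels deliberately in a different order):

```python
import pandas as pd

# Two Series with the same labels, deliberately in different order.
sums = pd.Series({'aba': 10, 'aaa': 100})
counts = pd.Series({'aaa': 1, 'aba': 1})

# Arithmetic between Series matches values by index label, not by position.
ratio = sums / counts
print(ratio.sort_index())
```

Each label is divided by its own counterpart, so 'aaa' gives 100 / 1 and 'aba' gives 10 / 1 regardless of row order.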

You can do a groupby and select the right group:

import pandas as pd

df_grouped = df.groupby(['A', 'B']).sum().loc[1]

       C
B
aaa  100
aba   10
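This gives only the A == 1 sums. One way the same approach might be extended to the sum / count result asked for in the question is sketched below; the variable names and the final renaming step are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'aaa', 50], [0, 'aaa', 1000], [0, 'aba', 30], [1, 'aaa', 50], [1, 'aba', 10]],
    columns=['A', 'B', 'C'],
)

grouped = df.groupby(['A', 'B'])['C']
sums = grouped.sum().loc[1]      # sum of C per B where A == 1
counts = grouped.size().loc[0]   # number of rows per B where A == 0

# Both Series are indexed by 'B', so the division lines up by label.
result = (sums / counts).rename_axis('ID').reset_index(name='Value')
print(result)
```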

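Notice: Technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.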

 