How to use groupby on a single column and perform comparisons for multiple columns in Pandas?

I have a dataframe of users, a flag for whether each user signed up, and the model's prediction of whether they signed up. Per user, I want to find the TP (they signed up and the model predicted they did), FP (they didn't sign up but the model predicted they did), FN (they signed up but the model predicted they didn't), and TN (they didn't sign up and the model predicted they didn't). Here 1 means they signed up and 0 means they did not. I want to group by user and then perform comparisons using the other two columns. For example, I might have something like the following:

Users    Signed_up    Prediction
User1    1            0
User2    0            0
User1    1            1
User3    1            1
User2    0            1
User2    0            0
...

For TP, the resulting table might look something like:

Users    TP
User1    1
User2    0
User3    1

For TN, the resulting table might look something like:

Users    TN
User1    0
User2    1
User3    0

and so on for FP and FN.

I assume I should group by the Users column and use a lambda function to compare the Signed_up and Prediction columns, but I am not sure how to actually do this. I would appreciate any help!

Do the comparisons before the groupby, then group by Users and sum:

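# Build one indicator column per outcome, then count them per user.
(df.assign(TP = df.Signed_up & df.Prediction,
           TN = (df.Signed_up == 0) & (df.Prediction == 0),
           FN = df.Signed_up & (df.Prediction == 0),
           FP = (df.Signed_up == 0) & df.Prediction)
   # Select the columns with a list; recent pandas versions reject ['TP', 'TN', 'FN', 'FP'].
   .groupby('Users')[['TP', 'TN', 'FN', 'FP']].sum())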

       TP   TN   FN   FP
Users                   
User1   1  0.0  1.0  0.0
User2   0  2.0  0.0  1.0
User3   1  0.0  0.0  0.0
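
For anyone who wants to run this end to end, here is a minimal sketch of the same compare-then-sum idea. It assumes the six sample rows from the question are the whole dataset, and the names `actual`, `predicted`, `flags`, and `counts` are my own. Casting each indicator to int is an extra step not in the answer above; it is there only so that every column sums to an integer.

```python
import pandas as pd

# Sample rows copied from the question (assumed to be the full data).
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

# Boolean views of the two columns make the four outcomes easy to read.
actual = df["Signed_up"].eq(1)
predicted = df["Prediction"].eq(1)

# One 0/1 indicator column per outcome.
flags = pd.DataFrame({
    "Users": df["Users"],
    "TP": (actual & predicted).astype(int),
    "TN": (~actual & ~predicted).astype(int),
    "FN": (actual & ~predicted).astype(int),
    "FP": (~actual & predicted).astype(int),
})

# Summing the indicators per user gives the per-user counts.
counts = flags.groupby("Users")[["TP", "TN", "FN", "FP"]].sum()
print(counts)
```

On the sample rows this should give the same counts as the table above, just as integers.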

Inspired by @BrianJoseph, with much less typing, you could group by all 3 columns, take the size of each group, and unstack every level except the users:

df.groupby([*df]).size().unstack([1,2]).fillna(0)

Signed_up     1         0     
Prediction    0    1    0    1
Users                         
User1       1.0  1.0  0.0  0.0
User2       0.0  0.0  2.0  1.0
User3       0.0  1.0  0.0  0.0
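
The unstacked result is keyed by (Signed_up, Prediction) pairs rather than by TP/TN/FN/FP. A possible way to relabel it is sketched below. The `labels` mapping and the reindex step are my additions, not part of the answer above, and the sample rows from the question are assumed to be the whole dataset.

```python
import pandas as pd

# Sample rows copied from the question (assumed to be the full data).
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

keys = ["Signed_up", "Prediction"]

# Count rows per (user, actual, predicted) and move the last two levels to columns.
counts = df.groupby(["Users"] + keys).size().unstack(keys, fill_value=0)

# Hypothetical mapping from (Signed_up, Prediction) to the usual outcome names.
labels = {(1, 1): "TP", (0, 0): "TN", (1, 0): "FN", (0, 1): "FP"}
all_combos = pd.MultiIndex.from_tuples(list(labels), names=keys)

# Reindex so all four outcomes are present even if one never occurs in the data.
counts = counts.reindex(columns=all_combos, fill_value=0)
counts.columns = [labels[combo] for combo in all_combos]
print(counts)
```

Passing `fill_value=0` to `unstack` keeps the counts as integers, so the separate `fillna(0)` is not needed in this variant.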

Remember that pandas can group by the result of a function. To distinguish these 4 classes of results, you only need to know the relationship between Signed_up and Prediction. You can classify them like this:

grps = df.groupby(lambda index: (df.loc[index, 'Signed_up'], df.loc[index, 'Prediction']))

This only gives you the groupby object; from there you can pull out each group and name it as you like, for example:

tp_df = grps.get_group((1,1))
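
As a rough sketch of how this could be carried through to per-user counts: the version below groups on the two columns directly instead of through a lambda (a substitution of mine; it should produce the same four groups), and the final `reindex` step is my own addition so that users with no true positives still appear with a count of 0. The sample rows from the question are assumed to be the whole dataset.

```python
import pandas as pd

# Sample rows copied from the question (assumed to be the full data).
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

# Group rows by their (Signed_up, Prediction) pair.
grps = df.groupby(["Signed_up", "Prediction"])

# (1, 1) means signed up and predicted to sign up, i.e. the true positives.
tp_df = grps.get_group((1, 1))

# Count true-positive rows per user, keeping users that have none.
tp_per_user = (
    tp_df.groupby("Users").size()
    .reindex(df["Users"].unique(), fill_value=0)
)
print(tp_per_user)
```

Note that `get_group` raises a `KeyError` if the requested combination never occurs, so real data may need a check before calling it.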

If you want a separate DataFrame for each outcome, which is what your post seems to ask for, you can use boolean masking with the bitwise & operator. & requires both conditions to be met for a row to be returned, so:

import pandas as pd

df = pd.read_csv('./Desktop/models.csv')

TP = df.loc[(df['Signed_up'] == 1) & (df['Prediction'] == 1)]

TN = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 0)]

FN = df.loc[(df['Signed_up'] == 1) & (df['Prediction'] == 0)]

FP = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 1)]

Output:

>>> TP
   Users  Signed_up  Prediction
2  User1          1           1
3  User3          1           1
>>> TN = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 0)]
>>> TN
   Users  Signed_up  Prediction
1  User2          0           0
5  User2          0           0
>>> FN = df.loc[(df['Signed_up'] == 1) & (df['Prediction'] == 0)]
>>> FN
   Users  Signed_up  Prediction
0  User1          1           0
>>> FP = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 1)]
>>> FP
   Users  Signed_up  Prediction
4  User2          0           1
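
These masks return the matching rows, not per-user counts. One way to get from a mask to the per-user table in the question is to sum the boolean mask grouped by the Users column, as in the sketch below. It builds the sample rows inline rather than reading the CSV, and the names `tp_mask` and `tp_per_user` are my own.

```python
import pandas as pd

# Sample rows copied from the question (assumed to be the full data).
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

# Same condition as the TP mask above, kept as a boolean Series.
tp_mask = (df["Signed_up"] == 1) & (df["Prediction"] == 1)

# True counts as 1 when summed, so this yields the TP count per user,
# including a 0 for users with no true positives.
tp_per_user = tp_mask.groupby(df["Users"]).sum()
print(tp_per_user)
```

The same pattern should work for the TN, FN, and FP masks by swapping the condition.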
