How to use groupby on a single column and perform comparisons for multiple columns in Pandas?
I have a dataframe of users, whether or not they have signed up, and the model's prediction for whether or not they have signed up. I want to find per user: the TP (they signed up and the model predicted they did), FP (they didn't sign up but the model predicted they did), FN (they signed up but the model predicted no), and TN (they didn't sign up and the model predicted no). Here 1 means they signed up and 0 means they did not.
I want to groupby on users, and then perform comparisons using the other two columns. For example, I might have something like the following:
Users | Signed_up | Prediction |
User1 1 0
User2 0 0
User1 1 1
User3 1 1
User2 0 1
User2 0 0
...
For TP, the resulting table might look something like:
Users | TP |
User1 1
User2 0
User3 1
For TN, the resulting table might look something like:
Users | TN |
User1 0
User2 1
User3 0
and so on for FP and FN.
I am assuming I groupby on the Users column and use a lambda function to compare the Signed_up and Prediction columns, but I am not sure how to actually do this. I would appreciate any help!
Do the comparison before you groupby, and then groupby + sum:
(df.assign(TP = df.Signed_up & df.Prediction,
           TN = (df.Signed_up == 0) & (df.Prediction == 0),
           FN = df.Signed_up & (df.Prediction == 0),
           FP = (df.Signed_up == 0) & df.Prediction)
   # pass a list of column names; newer pandas versions no longer accept a bare tuple here
   .groupby('Users')[['TP', 'TN', 'FN', 'FP']].sum())
TP TN FN FP
Users
User1 1 0.0 1.0 0.0
User2 0 2.0 0.0 1.0
User3 1 0.0 0.0 0.0
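For anyone who wants to try the compare-then-groupby idea without a CSV file, here is a minimal, self-contained sketch. It assumes the six sample rows from the question are the whole dataset, and the helper names `actual`, `predicted`, and `counts` are made up for illustration. It also compares against 1 explicitly so that all four new columns are boolean, which is a small variation on the snippet above rather than the answerer's exact code.

```python
import pandas as pd

# Sample rows copied from the table in the question (assumed to be the full data).
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

# Boolean masks for "really signed up" and "predicted to sign up".
actual = df["Signed_up"] == 1
predicted = df["Prediction"] == 1

# One boolean column per outcome, then sum the True values per user.
counts = (
    df.assign(
        TP=actual & predicted,
        TN=~actual & ~predicted,
        FN=actual & ~predicted,
        FP=~actual & predicted,
    )
    .groupby("Users")[["TP", "TN", "FN", "FP"]]
    .sum()
)

print(counts)
```

Summing boolean columns counts the rows where each condition holds, so every cell is the number of times that outcome occurred for that user.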
Inspired by @BrianJoseph, with much less typing, you could groupby all 3 columns, determine the size, and unstack everything but the users:
df.groupby([*df]).size().unstack([1,2]).fillna(0)
Signed_up 1 0
Prediction 0 1 0 1
Users
User1 1.0 1.0 0.0 0.0
User2 0.0 0.0 2.0 1.0
User3 0.0 1.0 0.0 0.0
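The unstacked result above is keyed by (Signed_up, Prediction) pairs, which can be hard to read at a glance. One possible way to get named TP/TN/FN/FP columns is sketched below. The `labels` mapping, the `outcome` Series, and the sample frame are assumptions added for illustration; they are not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the table in the question (assumed to be the full data).
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

# Hypothetical mapping from (Signed_up, Prediction) to an outcome name.
labels = {(1, 1): "TP", (0, 0): "TN", (1, 0): "FN", (0, 1): "FP"}

# Label every row with its outcome, keeping the original index.
outcome = pd.Series(
    [labels[pair] for pair in zip(df["Signed_up"], df["Prediction"])],
    index=df.index,
    name="Outcome",
)

# Count rows per (user, outcome), then pivot the outcome level into columns.
table = (
    df.groupby(["Users", outcome])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=["TP", "TN", "FN", "FP"], fill_value=0)
)

print(table)
```

The final `reindex` fixes the column order and also guarantees a column of zeros for any outcome that never occurs in the data.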
Remember that pandas can groupby using function results. In order to distinguish these 4 classes of results you just need to know the relationship between Signed_up and Prediction. You can classify them like this:
grps = df.groupby(lambda index: (df.loc[index, 'Signed_up'], df.loc[index, 'Prediction']))
This just gives you the groupby object, and you are free to pull out and name individual groups, like:
tp_df = grps.get_group((1,1))
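As a rough sketch of how a group like this could be turned into the per-user table the question asks for: the snippet below groups on the two columns directly (which should yield the same four groups as the lambda version) and then counts rows per user. The sample frame and the name `tp_per_user` are assumptions for illustration.

```python
import pandas as pd

# Sample rows copied from the table in the question (assumed to be the full data).
df = pd.DataFrame({
    "Users": ["User1", "User2", "User1", "User3", "User2", "User2"],
    "Signed_up": [1, 0, 1, 1, 0, 0],
    "Prediction": [0, 0, 1, 1, 1, 0],
})

# Group on the two outcome columns; group keys are (Signed_up, Prediction) tuples.
grps = df.groupby(["Signed_up", "Prediction"])

# Rows where the user signed up and the model predicted a sign-up.
tp_df = grps.get_group((1, 1))

# Count TP rows per user, filling in 0 for users that have none.
tp_per_user = (
    tp_df["Users"]
    .value_counts()
    .reindex(df["Users"].unique(), fill_value=0)
)

print(tp_per_user)
```

The `reindex` step matters because users with no rows in the group would otherwise be missing from the counts instead of showing 0.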
If you want a separate DataFrame for each outcome, which is what your post seems to ask for, you can do this using boolean masking and the & bitwise operator. & means that both conditions must be met for a row to be returned, so:
import pandas as pd

df = pd.read_csv('./Desktop/models.csv')
# each mask keeps only the rows where both conditions hold
TP = df.loc[(df['Signed_up'] == 1) & (df['Prediction'] == 1)]
TN = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 0)]
FN = df.loc[(df['Signed_up'] == 1) & (df['Prediction'] == 0)]
FP = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 1)]
output:
>>> TP
Users Signed_up Prediction
2 User1 1 1
3 User3 1 1
>>> TN = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 0)]
>>> TN
Users Signed_up Prediction
1 User2 0 0
5 User2 0 0
>>> FN = df.loc[(df['Signed_up'] == 1) & (df['Prediction'] == 0)]
>>> FN
Users Signed_up Prediction
0 User1 1 0
>>> FP = df.loc[(df['Signed_up'] == 0) & (df['Prediction'] == 1)]
>>> FP
Users Signed_up Prediction
4 User2 0 1