Pandas equivalent for calculating the false positive rate

Question

I have a pandas dataframe df from an ML classifier with the fields userid, classifier_score, and truth. I want to calculate the false positive rate per userid at a threshold of 0.62. The classifier_score values in the data range from 0.1999 to 0.89. Right now, I use a series of conditions to create a new column col that states whether the relationship between truth and classifier_score is a false positive, false negative, true positive, or true negative:

df['col'] = df.apply(condition, axis=1)
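The question does not show the condition function. As a rough sketch of what the labeling step might look like without a row-wise apply, the version below uses numpy.select on boolean masks. The helper name label_outcomes is hypothetical, the strict > comparison at the threshold is an assumption, and the four rows are the sample rows from the question's table.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.62  # threshold stated in the question


def label_outcomes(df, threshold=THRESHOLD):
    """Hypothetical helper: label each row as 'TP', 'FP', 'TN' or 'FN'."""
    # Assumption: a score strictly above the threshold counts as a positive prediction
    predicted_pos = (df["classifier_score"] > threshold).to_numpy()
    actual_pos = (df["truth"] == 1).to_numpy()
    labels = np.select(
        [
            predicted_pos & actual_pos,    # predicted positive, actually positive
            predicted_pos & ~actual_pos,   # predicted positive, actually negative
            ~predicted_pos & actual_pos,   # predicted negative, actually positive
        ],
        ["TP", "FP", "FN"],
        default="TN",                      # predicted negative, actually negative
    )
    return pd.Series(labels, index=df.index, name="col")


# Sample rows taken from the question's table
df = pd.DataFrame({
    "userid": ["0001", "0001", "0001", "0064"],
    "classifier_score": [0.6721, 0.2918, 0.1236, 0.7168],
    "truth": [1, 1, 0, 0],
})
df["col"] = label_outcomes(df)
```

Working on whole columns avoids calling a Python function once per row, which tends to matter on larger frames.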

Then I store the unique userids in a list:

unique_users = df.userid.unique().tolist()

Then I loop through each one to calculate the false positive rate:

fpr_dict = {}
for uid in unique_users:
    fn, tn, fp, tp = 0, 0, 0, 0
    # Outcome labels for this user only
    elems = df.loc[df.userid == uid, 'col'].tolist()
    for elem in elems:
        if elem == 'FN': fn += 1
        elif elem == 'FP': fp += 1
        elif elem == 'TP': tp += 1
        elif elem == 'TN': tn += 1
    try:
        fpr = fp / (fp + tn)
    except ZeroDivisionError:
        # This user has no negative examples
        fpr = 0.0
    fpr_dict[uid] = fpr

Is there a better way of doing this with just pandas functions? Note: I initialize fn, tn, fp, tp to 0 because some user ids might not have all four outcomes; each will have some combination of them.

Edit: Dataframe

userid | classifier_score | truth | col
0001   | 0.6721           | 1     | TP
0001   | 0.2918           | 1     | FN
0001   | 0.1236           | 0     | TN
...
0064   | 0.7168           | 0     | FP

Answer 1

I didn't test it with an actual dataframe, but maybe try this:

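th = 0.62
predicted_pos = df['classifier_score'] > th

# Flag false positives before grouping so the groupby sees the new column
df['fp'] = predicted_pos & (df['truth'] == 0)

userid_group = df.groupby('userid', sort=False)

# Number of rows per user
userid_count = userid_group.size()

# Note: this divides by all rows per user, not by fp + tn
fpr = userid_group['fp'].sum() / userid_count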

If you want a dictionary, you can add dict(fpr) at the end.
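As a small illustration of that conversion, Series.to_dict() gives the same mapping as dict(fpr), keyed by userid. The rates below are made-up toy values, not results from the question's data.

```python
import pandas as pd

# Toy per-user rates (illustrative values only)
fpr = pd.Series([0.25, 0.0], index=pd.Index(["0001", "0064"], name="userid"))

# Map each userid to its rate
fpr_dict = fpr.to_dict()
```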

Edit: As the OP pointed out, fpr = fp / (fp + tn), so the calculation should be:

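th = 0.62
predicted_pos = df['classifier_score'] > th
actual_neg = df['truth'] == 0

# Flag each row before grouping so the groupby sees the new columns
df['fp'] = predicted_pos & actual_neg
# A true negative is an actual negative that was also predicted negative
df['tn'] = ~predicted_pos & actual_neg

userid_group = df.groupby('userid', sort=False)

fp = userid_group['fp'].sum()
tn = userid_group['tn'].sum()
# Users with no negatives give 0/0 = NaN; map those to 0.0 as in the question
fpr = (fp / (fp + tn)).fillna(0.0)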
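If the col labels from the question already exist, another option might be to count them per user with pd.crosstab. This is only a sketch on made-up toy rows: the userid 0099 and its label are invented to show a user with no negatives, and the reindex step is there because, as the question notes, not every outcome appears for every user.

```python
import pandas as pd

# Toy data (illustrative only); user 0099 has no negative examples
df = pd.DataFrame({
    "userid": ["0001", "0001", "0001", "0064", "0064", "0099"],
    "col":    ["TP",   "FN",   "TN",   "FP",   "TN",   "TP"],
})

# One row per user, one column per outcome; missing outcomes become 0
counts = (
    pd.crosstab(df["userid"], df["col"])
    .reindex(columns=["TP", "FP", "TN", "FN"], fill_value=0)
)

# False positive rate = FP / (FP + TN); 0/0 yields NaN, mapped to 0.0
negatives = counts["FP"] + counts["TN"]
fpr = (counts["FP"] / negatives).fillna(0.0)
```

Mapping the no-negatives case to 0.0 mirrors the ZeroDivisionError branch in the question; leaving it as NaN may be preferable if those users should be distinguishable later.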

Solution 1
1 accepted, 2020-06-27 20:54:50
