Creating multiple boolean columns in a pandas DataFrame based on multiple conditions

Question

I have a dataset in which authors are ranked by author order (1, 2, 3, etc.).

Authorid    Author  Article Articleid   Rank
1            John   article 1   1        1
1            John   article 2   2        2
1            John   article 3   3        3
1            John   article 4   4        3
2            Mary   article 5   5        1
2            Mary   article 6   6        2
2            Mary   article 7   7        1
2            Mary   article 8   8        8

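I want to create three more Boolean columns: If_first, If_second, and If_last. The goal is to show whether the author's rank on an article is first, second, or last. "Last" means the largest value in the Rank column for that Authorid.

If_first and If_second are easy, but I'm not sure how to handle If_last.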

df.loc[df['Rank'] == 1, 'If_first'] = 1
df.loc[df['Rank'] != 1, 'If_first'] = 0
df.loc[df['Rank'] == 2, 'If_second'] = 1
df.loc[df['Rank'] != 2, 'If_second'] = 0

There are two rules:

When If_first and If_last coincide, treat the row as If_first.
When If_second and If_last coincide, treat the row as If_second.

Expected output:

Authorid    Author  Article Articleid   Rank    If_first    If_second   If_last
1            John   article 1   1        1       1              0         0
1            John   article 2   2        2       0              1         0
1            John   article 3   3        3       0              0         1 (third is the last here)
2            Mary   article 5   5        1       1              0         0
2            Mary   article 6   6        2       0              1         0
2            Mary   article 7   7        3       0              0         0 (third is not the last here, because of the fourth below, all zeros)
2            Mary   article 8   8        4       0              0         1 (fourth is the last here)

Answer 1

Try this:

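# Use a clean 0..n-1 index so the labels collected below identify rows.
df = df.reset_index(drop=True)
# Per author, collect three row labels: the lowest rank, the smallest rank left
# after dropping duplicate ranks and skipping the group's first row, and the highest rank.
res = df.groupby('Authorid')['Rank'].apply(lambda x: [x.idxmin(), x.drop_duplicates()[1:].nsmallest(1).index[0], x.idxmax()])

# Start with all flags at 0, then set 1 on the collected rows.
df[['If_first', 'If_second', 'If_last']] = 0
df.loc[res.str[0].tolist(), 'If_first'] = 1
df.loc[res.str[1].tolist(), 'If_second'] = 1
df.loc[res.str[2].tolist(), 'If_last'] = 1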

Output:

>>> df
  Authorid   Author  Article  Articleid  Rank  If_first  If_second  If_last
0     John  article        1          1     1         1          0        0
1     John  article        2          2     2         0          1        0
2     John  article        3          3     3         0          0        1
3     John  article        4          4     3         0          0        0
4     Mary  article        5          5     1         1          0        0
5     Mary  article        6          6     2         0          1        0
6     Mary  article        7          7     1         0          0        0
7     Mary  article        8          8     8         0          0        1
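Note that idxmin and idxmax return a single label per group, so when ranks tie (John's two rank-3 rows, Mary's two rank-1 rows) only the first occurrence is flagged. If every tied row should be flagged instead, one possible variant is sketched below. It is not the answer's method: it assumes ranks are positive integers, rebuilds the question's table by hand, and applies the question's two rules by letting first and second take precedence over last.

```python
import pandas as pd

# Sample data typed in by hand from the question's table (an assumption
# about how the real data is loaded).
df = pd.DataFrame({
    "Authorid": [1, 1, 1, 1, 2, 2, 2, 2],
    "Author": ["John"] * 4 + ["Mary"] * 4,
    "Articleid": [1, 2, 3, 4, 5, 6, 7, 8],
    "Rank": [1, 2, 3, 3, 1, 2, 1, 8],
})

# Largest Rank per author, broadcast back onto every row of that author.
max_rank = df.groupby("Authorid")["Rank"].transform("max")

df["If_first"] = (df["Rank"] == 1).astype(int)
df["If_second"] = (df["Rank"] == 2).astype(int)
# A row counts as last only when it is not already first or second
# (assumes ranks start at 1, so "Rank > 2" excludes both cases).
df["If_last"] = ((df["Rank"] == max_rank) & (df["Rank"] > 2)).astype(int)

print(df)
```

With this variant both of John's rank-3 rows get If_last = 1 and both of Mary's rank-1 rows get If_first = 1.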

Answer 2

One approach could be to create a second dataframe, grouped by Articleid, that collects the statistics you are interested in:

df2 = df.groupby('Articleid').agg(mxrank=('Rank', 'max'))

Then add the new column by merging the dataframes:

dfm = df.merge(df2, how='left', on='Articleid')

Example result (with a few rows added to demonstrate an article, "article4", that has several ranks):

   Authorid Author   Article Articleid Rank mxrank
0         1   John  article1         1    1      1
1         1   John  article2         2    2      2
2         1   John  article3         3    3      3
3         1   John  article4         4    3      4
4         1    Foo  article4         4    1      4
5         1    Bar  article4         4    2      4
6         1    Baz  article4         4    4      4
7         2   Mary  article5         5    1      1
8         2   Mary  article6         6    2      2
9         2   Mary  article7         7    1      1
10        2   Mary  article8         8    8      8

Then compare the mxrank column with Rank to determine the flags for each row.
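That comparison step might look like the sketch below. It rebuilds the example rows by hand, keeps only the columns the comparison needs, and assumes ranks are positive integers so that "Rank > 2" is enough to let If_first and If_second win over If_last, as the question's two rules require.

```python
import pandas as pd

# Rows from the example table, typed in by hand and limited to the
# columns needed for the comparison.
df = pd.DataFrame({
    "Author": ["John", "John", "John", "John", "Foo", "Bar", "Baz",
               "Mary", "Mary", "Mary", "Mary"],
    "Articleid": [1, 2, 3, 4, 4, 4, 4, 5, 6, 7, 8],
    "Rank": [1, 2, 3, 3, 1, 2, 4, 1, 2, 1, 8],
})

# Highest rank per article, then attach it to every row of that article.
df2 = df.groupby("Articleid").agg(mxrank=("Rank", "max"))
dfm = df.merge(df2, how="left", on="Articleid")

# Compare Rank against fixed values and against the per-article maximum.
dfm["If_first"] = (dfm["Rank"] == 1).astype(int)
dfm["If_second"] = (dfm["Rank"] == 2).astype(int)
# Last only counts when the row is not already first or second.
dfm["If_last"] = ((dfm["Rank"] == dfm["mxrank"]) & (dfm["Rank"] > 2)).astype(int)

print(dfm)
```

Because this groups by Articleid, "last" here means the highest rank on the article. Grouping by Authorid instead would give the highest rank per author, which is how the question words it.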
