How, in Python, can I count unique values in a column over a gradually increasing number of rows within groups?
I am working in Python on a pandas DataFrame and am trying to count the unique values of a column within groups. My problem is that the count has to cover a steadily increasing number of rows within each group, and I also don't want NaNs to be counted.
Simplified, the data looks like this:
ID occup
1 NaN
1 A
1 NaN
1 NaN
1 B
2 K
2 NaN
2 L
2 L
2 M
The new column 'occupcount' should, within the groups defined by 'ID', count the number of unique values in 'occup'. However, in the first row of each group I want the count to consider only the first row of that group. In the second row, I want to count over the first two rows. In the fifth row, I want the count of unique values over all five rows within each group. It should look like this:
ID occup occupcount
1 NaN 0
1 A 1
1 NaN 1
1 B 2
1 A 2
2 K 1
2 NaN 1
2 L 2
2 K 2
2 M 3
I tried to solve the task with something like:
df['occupcount'] = (df.groupby(["ID"])['occup'].transform('nunique'))
But that only gives the total number of unique values over all rows within each group, with no gradual increase. Thanks in advance!
The idea is to build a mask that is True only for the first occurrence of each ('ID', 'occup') pair, using DataFrame.duplicated on both columns, chain it with a check for non-missing values, and then use GroupBy.cumsum:
# Mark the first non-missing occurrence of each (ID, occup) pair, then take a running sum per ID
df['occupcount'] = ((~df.duplicated(['ID','occup']) & df['occup'].notna())
                    .groupby(df['ID'])
                    .cumsum())
print(df)
ID occup occupcount
0 1 NaN 0
1 1 A 1
2 1 NaN 1
3 1 B 2
4 1 A 2
5 2 K 1
6 2 NaN 1
7 2 L 2
8 2 L 2
9 2 M 3
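For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same mask-and-cumsum approach. The sample frame is rebuilt from the expected output table in the question, and the helper names `expanding_nunique` and `brute_force` are made up for illustration; they are not pandas functions. The plain-Python loop is only there as a cross-check of the vectorised result.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the expected-output table in the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "occup": [np.nan, "A", np.nan, "B", "A", "K", np.nan, "L", "K", "M"],
})

def expanding_nunique(frame, group_col, value_col):
    """Running count of distinct non-missing values of value_col per group (hypothetical helper)."""
    # True only where a (group, value) pair appears for the first time and is not NaN
    first_seen = ~frame.duplicated([group_col, value_col]) & frame[value_col].notna()
    # Running total of those first appearances within each group
    return first_seen.astype(int).groupby(frame[group_col]).cumsum()

df["occupcount"] = expanding_nunique(df, "ID", "occup")

def brute_force(values):
    """Slow reference: grow a set of seen values row by row, skipping NaN (hypothetical helper)."""
    seen, counts = set(), []
    for v in values:
        if pd.notna(v):
            seen.add(v)
        counts.append(len(seen))
    return counts

# Cross-check the vectorised result against the loop, group by group
check = [c for _, g in df.groupby("ID", sort=False) for c in brute_force(g["occup"])]
print(df)
print(check == df["occupcount"].tolist())
```

The mask version avoids a Python-level loop, so it should scale better on large frames. It relies on DataFrame.duplicated flagging every repeat of an ('ID', 'occup') pair after the first one.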