How to count cumulative unique values by group?
How can I count cumulative unique values by group in Python?
Below is an example dataframe:
Group | Year | Type |
---|---|---|
A | 1998 | red |
A | 1998 | blue |
A | 2002 | red |
A | 2005 | blue |
A | 2008 | blue |
A | 2008 | yello |
B | 1998 | red |
B | 2001 | red |
B | 2003 | red |
C | 1996 | red |
C | 2002 | orange |
C | 2002 | red |
C | 2012 | blue |
C | 2012 | yello |
I need to create a new column, computed separately within each "Group". Its value should be the cumulative number of unique values of "Type", accumulated over "Year".
Below is the dataframe I want. For example:
(1) For Group A in 1998, I count the unique values of Type in 1998; there are two: red and blue.
(2) For Group A in 2002, I count the unique values of Type in 1998 and 2002; there are still two: red and blue.
(3) For Group A in 2008, I count the unique values of Type in 1998, 2002, 2005, and 2008; there are three: red, blue, and yellow.
Group | Year | Type | Want |
---|---|---|---|
A | 1998 | red | 2 |
A | 1998 | blue | 2 |
A | 2002 | red | 2 |
A | 2005 | blue | 2 |
A | 2008 | blue | 3 |
A | 2008 | yello | 3 |
B | 1998 | red | 1 |
B | 2001 | red | 1 |
B | 2003 | red | 1 |
C | 1996 | red | 1 |
C | 2002 | orange | 2 |
C | 2002 | red | 2 |
C | 2012 | blue | 4 |
C | 2012 | yello | 4 |
One more thing about this dataframe: not all groups have values in the same years. For example, Group A has two values each in 1998 and 2008 and one value each in 2002 and 2005, while Group B has values in 1998, 2001, and 2003.
How can I solve this? Thanks!
For each `Group`, append a new column `Want` holding the values you want:
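import pandas as pd

def f(df):
    # Per year: gather the types, concatenate the lists cumulatively, count distinct values
    want = df.groupby('Year')['Type'].agg(list).cumsum().apply(set).apply(len)
    want.name = 'Want'
    return df.merge(want, on='Year')  # attach the per-year count to every row of that year

df.groupby('Group', group_keys=False).apply(f).reset_index(drop=True)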
Result:
Group Year Type Want
0 A 1998 red 2
1 A 1998 blue 2
2 A 2002 red 2
3 A 2005 blue 2
4 A 2008 blue 3
5 A 2008 yello 3
6 B 1998 red 1
7 B 2001 red 1
8 B 2003 red 1
9 C 1996 red 1
10 C 2002 orange 2
11 C 2002 red 2
12 C 2012 blue 4
13 C 2012 yello 4
Notes:
I think using `.merge` here is efficient.
You can also use one `.apply` inside `f` instead of two chained ones to improve efficiency: `.apply(lambda x: len(set(x)))`
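The same counts can probably also be obtained without `.apply` or `.merge`. The following is only a sketch of a different technique, not the method from the answer above: it flags the first appearance of each `Type` within a `Group`, keeps a running total of those flags, and gives every row of a year the total reached at the end of that year. The dataframe is rebuilt from the question's table, and `is_new` and `running` are hypothetical helper column names introduced for illustration.

```python
import pandas as pd

# Example data copied from the question
df = pd.DataFrame({
    "Group": list("AAAAAA") + list("BBB") + list("CCCCC"),
    "Year": [1998, 1998, 2002, 2005, 2008, 2008,
             1998, 2001, 2003,
             1996, 2002, 2002, 2012, 2012],
    "Type": ["red", "blue", "red", "blue", "blue", "yello",
             "red", "red", "red",
             "red", "orange", "red", "blue", "yello"],
})

# Make sure earlier years come first within each group
out = df.sort_values(["Group", "Year"]).copy()

# 1 on the first appearance of a Type within its Group, 0 afterwards
out["is_new"] = (~out.duplicated(["Group", "Type"])).astype(int)

# Running number of distinct types seen so far, row by row
out["running"] = out.groupby("Group")["is_new"].cumsum()

# Every row of a (Group, Year) gets the count reached at the end of that year
out["Want"] = out.groupby(["Group", "Year"])["running"].transform("max")

out = out.drop(columns=["is_new", "running"])
print(out)
```

Taking the maximum per (Group, Year) is what makes both 1998 rows of Group A show 2, even though the first of them has only seen one type at that point.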