How to count cumulative unique values by group?

Question

How can I count cumulative unique values by group in Python?

Below is an example dataframe:

Group	Year	Type
A	1998	red
A	1998	blue
A	2002	red
A	2005	blue
A	2008	blue
A	2008	yello
B	1998	red
B	2001	red
B	2003	red
C	1996	red
C	2002	orange
C	2002	red
C	2012	blue
C	2012	yello

I need to create a new column that is computed separately for each value of column "Group". The value of this new column should be the cumulative number of unique values of column "Type", accumulated over column "Year".

Below is the dataframe I want. For example:
(1) For Group A in year 1998, I want to count the unique values of Type in 1998. There are two: red and blue.
(2) For Group A in year 2002, I want to count the unique values of Type in 1998 and 2002. There are again two: red and blue.
(3) For Group A in year 2008, I want to count the unique values of Type in 1998, 2002, 2005, and 2008. There are three: red, blue, and yellow.

Group	Year	Type	Want
A	1998	red	2
A	1998	blue	2
A	2002	red	2
A	2005	blue	2
A	2008	blue	3
A	2008	yello	3
B	1998	red	1
B	2001	red	1
B	2003	red	1
C	1996	red	1
C	2002	orange	2
C	2002	red	2
C	2012	blue	4
C	2012	yello	4

One more thing about this dataframe: not all groups have values in the same years. For example, Group A has two values in each of 1998 and 2008, and one value in each of 2002 and 2005. Group B has values in 1998, 2001, and 2003.

How can I solve this? Any help means a lot to me. Thanks!

Answer 1

For each Group, append a new column Want that holds the values you want:

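def f(df):
    # Per year: collect the types, concatenate the lists cumulatively, then count distinct values.
    want = df.groupby('Year')['Type'].agg(list).cumsum().apply(set).apply(len)
    want.name = 'Want'
    # Attach each year's cumulative count to every row of that year.
    return df.merge(want, on='Year')

# Run f on each group separately, then renumber the rows.
df.groupby('Group', group_keys=False).apply(f).reset_index(drop=True)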

Result:

   Group  Year    Type  Want
0      A  1998     red     2
1      A  1998    blue     2
2      A  2002     red     2
3      A  2005    blue     2
4      A  2008    blue     3
5      A  2008   yello     3
6      B  1998     red     1
7      B  2001     red     1
8      B  2003     red     1
9      C  1996     red     1
10     C  2002  orange     2
11     C  2002     red     2
12     C  2012    blue     4
13     C  2012   yello     4
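
For anyone who wants something to paste and run, here is a minimal self-contained sketch. It rebuilds the sample data from the question and computes the same Want column with a different technique from the answer above: it flags the first appearance of each (Group, Type) pair, keeps a running total of those flags within each group, and takes the largest total within each (Group, Year). The names result, is_new and running are my own, and the sketch assumes that sorting the rows by Group and Year is acceptable.

```python
import pandas as pd

# Sample data copied from the question ("yello" is spelled as in the original).
df = pd.DataFrame({
    'Group': list('AAAAAABBBCCCCC'),
    'Year': [1998, 1998, 2002, 2005, 2008, 2008,
             1998, 2001, 2003,
             1996, 2002, 2002, 2012, 2012],
    'Type': ['red', 'blue', 'red', 'blue', 'blue', 'yello',
             'red', 'red', 'red',
             'red', 'orange', 'red', 'blue', 'yello'],
})

# Process the years in order within each group.
result = df.sort_values(['Group', 'Year'])

# True on the first row where a (Group, Type) pair appears.
is_new = ~result.duplicated(['Group', 'Type'])

# Running number of distinct types seen so far within each group.
running = is_new.astype(int).groupby(result['Group']).cumsum()

# Every row of a year gets that year's final running count.
result['Want'] = running.groupby([result['Group'], result['Year']]).transform('max')

print(result)
```

This variant does not go through groupby(...).apply, so it does not depend on how a given pandas version treats the grouping column inside apply.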

Notes:

I think the use of .merge here is efficient.

You can also use a single .apply inside f instead of the two chained ones to improve efficiency: .apply(lambda x: len(set(x)))
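
A small sketch of that last note, to show that the two forms give the same counts. The Series below is a hand-written stand-in for the cumulative per-year lists that f builds for Group A before counting. The names cumulative, chained and single are illustrative only.

```python
import pandas as pd

# Hand-written stand-in for Group A's cumulative lists of types, indexed by year.
cumulative = pd.Series(
    [
        ['red', 'blue'],
        ['red', 'blue', 'red'],
        ['red', 'blue', 'red', 'blue'],
        ['red', 'blue', 'red', 'blue', 'blue', 'yello'],
    ],
    index=pd.Index([1998, 2002, 2005, 2008], name='Year'),
)

# Two passes: build a Series of sets, then take the length of each set.
chained = cumulative.apply(set).apply(len)

# One pass: deduplicate and count in the same call.
single = cumulative.apply(lambda types: len(set(types)))

print(chained.equals(single))
```

The single-pass form skips the intermediate Series of sets, which is where the saving would come from.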

Solution 1
2 2022-09-01 08:59:35
