Count the number of times a level occurs inside a cluster/group in a Python dataframe

Question

I have a dataframe with clusters. In this dataframe, I want to count the number of times a particular value occurs inside each cluster. For example:

import pandas as pd

data = {'cluster':['1001', '1001', '1001', '1002', '1002', '1002'],
        'attribute':['1', '2', '1', '1', '2', '2']}

df = pd.DataFrame(data)

df

I want to count how many times '1' occurs inside each cluster. I have tried using lambda functions: averaging inside the cluster works, but counting does not.

For averaging, I used:

df['newcol'] = df.groupby('cluster')['attribute'].transform(lambda x: x.mean())
df

Using the same approach, but with mean replaced by count:

df['newcol'] = df.groupby('cluster')['attribute'].transform(lambda x: x.count('2'))
df

This gives me the following error:

Error: 'Requested level (3) does not match index name (None)'

Ideally, I want to add the count as an additional column, which is why I am using the lambda function.

Please help me solve this. If any additional detail is required, or if I was not clear, I'd be happy to add information!

Edit

Thank you, @Rutger has provided what I was looking for. In short, I wanted to create a new column showing how many times each attribute occurs in a cluster. I also needed it to be generalizable, so that the count could be calculated for all the attributes.

On a separate note, my dataframe consists of around 600,000 rows. Is there a recommended way to take a chunk out of this dataset so that I can work on that instead? If there's a similar answer somewhere else, kindly point me towards it! Thank you!

Answer 1

There are many ways of doing it. I would go for a groupby on both columns and then look at how frequently each combination occurs. I assume this is not the most straightforward method, but I think it gives the result you are looking for.

df['count'] = df.set_index(['cluster', 'attribute']).index.map(df.groupby(['cluster', 'attribute']).size())
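For reference, here is a minimal runnable sketch of this approach on the sample data from the question. The intermediate name pair_sizes and the column count_alt are illustrative only, and the transform('size') variant is an assumed shorter equivalent, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'cluster': ['1001', '1001', '1001', '1002', '1002', '1002'],
    'attribute': ['1', '2', '1', '1', '2', '2'],
})

# Number of rows for every (cluster, attribute) pair,
# returned as a Series indexed by a two-level MultiIndex.
pair_sizes = df.groupby(['cluster', 'attribute']).size()

# Look up each row's (cluster, attribute) pair in that Series.
df['count'] = df.set_index(['cluster', 'attribute']).index.map(pair_sizes)

# Assumed equivalent: broadcast each group's size back onto its rows.
df['count_alt'] = (
    df.groupby(['cluster', 'attribute'])['attribute'].transform('size')
)

print(df)
```

Both columns hold, for each row, the number of rows sharing that row's cluster and attribute, so the count covers every attribute value at once.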

Answer 2

Since you want to add a column alongside the existing columns to show the number of 1's in a cluster (group), you can keep using .transform() as you are doing now.

Inside the .transform(), you can use a lambda function to check which elements equal '1' and take the sum() (instead of count) of the resulting True entries, as follows:

df['newcol'] = df.groupby('cluster')['attribute'].transform(lambda x: x.eq('1').sum())

Result:

print(df)


  cluster attribute   newcol
0    1001         1        2
1    1001         2        2
2    1001         1        2
3    1002         1        1
4    1002         2        1
5    1002         2        1
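If the count is needed for more than one attribute value, the same idea can be wrapped in a small helper. This is only a sketch: the function name count_in_cluster, its parameters, and the column names n_1 and n_2 are hypothetical. It builds the equality mask once and groups it by cluster, which avoids calling a Python lambda for each group.

```python
import pandas as pd

def count_in_cluster(df, value, group_col='cluster', value_col='attribute'):
    """Return, for every row, how many rows in its group equal `value`."""
    # Cast the boolean mask to int so the per-group sum is an integer count.
    mask = df[value_col].eq(value).astype(int)
    # Group the mask by the cluster column and broadcast each group's sum.
    return mask.groupby(df[group_col]).transform('sum')

df = pd.DataFrame({
    'cluster': ['1001', '1001', '1001', '1002', '1002', '1002'],
    'attribute': ['1', '2', '1', '1', '2', '2'],
})

df['n_1'] = count_in_cluster(df, '1')  # occurrences of '1' per cluster
df['n_2'] = count_in_cluster(df, '2')  # occurrences of '2' per cluster

print(df)
```

The n_1 column should match newcol in the result above.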

Answer 1
0 Accepted 2021-05-20 15:33:10

Answer 2
0 2021-05-20 16:16:23