
Count number of times a level occurs within a cluster/group in Python dataframe

I have a dataframe with clusters. In this dataframe, I want to count the number of times a particular value occurs inside a cluster. For example:

import pandas as pd

data = {'cluster': ['1001', '1001', '1001', '1002', '1002', '1002'],
        'attribute': ['1', '2', '1', '1', '2', '2']}

df = pd.DataFrame(data)
df

I want to count how many times '1' occurs inside each cluster. I tried using a lambda function with transform: averaging within each cluster works, but counting does not.

For averaging, I used:

df['newcol'] = df.groupby('cluster')['attribute'].transform(lambda x: x.mean())
df

Using the same approach, but with mean replaced by count:

df['newcol'] = df.groupby('cluster')['attribute'].transform(lambda x: x.count('2'))
df

Gives me this error:

Error: 'Requested level (3) does not match index name (None)'

Ideally, I want to add the count as an additional column, which is why I am using the lambda function.

Please help me solve this. If any additional detail is required or if I was not clear, I'd be happy to add information!

Edit

Thank you, @Rutger has provided what I was looking for. In short, I wanted to create a new column showing how many times each row's attribute occurs in its cluster. I also needed it to be generalizable, so that the counts could be calculated for all attributes.

On a separate note, my dataframe consists of around 600,000 rows. Is there a recommended way to perhaps take a chunk out of this dataset so that I could do my work on that? If there's a similar answer somewhere else, kindly point me towards the same! Thank you!

There are many ways of doing it. I would group by both columns and then look up how often each (cluster, attribute) pair occurs. This may not be the most straightforward method, but I think it gives the result you are looking for.

df['count'] = df.set_index(['cluster', 'attribute']).index.map(df.groupby(['cluster', 'attribute']).size())
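The same per-pair count can probably also be written with .transform(), which avoids the set_index/map round trip. This is only a sketch on the sample data from the question, not necessarily what the answerer had in mind:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'cluster': ['1001', '1001', '1001', '1002', '1002', '1002'],
    'attribute': ['1', '2', '1', '1', '2', '2'],
})

# For every row, count how many rows share its (cluster, attribute) pair.
# 'size' counts rows per group; transform broadcasts that back to each row.
df['count'] = df.groupby(['cluster', 'attribute'])['attribute'].transform('size')
print(df)
```

Each row then carries the number of times its own attribute value appears in its cluster, so it generalizes to all attribute values at once.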

Since you want to add a column alongside the existing columns to show the number of 1's in each cluster (group), you can keep using .transform() as you are doing now.

Inside .transform(), you can use a lambda function to check which elements equal '1' and take the sum() (instead of count) of the resulting True entries, as follows:

df['newcol'] = df.groupby('cluster')['attribute'].transform(lambda x: x.eq('1').sum())

Result:

print(df)


  cluster attribute   newcol
0    1001         1        2
1    1001         2        2
2    1001         1        2
3    1002         1        1
4    1002         2        1
5    1002         2        1
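
On the follow-up about working with a chunk of the roughly 600,000 rows: one option, assuming the per-cluster counts should stay meaningful, is to sample whole clusters rather than individual rows. The sketch below uses the sample data from the question; n_clusters and random_state are placeholder values chosen for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'cluster': ['1001', '1001', '1001', '1002', '1002', '1002'],
    'attribute': ['1', '2', '1', '1', '2', '2'],
})

# Hypothetical number of clusters to keep; pick something larger on real data.
n_clusters = 1

# Sample cluster ids, then keep every row belonging to the sampled clusters,
# so counts computed on the subset match those on the full data.
sampled_ids = df['cluster'].drop_duplicates().sample(n=n_clusters, random_state=0)
subset = df[df['cluster'].isin(sampled_ids)].copy()

subset['newcol'] = subset.groupby('cluster')['attribute'].transform(
    lambda x: x.eq('1').sum()
)
print(subset)
```

Sampling rows directly with df.sample(...) would be simpler, but it would split clusters and change the counts within them.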
