I have a dataframe with clusters. In this dataframe, I want to count the number of times a particular value occurs inside a cluster. For example:
import pandas as pd

data = {'cluster': ['1001', '1001', '1001', '1002', '1002', '1002'],
        'attribute': ['1', '2', '1', '1', '2', '2']}
df = pd.DataFrame(data)
df
I want to count how many times '1' occurs inside each cluster. I have tried using lambda functions: averaging within a cluster works, but counting does not.
For averaging, I used:
df['newcol'] = df.groupby('cluster')['attribute'].transform(lambda x: x.mean())
df
Using the same, but with mean replaced with count:
df['newcol'] = df.groupby('cluster')['attribute'].transform(lambda x: x.count('2'))
df
Gives me this error:
Error: 'Requested level (3) does not match index name (None)'
I ideally want to add the count as an additional column, hence am using the lambda function.
Please help me solve this. If any additional detail is required, or if I was not clear, I'd be happy to add information!
Edit
Thank you! @Rutger has provided what I was looking for. In short, I wanted to create a new column showing how many times each attribute occurs in a cluster. I also needed it to be generalizable, so that the count could be calculated for every attribute.
On a separate note, my dataframe consists of around 600,000 rows. Is there a recommended way to perhaps take a chunk out of this dataset so that I could do my work on that? If there's a similar answer somewhere else, kindly point me towards the same! Thank you!
There are many ways of doing it. I would group by both columns and then look at how frequently each (cluster, attribute) pair occurs. This is probably not the most straightforward method, but I think it gives the result you are looking for.
df['count'] = df.set_index(['cluster', 'attribute']).index.map(df.groupby(['cluster', 'attribute']).size())
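A possibly simpler way to get the same per-row count is to stay with transform and ask for the group size directly. This is only a sketch using the sample data from the question; the column name 'count' is just the one used above, and I have not tried it on the full 600,000-row dataframe.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'cluster': ['1001', '1001', '1001', '1002', '1002', '1002'],
    'attribute': ['1', '2', '1', '1', '2', '2'],
})

# Group by both columns; transform('size') broadcasts each group's
# row count back onto every row that belongs to that group.
df['count'] = df.groupby(['cluster', 'attribute'])['attribute'].transform('size')
print(df)
```

Each row then carries the number of times its own attribute value occurs within its cluster, so it covers every attribute at once rather than only '1'.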
Since you want to add a column alongside the existing columns to show the number of 1's in a cluster (group), you can keep using .transform() as you are doing now. Inside .transform(), you can use a lambda function to check which elements equal '1' and take the sum() (instead of count) of the resulting True entries, as follows:
df['newcol'] = df.groupby('cluster')['attribute'].transform(lambda x: x.eq('1').sum())
Result:
print(df)
  cluster attribute  newcol
0    1001         1       2
1    1001         2       2
2    1001         1       2
3    1002         1       1
4    1002         2       1
5    1002         2       1
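For background, Series.count() counts non-null entries and does not accept a value to match, which is why the comparison-and-sum form is needed. On a large dataframe the lambda can also be avoided by doing the comparison once on the whole column and then summing per cluster. The following is a sketch on the question's sample data, not a benchmarked recommendation:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'cluster': ['1001', '1001', '1001', '1002', '1002', '1002'],
    'attribute': ['1', '2', '1', '1', '2', '2'],
})

# Build a boolean mask once, then sum the True values within each cluster.
# Grouping the mask by the cluster column keeps the original row order.
is_one = df['attribute'].eq('1')
df['newcol'] = is_one.groupby(df['cluster']).transform('sum')
print(df)
```

This should give the same newcol values as the lambda version while using the built-in sum instead of calling a Python function for each group.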