Is there a function for getting the number of unique values in the dataframe in each group?
I have a dataframe with two columns: label and value. For each label group, I would like to count the values that occur only in that group and in no other group.
For example, given the following dataframe:
import pandas as pd

test_df = pd.DataFrame({
    'label': [1, 1, 1, 1, 2, 2, 3, 3, 3],
    'value': [0, 0, 1, 2, 1, 2, 2, 3, 4]})
test_df
   label  value
0      1      0
1      1      0
2      1      1
3      1      2
4      2      1
5      2      2
6      3      2
7      3      3
8      3      4
The expected output is:
   label  uni_val
0      1        1  -> {0} is the only value unique to this label compared to other labels
1      2        0  -> no values are unique to this label compared to other labels
2      3        2  -> {3, 4} are the values unique to this label compared to other labels
One way of doing this is to get the unique values for each label and then count those that are not duplicated across the labels.
test_df.groupby('label')['value'].unique()
label
1    [0, 1, 2]
2       [1, 2]
3    [2, 3, 4]
Name: value, dtype: object
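The approach described above could be completed roughly as follows. This is only a sketch: the names per_label, exclusive and uni_val are illustrative and not part of the original post, and it assumes a pandas version that provides Series.explode (0.25 or later).

```python
import pandas as pd

test_df = pd.DataFrame({
    'label': [1, 1, 1, 1, 2, 2, 3, 3, 3],
    'value': [0, 0, 1, 2, 1, 2, 2, 3, 4]})

# Unique values per label, then one row per (label, value) pair.
per_label = test_df.groupby('label')['value'].unique().explode()

# A value belongs to a single label only if it appears exactly once here.
exclusive = ~per_label.duplicated(keep=False)

# Summing the boolean mask per label counts the exclusive values;
# labels without any exclusive value naturally get 0.
uni_val = exclusive.groupby(level=0).sum()
print(uni_val)
```

Because the mask keeps one entry for every (label, value) pair, no reindexing is needed to recover labels whose count is zero.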
Is there a more efficient and simpler way?
You could drop duplicates on ['label', 'value'], then drop duplicates on value with keep=False, so that any value shared by more than one label is removed entirely:
(test_df.drop_duplicates(['label', 'value'])        # remove duplicates on pair (label, value)
    .drop_duplicates('value', keep=False)           # only keep unique `value`
    .groupby('label')['value'].count()              # count as usual
    .reindex(test_df.label.unique(), fill_value=0)  # fill missing labels with 0
)
Output:
label
1    1
2    0
3    2
Name: value, dtype: int64
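To make the three steps of the chain easier to follow, the same logic can be wrapped in a small helper. The function name count_exclusive_values and its label_col and value_col parameters are hypothetical and chosen only for illustration; the pandas calls are the ones used in the answer.

```python
import pandas as pd


def count_exclusive_values(df, label_col='label', value_col='value'):
    """Count, per label, the values that occur under no other label."""
    # Step 1: keep one row per distinct (label, value) pair.
    pairs = df.drop_duplicates([label_col, value_col])
    # Step 2: keep=False drops every value that two or more labels share.
    exclusive = pairs.drop_duplicates(value_col, keep=False)
    # Step 3: count the surviving values and restore empty labels as 0.
    return (exclusive.groupby(label_col)[value_col].count()
            .reindex(df[label_col].unique(), fill_value=0))


test_df = pd.DataFrame({
    'label': [1, 1, 1, 1, 2, 2, 3, 3, 3],
    'value': [0, 0, 1, 2, 1, 2, 2, 3, 4]})

result = count_exclusive_values(test_df)
print(result)
```

Step 1 matters because a value repeated within a single label (such as 0 under label 1) would otherwise be treated as a duplicate and dropped in step 2.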