Is there a function to get the number of unique values in a dataframe for each group?

Question

I have a dataframe with two columns: label and value. For each label group, I would like to count the values that occur only in that group and in no other.

For example, given the following dataframe:

import pandas as pd

test_df = pd.DataFrame({
    'label': [1, 1, 1, 1, 2, 2, 3, 3, 3],
    'value': [0, 0, 1, 2, 1, 2, 2, 3, 4]})
test_df

  label     value
0   1         0
1   1         0
2   1         1
3   1         2
4   2         1
5   2         2
6   3         2
7   3         3
8   3         4

The expected output is:

  label     uni_val
0   1         1 -> {0} is the only value that appears under this label and no other
1   2         0 -> no value appears under this label alone
2   3         2 -> {3, 4} appear under this label and no other

One way to do this is to get the unique values for each label, then count the values that are not duplicated across labels.

test_df.groupby('label')['value'].unique()

label
1    [0, 1, 2]
2       [1, 2]
3    [2, 3, 4]
Name: value, dtype: object
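
The second step (counting the non-duplicates) could be sketched roughly as follows. This is only one possible way to finish the approach, and it assumes pandas 0.25 or later for Series.explode; the names per_label, flat, exclusive and uni_val are made up for illustration.

```python
import pandas as pd

test_df = pd.DataFrame({
    'label': [1, 1, 1, 1, 2, 2, 3, 3, 3],
    'value': [0, 0, 1, 2, 1, 2, 2, 3, 4]})

# Step 1: unique values per label (a Series of arrays, indexed by label)
per_label = test_df.groupby('label')['value'].unique()

# Step 2: flatten to one row per (label, value) pair; the label stays in the index
flat = per_label.explode()

# True where a value shows up under exactly one label
exclusive = ~flat.duplicated(keep=False)

# Step 3: count the True entries per label (labels with none get 0)
uni_val = exclusive.groupby(level='label').sum()
print(uni_val)
```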

Is there a more efficient and simpler way?

Answer 1

You could drop duplicates on ['label', 'value'], then drop duplicates on value with keep=False so that only values belonging to a single label remain:

(test_df.drop_duplicates(['label','value'])         # remove duplicates on pair (label, value)
    .drop_duplicates('value', keep=False)           # only keep unique `value`
    .groupby('label')['value'].count()              # count as usual
    .reindex(test_df.label.unique(), fill_value=0)  # fill missing labels with 0
)

Output:

label
1    1
2    0
3    2
Name: value, dtype: int64
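
As a possible alternative (a sketch, not part of the original answer), the same counts can be obtained by first asking how many distinct labels each value appears under, and then keeping only the values that belong to exactly one label. The names labels_per_value and uni_val are illustrative.

```python
import pandas as pd

test_df = pd.DataFrame({
    'label': [1, 1, 1, 1, 2, 2, 3, 3, 3],
    'value': [0, 0, 1, 2, 1, 2, 2, 3, 4]})

# For every row: number of distinct labels its value appears under
labels_per_value = test_df.groupby('value')['label'].transform('nunique')

# Keep rows whose value belongs to exactly one label,
# count distinct values per label, and fill labels with no such value with 0
uni_val = (test_df[labels_per_value == 1]
           .groupby('label')['value'].nunique()
           .reindex(test_df['label'].unique(), fill_value=0))
print(uni_val)
```

Whether this is faster than the drop_duplicates chain depends on the data, so it is worth timing both on a realistic frame.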
