Get all distinct values from one column where another column has at least two distinct values for each value in the first column

Question

I have a very large dataset (20GB+), and I need to select all distinct values from column A for which column B contains at least two distinct values.

For the following dataframe:

| A | B |
|---|---|
| x | 1 |
| x | 2 |
| y | 1 |
| y | 1 |

This should return only x, because it has two distinct values in column B, while y has only one distinct value.

The following code does the trick, but it takes a very long time (hours), since the dataset is very large:

def get_values(list_of_distinct_values, dataframe):
    valid_values = []
    for value in list_of_distinct_values:
        value_df = dataframe.loc[dataframe['A'] == value]
        if len(value_df.groupby('B')) > 1:
            valid_values.append(value)
    return valid_values

Can anybody suggest a faster way of doing this?

Answer 1

I think you can solve your problem with the dataframe's drop_duplicates() method. You need to use the parameters subset and keep (with keep=False, to remove all rows that have duplicates):

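import pandas as pd
df = pd.DataFrame({
    'A': ["x", "x", "y", "y"],
    'B': [1, 2, 1, 1],
})
# keep=False drops every row whose (A, B) pair occurs more than once;
# the second call then keeps one row per remaining value of A.
# Caveat: a value of A that occurs in a single row is still returned,
# and one whose every (A, B) pair repeats is dropped.
result = df.drop_duplicates(subset=['A', 'B'], keep=False).drop_duplicates(subset=['A'])['A']
print(result.tolist())  # ['x']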
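
The requirement can also be checked directly by counting the distinct B values for each value of A. The following is a minimal sketch, not taken from the answer above: the function name values_with_multiple_b and the min_distinct parameter are hypothetical, and it assumes the data fits in memory as a single DataFrame. Unlike the drop_duplicates(keep=False) approach, it should also handle a value of A that occurs in only one row (z below), or one whose every (A, B) pair is repeated (w below).

```python
import pandas as pd


def values_with_multiple_b(df: pd.DataFrame, min_distinct: int = 2) -> list:
    """Return the values of A that have at least `min_distinct` distinct values of B.

    Hypothetical helper for illustration; column names 'A' and 'B' follow the question.
    """
    # Count distinct B values per value of A in one vectorized pass
    # (missing values in B are not counted by nunique's default).
    distinct_counts = df.groupby('A')['B'].nunique()
    # Keep only the A values that meet the threshold.
    return distinct_counts[distinct_counts >= min_distinct].index.tolist()


df = pd.DataFrame({
    'A': ["x", "x", "y", "y", "z", "w", "w", "w", "w"],
    'B': [1, 2, 1, 1, 5, 1, 1, 2, 2],
})
result = values_with_multiple_b(df)
print(result)  # ['w', 'x']
```

If the data does not fit in memory, one option is to read it in chunks, keep only the unique (A, B) pairs from each chunk, and run the same count on the concatenated pairs.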

Answer 1 (2020-09-28 20:05:28)
