I have a very large dataset (20GB+) and I need to select every distinct value in column A that has at least two distinct values in column B.
For the following dataframe:
| A | B |
|---|---|
| x | 1 |
| x | 2 |
| y | 1 |
| y | 1 |
The result should contain only x, because it has two distinct values in column B, while y has only one.
The following code does the trick, but it takes a very long time (as in hours) since the dataset is very large:
def get_values(list_of_distinct_values, dataframe):
    valid_values = []
    for value in list_of_distinct_values:
        value_df = dataframe.loc[dataframe['A'] == value]
        if len(value_df.groupby('B')) > 1:
            valid_values.append(value)
    return valid_values
Can anybody suggest a faster way of doing this?
I think you can solve your problem with the drop_duplicates() method of the dataframe. You need to use the subset and keep parameters (keep=False removes every row that has a duplicate):
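import pandas as pd

df = pd.DataFrame({
    'A': ["x", "x", "y", "y"],
    'B': [1, 2, 1, 1],
})

# keep=False drops every row whose (A, B) pair occurs more than once;
# the second call then keeps one row per remaining value of A.
result = df.drop_duplicates(subset=['A', 'B'], keep=False).drop_duplicates(subset=['A'])['A']
print(result.tolist())  # ['x']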
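One caveat with the drop_duplicates approach above: keep=False only removes rows whose (A, B) pair is repeated, so a value of A that appears in a single row would still be returned, and a value whose every (A, B) pair is repeated would be dropped even if it has two distinct B values. As a possible alternative, here is a minimal sketch that counts distinct B values per A with groupby and nunique. The extra "z" row is a hypothetical addition to the sample data, included only to show the single-row case.

```python
import pandas as pd

# Sample data from the question, plus a hypothetical "z" row that occurs once.
df = pd.DataFrame({
    "A": ["x", "x", "y", "y", "z"],
    "B": [1, 2, 1, 1, 3],
})

# Count the distinct B values for each value of A in one grouped pass,
# instead of filtering the dataframe once per distinct value.
distinct_counts = df.groupby("A")["B"].nunique()

# Keep the A values that have at least two distinct B values.
valid_values = distinct_counts[distinct_counts >= 2].index.tolist()
print(valid_values)
```

This avoids the Python-level loop over every distinct value, which is likely where most of the time in the original function goes. Whether it fits in memory for a 20GB+ dataset depends on how the data is loaded.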