I have a dataset with dates and company names. I only want to keep rows where the combination of company name and date appears in the dataset at least twice.
To illustrate the problem, assume I have the following dataframe:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['31/03/2017', 'Apple'], ['28/02/2017', 'IBM'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart'],
                             ['03/07/2017', 'WalMart']]),
                   columns=['date', 'keyword'])
My desired output would be:
df2 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart']]),
                   columns=['date', 'keyword'])
I know how to drop rows based on conditions in two columns, but I can't figure out how to drop rows based on how many times the combination of two values appears in the dataset.
Could anyone provide some insight?
Use DataFrame.duplicated, specifying the columns to check for duplicates and keep=False so that all duplicated rows are marked, then select them with boolean indexing:
df2 = df1[df1.duplicated(subset=['date','keyword'], keep=False)]
print (df2)
date keyword
0 28/02/2017 Apple
1 28/02/2017 Apple
4 28/02/2017 WalMart
5 28/02/2017 WalMart
If you need to specify the minimum number of rows, use GroupBy.transform with 'size' to count the rows of each group and broadcast the count back to every row:
df2 = df1[df1.groupby(['date','keyword'])['date'].transform('size') >= 2]
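For reference, a minimal self-contained run of this approach on the example df1 above (the transform broadcasts each group's size to its rows, so the mask keeps exactly the pairs that occur at least twice):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['31/03/2017', 'Apple'], ['28/02/2017', 'IBM'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart'],
                             ['03/07/2017', 'WalMart']]),
                   columns=['date', 'keyword'])

# Size of each (date, keyword) group, aligned back to the original rows
counts = df1.groupby(['date', 'keyword'])['date'].transform('size')

# Keep rows whose (date, keyword) pair occurs at least twice
df2 = df1[counts >= 2]
```

Raising the threshold (e.g. `counts >= 3`) is the advantage over duplicated, which is fixed at "more than once".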
For a small DataFrame, or when performance is not important, use GroupBy.filter:
df2 = df1.groupby(['date','keyword']).filter(lambda x: len(x) >= 2)
print (df2)
date keyword
0 28/02/2017 Apple
1 28/02/2017 Apple
4 28/02/2017 WalMart
5 28/02/2017 WalMart
Another option is GroupBy.apply, returning each group only when it has at least two rows:
df1.groupby(['date','keyword']).apply(lambda x: x if len(x) >= 2 else None).dropna()
Output
date keyword
0 28/02/2017 Apple
1 28/02/2017 Apple
4 28/02/2017 WalMart
5 28/02/2017 WalMart
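If you prefer to work with the counts explicitly, a merge-based variant is also possible. This is a sketch, not one of the answers above: it computes the group sizes once with GroupBy.size, keeps the qualifying (date, keyword) pairs, and joins them back to the rows. Note that merge produces a fresh 0..n-1 index rather than preserving the original one:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['31/03/2017', 'Apple'], ['28/02/2017', 'IBM'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart'],
                             ['03/07/2017', 'WalMart']]),
                   columns=['date', 'keyword'])

# One row per (date, keyword) pair with its occurrence count
sizes = df1.groupby(['date', 'keyword']).size().reset_index(name='n')

# Pairs that occur at least twice
pairs = sizes.loc[sizes['n'] >= 2, ['date', 'keyword']]

# Inner merge keeps only rows whose pair is in `pairs`
df2 = df1.merge(pairs, on=['date', 'keyword'])
```

This can be handy when you also want to keep the counts around for further filtering or reporting.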