Drop rows in pandas if records in two columns do not appear together at least twice in the dataset

I have a dataset with dates and company names. I only want to keep rows where the combination of company name and date appears in the dataset at least twice.

To illustrate the problem, let us assume I have the following dataframe:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'], ['31/03/2017', 'Apple'],
                             ['28/02/2017', 'IBM'], ['28/02/2017', 'WalMart'],
                             ['28/02/2017', 'WalMart'], ['03/07/2017', 'WalMart']]), columns=['date', 'keyword'])

My desired output would be:

df2 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['28/02/2017', 'WalMart'],
                             ['28/02/2017', 'WalMart']]), columns=['date', 'keyword'])

I know how to drop rows based on conditions in two columns, but I can't figure out how to drop rows based on how many times a combination of two values appears in the dataset.

Could anyone provide some insight?

Use DataFrame.duplicated, passing the columns to check in subset and keep=False so that every row of a repeated combination is flagged, then filter with boolean indexing:

df2 = df1[df1.duplicated(subset=['date','keyword'], keep=False)]
print (df2)
         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart
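
To make the role of keep=False more concrete, here is a small sketch that prints the boolean mask itself, next to the mask produced by the default keep='first'. The variable names mask_all and mask_first are only illustrative and are not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'], ['31/03/2017', 'Apple'],
     ['28/02/2017', 'IBM'], ['28/02/2017', 'WalMart'],
     ['28/02/2017', 'WalMart'], ['03/07/2017', 'WalMart']],
    columns=['date', 'keyword'],
)

# keep=False flags every row whose (date, keyword) pair occurs more than once
mask_all = df1.duplicated(subset=['date', 'keyword'], keep=False)

# the default keep='first' leaves the first occurrence of each pair unflagged
mask_first = df1.duplicated(subset=['date', 'keyword'])

print(mask_all.tolist())    # [True, True, False, False, True, True, False]
print(mask_first.tolist())  # [False, True, False, False, False, True, False]
```

With the default, df1[mask_first] would keep only rows 1 and 5, which is why keep=False is needed to get all four rows.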

If you need to specify the minimum number of rows, use GroupBy.transform with 'size' to get the row count of each group:

df2 = df1[df1.groupby(['date','keyword'])['date'].transform('size') >= 2]
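
As a rough sketch of how the threshold could be made configurable, the transform approach can be wrapped in a small helper. The function name keep_frequent and its min_count parameter are hypothetical and chosen only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'], ['31/03/2017', 'Apple'],
     ['28/02/2017', 'IBM'], ['28/02/2017', 'WalMart'],
     ['28/02/2017', 'WalMart'], ['03/07/2017', 'WalMart']],
    columns=['date', 'keyword'],
)


def keep_frequent(df, cols, min_count=2):
    """Hypothetical helper: keep rows whose combination of `cols` occurs at least `min_count` times."""
    # transform('size') broadcasts each group's row count back onto its rows
    sizes = df.groupby(cols)[cols[0]].transform('size')
    return df[sizes >= min_count]


# per-row group sizes for the sample data
group_size = df1.groupby(['date', 'keyword'])['date'].transform('size')
print(group_size.tolist())  # [2, 2, 1, 1, 2, 2, 1]

print(keep_frequent(df1, ['date', 'keyword'], min_count=2))
```

Raising min_count to 3 returns an empty frame for this sample, since no combination appears more than twice.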

If the DataFrame is small or performance is not important, use GroupBy.filter:

df2 = df1.groupby(['date','keyword']).filter(lambda x: len(x) >= 2)
print (df2)
         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart
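
Another option is GroupBy.apply, returning a group only when it has at least two rows. Depending on the pandas version, the result may carry the group keys in its index unless group_keys=False is passed to groupby:

df1.groupby(['date','keyword']).apply(lambda x: x if len(x) >= 2 else None).dropna()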

Output

         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart
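
As a quick sanity check, the following sketch compares the duplicated, transform and filter approaches on the sample data using DataFrame.equals. The names by_duplicated, by_transform and by_filter are illustrative only. The apply variant is left out because its index layout can differ between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'], ['31/03/2017', 'Apple'],
     ['28/02/2017', 'IBM'], ['28/02/2017', 'WalMart'],
     ['28/02/2017', 'WalMart'], ['03/07/2017', 'WalMart']],
    columns=['date', 'keyword'],
)
cols = ['date', 'keyword']

# approach 1: flag all rows of repeated combinations
by_duplicated = df1[df1.duplicated(subset=cols, keep=False)]

# approach 2: compare per-row group size against the threshold
by_transform = df1[df1.groupby(cols)['date'].transform('size') >= 2]

# approach 3: keep whole groups that satisfy the predicate
by_filter = df1.groupby(cols).filter(lambda g: len(g) >= 2)

# equals() checks values, index and column labels together
print(by_duplicated.equals(by_transform))
print(by_duplicated.equals(by_filter))
```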
