I have a dataset with dates and company names. I only want to keep rows where the combination of company name and date appears in the dataset at least twice.
To illustrate the problem, assume I have the following dataframe:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['31/03/2017', 'Apple'], ['28/02/2017', 'IBM'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart'],
                             ['03/07/2017', 'WalMart']]),
                   columns=['date', 'keyword'])
My desired output would be:
df2 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart']]),
                   columns=['date', 'keyword'])
I know how to drop rows based on conditions in two columns, but I can't figure out how to drop rows based on how many times the combination of two values appears in the dataset.
Could anyone provide some insight?
Use DataFrame.duplicated, specifying the columns to check for duplicates and keep=False so that all duplicated rows are marked, then select them with boolean indexing:
df2 = df1[df1.duplicated(subset=['date','keyword'], keep=False)]
print (df2)
date keyword
0 28/02/2017 Apple
1 28/02/2017 Apple
4 28/02/2017 WalMart
5 28/02/2017 WalMart
If you need to specify the minimum number of rows, use GroupBy.transform with 'size' to count the rows of each group and broadcast the count back to every row:
df2 = df1[df1.groupby(['date','keyword'])['date'].transform('size') >= 2]
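For reference, a minimal self-contained run of this approach on the example df1 above (the transform broadcasts each group's size to its rows, so the mask keeps exactly the pairs that occur at least twice):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['31/03/2017', 'Apple'], ['28/02/2017', 'IBM'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart'],
                             ['03/07/2017', 'WalMart']]),
                   columns=['date', 'keyword'])

# Size of each (date, keyword) group, aligned back to the original rows
counts = df1.groupby(['date', 'keyword'])['date'].transform('size')

# Keep rows whose (date, keyword) pair occurs at least twice
df2 = df1[counts >= 2]
```

Raising the threshold (e.g. `counts >= 3`) is the advantage over duplicated, which is fixed at "more than once".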
For a small DataFrame, or when performance is not important, use GroupBy.filter:
df2 = df1.groupby(['date','keyword']).filter(lambda x: len(x) >= 2)
print (df2)
date keyword
0 28/02/2017 Apple
1 28/02/2017 Apple
4 28/02/2017 WalMart
5 28/02/2017 WalMart
Another option is GroupBy.apply, returning each group only when it has at least two rows:
df1.groupby(['date','keyword']).apply(lambda x: x if len(x) >= 2 else None).dropna()
Output
date keyword
0 28/02/2017 Apple
1 28/02/2017 Apple
4 28/02/2017 WalMart
5 28/02/2017 WalMart
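If you prefer to work with the counts explicitly, a merge-based variant is also possible. This is a sketch, not one of the answers above: it computes the group sizes once with GroupBy.size, keeps the qualifying (date, keyword) pairs, and joins them back to the rows. Note that merge produces a fresh 0..n-1 index rather than preserving the original one:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['31/03/2017', 'Apple'], ['28/02/2017', 'IBM'],
                             ['28/02/2017', 'WalMart'], ['28/02/2017', 'WalMart'],
                             ['03/07/2017', 'WalMart']]),
                   columns=['date', 'keyword'])

# One row per (date, keyword) pair with its occurrence count
sizes = df1.groupby(['date', 'keyword']).size().reset_index(name='n')

# Pairs that occur at least twice
pairs = sizes.loc[sizes['n'] >= 2, ['date', 'keyword']]

# Inner merge keeps only rows whose pair is in `pairs`
df2 = df1.merge(pairs, on=['date', 'keyword'])
```

This can be handy when you also want to keep the counts around for further filtering or reporting.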