
Drop rows in pandas if records in two columns do not appear together at least twice in the dataset

I have a dataset with dates and company names. I only want to keep rows where the combination of company name and date appears in the dataset at least twice.

To illustrate the problem, assume I have the following DataFrame:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'], ['31/03/2017', 'Apple'],
                             ['28/02/2017', 'IBM'], ['28/02/2017', 'WalMart'],
                             ['28/02/2017', 'WalMart'], ['03/07/2017', 'WalMart']]), columns=['date', 'keyword'])

My desired output would be:

df2 = pd.DataFrame(np.array([['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'],
                             ['28/02/2017', 'WalMart'],
                             ['28/02/2017', 'WalMart']]), columns=['date', 'keyword'])

I know how to drop rows based on conditions in two columns, but I can't figure out how to drop rows based on how many times the combination of two values appears in the dataset.

Could anyone provide some insight?

Use DataFrame.duplicated, passing the columns to check via subset and keep=False so that every row of a repeated combination is flagged, then select those rows with boolean indexing:

df2 = df1[df1.duplicated(subset=['date','keyword'], keep=False)]
print (df2)
         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart
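
To make the role of keep=False more concrete, here is a minimal sketch that prints the boolean masks themselves. It rebuilds the question's df1 from a plain list of lists (an assumption made only to keep the snippet self-contained); mask_all and mask_first are illustrative names, not part of the original answer.

```python
import pandas as pd

# Same data as df1 in the question, built from a plain list of lists.
df1 = pd.DataFrame(
    [['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'], ['31/03/2017', 'Apple'],
     ['28/02/2017', 'IBM'], ['28/02/2017', 'WalMart'],
     ['28/02/2017', 'WalMart'], ['03/07/2017', 'WalMart']],
    columns=['date', 'keyword'])

# keep=False flags every row whose (date, keyword) pair occurs more than once.
mask_all = df1.duplicated(subset=['date', 'keyword'], keep=False)

# keep='first' (the default) leaves the first occurrence of each pair unflagged.
mask_first = df1.duplicated(subset=['date', 'keyword'], keep='first')

print(mask_all.tolist())    # [True, True, False, False, True, True, False]
print(mask_first.tolist())  # [False, True, False, False, False, True, False]
```

With the default keep='first', indexing df1 by the mask would return only one row per repeated pair, which is why keep=False is needed here.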

If you need to specify a different minimum number of rows, use GroupBy.transform with 'size' to count the rows in each group:

df2 = df1[df1.groupby(['date','keyword'])['date'].transform('size') >= 2]
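
The following sketch shows what transform('size') returns before the comparison is applied: one group size per original row, aligned with df1's index. The names min_count and counts are hypothetical, introduced only for illustration, and df1 is rebuilt from a plain list so the snippet runs on its own.

```python
import pandas as pd

# Same data as df1 in the question.
df1 = pd.DataFrame(
    [['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'], ['31/03/2017', 'Apple'],
     ['28/02/2017', 'IBM'], ['28/02/2017', 'WalMart'],
     ['28/02/2017', 'WalMart'], ['03/07/2017', 'WalMart']],
    columns=['date', 'keyword'])

min_count = 2  # hypothetical threshold; raise it to require more occurrences

# Size of each (date, keyword) group, broadcast back to every row of df1.
counts = df1.groupby(['date', 'keyword'])['date'].transform('size')
print(counts.tolist())     # [2, 2, 1, 1, 2, 2, 1]

df2 = df1[counts >= min_count]
print(df2.index.tolist())  # [0, 1, 4, 5]
```

For min_count = 2 this selects the same rows as duplicated with keep=False; the advantage of the count is that the threshold can be changed freely.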

For a small DataFrame, or when performance is not important, use GroupBy.filter:

df2 = df1.groupby(['date','keyword']).filter(lambda x: len(x) >= 2)
print (df2)
         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart

Alternatively, use GroupBy.apply and return None for groups with fewer than two rows:

df1.groupby(['date','keyword']).apply(lambda x: x if len(x) >= 2 else None).dropna()

Output

         date  keyword
0  28/02/2017    Apple
1  28/02/2017    Apple
4  28/02/2017  WalMart
5  28/02/2017  WalMart
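
As a quick sanity check, the duplicated, transform and filter approaches can be compared on the sample data. This is only a sketch: the variable names are made up for illustration, and the apply-based variant is left out because its result index can depend on the pandas version.

```python
import pandas as pd

# Same data as df1 in the question.
df1 = pd.DataFrame(
    [['28/02/2017', 'Apple'], ['28/02/2017', 'Apple'], ['31/03/2017', 'Apple'],
     ['28/02/2017', 'IBM'], ['28/02/2017', 'WalMart'],
     ['28/02/2017', 'WalMart'], ['03/07/2017', 'WalMart']],
    columns=['date', 'keyword'])

cols = ['date', 'keyword']

# The same selection expressed three ways.
by_duplicated = df1[df1.duplicated(subset=cols, keep=False)]
by_transform = df1[df1.groupby(cols)['date'].transform('size') >= 2]
by_filter = df1.groupby(cols).filter(lambda g: len(g) >= 2)

# DataFrame.equals compares values, dtypes and index labels.
print(by_duplicated.equals(by_transform))
print(by_duplicated.equals(by_filter))
```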

