
Remove duplicates from pandas dataframe while keeping majority element

I have a pandas DataFrame that looks like this:

   Cat  Date
1  A    2019-12-30
2  A    2019-12-30
3  A    2020-12-30
4  A    2020-01-06
5  A    2020-01-06
6  B    2020-01-06
7  B    2020-01-13
8  B    2020-01-13
9  A    2020-01-13
 .    .
 .    .
 .    .

There are duplicate dates in the Date column, and I want to "smush" down the DataFrame so that all the duplicate dates are removed. However, to determine what's in the "Cat" column after this "smushing", I want to pick the majority element of the dates that are being "smushed".

Thus, I want the output to be:

   Cat  Date
1  A    2019-12-30
2  A    2020-01-06
3  B    2020-01-13
 .    .
 .    .
 .    .

Efficiency is important; I want to be able to do this as quickly as possible, as my DataFrame is quite large (100k rows). There is a guarantee that the number of repeated dates will always be odd, and that the total number of different "Cat" letters can be at most 2, so there is no concern of ties.
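For reproducibility, the frame above (minus the elided trailing rows) can be built as:

```python
import pandas as pd

# Reconstruction of the sample frame from the question
# (the elided "." rows are omitted).
df = pd.DataFrame({
    "Cat": ["A", "A", "A", "A", "A", "B", "B", "B", "A"],
    "Date": ["2019-12-30", "2019-12-30", "2020-12-30", "2020-01-06",
             "2020-01-06", "2020-01-06", "2020-01-13", "2020-01-13",
             "2020-01-13"],
})
```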

Try value_counts to count all values after a groupby on the date column:

df.groupby("Date").agg(lambda x: x.value_counts().index[0])
#            Cat
# Date
# 2019-12-30   A
# 2020-01-06   A
# 2020-01-13   B
# 2020-12-30   A
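Note that the result above is indexed by Date; if you want Date back as an ordinary column, a reset_index can be chained on (a small usage sketch, with the sample frame rebuilt inline):

```python
import pandas as pd

df = pd.DataFrame({
    "Cat": ["A", "A", "A", "A", "A", "B", "B", "B", "A"],
    "Date": ["2019-12-30", "2019-12-30", "2020-12-30", "2020-01-06",
             "2020-01-06", "2020-01-06", "2020-01-13", "2020-01-13",
             "2020-01-13"],
})

# Same aggregation, with Date restored as a regular column.
result = (df.groupby("Date")
            .agg(lambda x: x.value_counts().index[0])
            .reset_index())
```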

Explanation:

  1. Split the DataFrame into groups according to Date using groupby.

  2. Apply an aggregation using agg. This function accepts a function to aggregate the groups.

  3. Define the aggregation function:

    3.1. Get the number of values per group using the value_counts function:

print(df.groupby("Date").agg(lambda x: x.value_counts()))
#                Cat
# Date
# 2019-12-30       2
# 2020-01-06  [2, 1]
# 2020-01-13  [2, 1]
# 2020-12-30       1

Note: the result of the value_counts method is a Series sorted by count in descending order.

3.2. However, we actually want the values and not the counts. The trick is to use the index of that Series.

print(df.groupby("Date").agg(lambda x: x.value_counts().index))
#                Cat
# Date
# 2019-12-30       A
# 2020-01-06  [A, B]
# 2020-01-13  [B, A]
# 2020-12-30       A

3.3. Finally, select the first value:

print(df.groupby("Date").agg(lambda x: x.value_counts().index[0]))
#            Cat
# Date
# 2019-12-30   A
# 2020-01-06   A
# 2020-01-13   B
# 2020-12-30   A
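A per-group Python lambda runs once for every date, which can get slow at 100k rows. A possible vectorized variant (a sketch, not part of the answer above) counts each (Date, Cat) pair with size, sorts by count, and keeps the most frequent pair per date:

```python
import pandas as pd

df = pd.DataFrame({
    "Cat": ["A", "A", "A", "A", "A", "B", "B", "B", "A"],
    "Date": ["2019-12-30", "2019-12-30", "2020-12-30", "2020-01-06",
             "2020-01-06", "2020-01-06", "2020-01-13", "2020-01-13",
             "2020-01-13"],
})

# Count each (Date, Cat) pair, then keep the most frequent Cat per Date.
counts = df.groupby(["Date", "Cat"]).size().reset_index(name="n")
majority = (counts.sort_values("n", ascending=False)
                  .drop_duplicates("Date")   # highest count wins per date
                  .drop(columns="n")
                  .sort_values("Date")
                  .reset_index(drop=True))
```

Since the question guarantees no ties, the row kept by drop_duplicates per date is always the true majority.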

Here's a simple solution:

def removeDuplicatesKeepBest(df):
    # Sort the data frame by category
    df = df.sort_values(by="Cat")
    # Look only at the Date column and keep only the first occurrence of each duplicate
    df = df.drop_duplicates(subset="Date", keep="first")
    return df

Hope this helps!
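One caveat worth checking on the sample data: after sorting by Cat, drop_duplicates keeps the alphabetically first category present for each date, which coincides with the majority only when that category is also the more frequent one. A quick check (sample frame rebuilt inline):

```python
import pandas as pd

def removeDuplicatesKeepBest(df):
    # Sort by category, then keep the first row per date.
    df = df.sort_values(by="Cat")
    df = df.drop_duplicates(subset="Date", keep="first")
    return df

df = pd.DataFrame({
    "Cat": ["A", "A", "A", "A", "A", "B", "B", "B", "A"],
    "Date": ["2019-12-30", "2019-12-30", "2020-12-30", "2020-01-06",
             "2020-01-06", "2020-01-06", "2020-01-13", "2020-01-13",
             "2020-01-13"],
})

res = removeDuplicatesKeepBest(df)
# Every date resolves to "A" here, because "A" sorts first --
# even though the majority for 2020-01-13 is actually "B".
```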

I'd consider the good old groupby:

df.groupby(["Cat", "Date"]).size()\
  .reset_index(name="to_drop")\
  .drop("to_drop", axis=1)

Alternatively, you can use drop_duplicates with two columns:

df.drop_duplicates(['Date',"Cat"])
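One thing to verify on the sample data: deduplicating on both columns keeps one row per (Cat, Date) pair, so a date that carries both categories still appears twice, and a majority-selection step would still be needed afterwards. A quick check (sample frame rebuilt inline):

```python
import pandas as pd

df = pd.DataFrame({
    "Cat": ["A", "A", "A", "A", "A", "B", "B", "B", "A"],
    "Date": ["2019-12-30", "2019-12-30", "2020-12-30", "2020-01-06",
             "2020-01-06", "2020-01-06", "2020-01-13", "2020-01-13",
             "2020-01-13"],
})

# One row survives per unique (Date, Cat) pair, not per date.
pairs = df.drop_duplicates(["Date", "Cat"])
```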

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you need to repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

 