Pandas remove duplicates with a double condition

Consider the following DataFrame:

    import pandas as pd

    df = pd.DataFrame({
        'ID': [1, 1, 1, 1, 2, 2, 2, 2],
        'Course': ['English', 'English', 'English', 'History',
                   'Science', 'Science', 'Science', 'Math'],
        'Status': ['Attended', 'Requested', 'Partially Attended', 'No show',
                   'Requested', 'Attended', 'Partially Attended', 'No show'],
    })
    # set_index returns a new DataFrame; df itself keeps ID as a regular column
    print(df.set_index('ID'))

         Course              Status
    ID
    1   English            Attended
    1   English           Requested
    1   English  Partially Attended
    1   History             No show
    2   Science           Requested
    2   Science            Attended
    2   Science  Partially Attended
    2      Math             No show

I'm trying to work out a way to remove duplicates based on the following three conditions.

  1. The ID occurs more than once.
  2. Where the ID occurs more than once, the course has to be the same (so 1, History and 2, Math are fine to stay).
  3. If a match is found, I only want to drop a row when the course has been completed and there is also a request; the row to drop is the one with the request. "No show" and "Partially Attended" rows are fine to keep.

I'm currently studying and taking DataCamp's Python and pandas courses, so I'm familiar with the groupby, aggregate, and sort functions, where I can drop the later or earlier duplicate in time-series data. I have no idea how to apply conditions or logic to the drop functions. I've searched this forum for similar questions, but I've not been able to apply anything to my own DataFrame.

My desired result is as follows:

         Course              Status
    ID
    1   English            Attended
    1   English  Partially Attended
    1   History             No show
    2   Science            Attended
    2   Science  Partially Attended
    2      Math             No show

Keep the rows that are not duplicated or not Requested:

df[~df.duplicated(['ID', 'Course'], keep=False) | df.Status.ne('Requested')]

    Course  ID              Status
0  English   1            Attended
2  English   1  Partially Attended
3  History   1             No show
5  Science   2            Attended
6  Science   2  Partially Attended
7     Math   2             No show

pandas.DataFrame.duplicated

Identifies whether rows are duplicates. I pass a list of column names to use when determining duplication. With keep=False, every member of a duplicated group is marked True, including the first and last occurrence.

df.duplicated(['ID', 'Course'], keep=False)

0     True
1     True
2     True
3    False
4     True
5     True
6     True
7    False
dtype: bool
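To make the effect of keep concrete, here is a minimal sketch that compares the three settings. The frame `toy` and its values are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 share the same (ID, Course) pair.
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "Course": ["English", "English", "History", "Math"],
})

# keep='first' (the default): the first occurrence is NOT flagged.
first = toy.duplicated(["ID", "Course"])
# keep='last': the last occurrence is NOT flagged.
last = toy.duplicated(["ID", "Course"], keep="last")
# keep=False: every member of a duplicated group is flagged.
every = toy.duplicated(["ID", "Course"], keep=False)

print(first.tolist())  # [False, True, False, False]
print(last.tolist())   # [True, False, False, False]
print(every.tolist())  # [True, True, False, False]
```

Only keep=False flags both English rows, which is what the filter needs: any row in a repeated (ID, Course) group should be a candidate for removal.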

However, if a row is a duplicate, also check whether its Status is Requested:

df.Status.ne('Requested')

0     True
1    False
2     True
3     True
4    False
5     True
6     True
7     True
Name: Status, dtype: bool

So we want the rows that are either not duplicates or, if they are duplicates, at least don't have Status equal to Requested.
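Note that this filter drops every Requested row in a repeated (ID, Course) group, whether or not that group also contains a completed course. If condition 3 of the question is read strictly, a Requested row should go only when the same group has an Attended row. The following is a rough sketch of that stricter reading, assuming "completed" means Status equal to 'Attended'; the names `group_has_attended` and `strict` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "Course": ["English", "English", "English", "History",
               "Science", "Science", "Science", "Math"],
    "Status": ["Attended", "Requested", "Partially Attended", "No show",
               "Requested", "Attended", "Partially Attended", "No show"],
})

# Assumption: a course counts as completed when Status == 'Attended'.
is_requested = df["Status"].eq("Requested")
is_attended = df["Status"].eq("Attended")

# True for every row whose (ID, Course) group has at least one 'Attended' row.
group_has_attended = (
    is_attended.groupby([df["ID"], df["Course"]]).transform("any")
)

# Drop a 'Requested' row only when its group also has an 'Attended' row.
strict = df[~(is_requested & group_has_attended)]
print(strict.index.tolist())  # [0, 2, 3, 5, 6, 7]
```

On the sample data both approaches keep the same rows. They would differ for a group that contains, say, only Requested and No show rows: the one-liner drops the Requested row there, while this sketch keeps it.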
