How to remove duplicates from a Pandas data frame based on a condition

Question

I am making a small counting program that counts the total number of people present inside a premises. The data frame I get from the microcontroller database (which lets people in and out) contains a human error: sometimes a user has an exit recorded before an entry. So there are cases in the data frame where one entry is followed by multiple exits before the next entry. The df looks something like this:

date     timestamp  type    cardno      status
**20201006  55737   PC010   117016056   Valid Card Exit**
20201006    55907   PC010   117016056   Valid Card Entry
20201006    60312   PC006   100024021   Valid Card Entry
20201006    61311   PC006   100024021   Valid Card Exit
20201006    61445   PC006   100024021   Valid Card Entry
20201006    61538   PC006   100024021   Valid Card Exit
20201006    61646   PC010   117016056   Valid Card Exit
20201006    61933   PC006   100024021   Valid Card Entry
20201006    61938   PC010   117016056   Valid Card Entry
20201006    62025   PC006   100024021   Valid Card Exit
20201006    62041   PC010   117016056   Valid Card Exit
20201006    62042   PC006   100024021   Valid Card Entry
20201006    62225   PC010   117016056   Valid Card Entry
20201006    62527   PC006   100024021   Valid Card Exit
20201006    63018   PC006   100024021   Valid Card Entry
20201006    64832   PC007   116057383   Valid Card Entry
20201006    64834   PC011   117016074   Valid Card Entry
**20201006  64952   PC012   116054003   Valid Card Exit**

The rows marked with ** are an employee hitting exit before an entry (for whatever reason), which messes up the count. I want to get rid of all such instances in the data frame, and I am having a really hard time doing this. The counting software I have made so far reads a Firebird database, builds several data frames from it, counts their rows, and displays the output as a simple HTML page on a big screen placed within the premises. The data frame with the issue described above is called 'contractorDf' in the program I am running in production (testing), shown below:

import subprocess
from datetime import datetime
from datetime import date
import pandas as pd
import re
import os
import sys
   
#------------------------------------------------------PRODUCTION-----------------------------------------#
# Generating a Temporary Date for Production Environment
tempDate = date(2020, 10, 6)
tempDate = str(tempDate)
tempDate = tempDate.replace('-', '')
#------------------------------------------------------PRODUCTION----------------------------------------#
   
################################################################################################################################
# Getting Current Day (This will be used in real environment)
currentDay = datetime.now().day

if currentDay < 10:
    currentDay = str(currentDay)
    currentDay = '0'+ currentDay
else:
    currentDay = str(currentDay)


# Getting Current Year & Month
currentYear = datetime.now().year
currentMonth = datetime.now().month
currentYear = str(currentYear)
currentMonth = str(currentMonth)
currentYearMonth = currentYear+currentMonth
currentYearMonthDay = currentYearMonth+currentDay

# Getting Variable for After FROM
currentTableName = 'ST'+currentYearMonth

# Getting Final Query (Commented Right now because Testing)
query = "SELECT * FROM " + currentTableName + " " + "WHERE TRDATE=" + currentYearMonthDay + ";"
finalQuery = bytes(query, 'ascii')
#############################################################################################################################


#-------------------------------------------------------PRODUCTION------------------------------------------------------#
# Making a temporary Table Name and Query for Production Environment
tempTableName = 'ST'+currentYearMonth
nonByteQuery = "SELECT * FROM " + tempTableName + " " + "WHERE TRDATE=" + tempDate + ";"
tempQuery = bytes(nonByteQuery, 'ascii')
#-------------------------------------------------------PRODUCTION------------------------------------------------------#



# Generating record.csv file from command prompt (Before initiating this, C:\\Program Files (x86)\\FireBird\\FireBird_2_1\\bin should be in the environment variables)
p = subprocess.Popen('isql', shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
p.stdin.write(b'CONNECT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\TRANS.fdb";') #The italicized b is because its a Byte size code and we can't 
p.stdin.write(b'OUTPUT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv";')
p.stdin.write(tempQuery)
p.stdin.write(b'OUTPUT;')
p.communicate()
p.terminate()
# Terminating the Command Prompt Window



# Reading the record file that is just generated above
tempdf = pd.read_csv('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv', sep='delimeter', engine='python', header=None, skipinitialspace=True)
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is rejected by recent pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)

#tempdf = tempdf[0].astype(str)
columns = ["TRDATE", "TRTIME", "TRCODE", "TRDESC", "CTRLTAG", "CTRLNAME", "CTRLIP", "CARDNO", "STAFFNO", "STAFFNAME", "DEPTNAME", "JOBNAME", "SHIFTNAME", "DEVTYPE", "DEVNAME", "DEVNO", "TRID", "ISCAP", "RCGROUP", "POLLTIME", "SENDSEQ", "RECSEQ", "IOBNO", "IOBNAME", "ZONENO", "ZONENAME", "POINTNO", "POINTNAME", "ISSNAPRET", "PROTRAG"]
header = tempdf.iloc[0]
linespace = tempdf.iloc[1]
header = str(header)
header = header[5:]
header = header[:-24]
linespace = str(linespace)
linespace = linespace[7:]
linespace = linespace[:-23]

tempdf = tempdf[~tempdf[0].str.contains(header)]
tempdf = tempdf[~tempdf[0].str.contains(linespace)]
tempdf = tempdf[0].str.replace(' ', ',')
df = tempdf.str.split(",", n=400, expand=True)
df = df[[0,1,7,8,9,10,31,41,42,43,52,53,54]]
df[100] = df[7].map(str) + ' ' + df[8].map(str) + ' ' + df[9].map(str) + ' ' + df[10].map(str)
df = df.drop([7,8,9,10], axis=1)
df[101] = df[31].map(str) + df[41].map(str)
df = df.drop([31,41], axis=1)
df[102] = df[43].map(str) + df[52].map(str) + df[53].map(str) + df[54].map(str)
df = df.drop([43,52,53,54], axis=1)

def newblock(column):
    if column[42].startswith('VIS'):
        return column[42]
    else:
        pass


df = df.assign(newblock=df.apply(newblock, axis=1))

df[42] = df[42].str.replace(r'VIS_\d\d\d\d\d\d\d\d\d\d', '', regex=True)

df[105] = df[42].map(str) + df[101].map(str)
df = df.drop([42,101], axis=1)
df[106] = df[102].map(str) + df['newblock'].map(str)
df = df.drop(['newblock', 102], axis=1)
df[106] = df[106].str.replace('None', '')
df = df[[0,1,106,105,100]]
columns = ['date', 'timestamp', 'type', 'cardno', 'status']
df.columns = df.columns.map(str)
df.columns = columns
df = df.reset_index()
df = df.drop(['index'], axis=1)




#Making Visitor Counter
visitorDf = df[df['type'].str.startswith('VIS')]
#visitorDf = visitorDf[~visitorDf['status'].str.contains('Unknown')]
visitorIn1 = len(visitorDf[visitorDf['status'].str.contains('Unknown')])
visitorIn1 = int(visitorIn1)
visitorDf = visitorDf.reset_index()
visitorDf = visitorDf.drop(('index'), axis=1)
visitorIn = len(visitorDf[visitorDf['status'].str.contains('Valid Card Entry')])
visitorOut = len(visitorDf[visitorDf['status'].str.contains('Valid Card Exit')])
visitorIn = int(visitorIn)
visitorOut = int(visitorOut)
totalVisitor = visitorIn1 + visitorIn - visitorOut

#Making Contractor Counter
contractorDf = df[df['type'].str.startswith('PC')]
#contractorDf = contractorDf[~contractorDf['status'].str.contains('Unknown')]
contractorIn1 = len(contractorDf[contractorDf['status'].str.contains('Unknown')])
contractorIn1 = int(contractorIn1)
contractorDf = contractorDf.reset_index()
contractorDf = contractorDf.drop(('index'), axis=1)
contractorIn = len(contractorDf[contractorDf['status'].str.contains('Valid Card Entry')])
contractorOut = len(contractorDf[contractorDf['status'].str.contains('Valid Card Exit')])
contractorIn = int(contractorIn)
contractorOut = int(contractorOut)
totalContractor = contractorIn1 + contractorIn - contractorOut


#Making Employee Counter
employeeDf = df[df['type'].str.contains(r'^\d', regex=True)]
#employeeDf = employeeDf[~employeeDf['status'].str.contains('Unknown')]
employeeIn1 = len(employeeDf[employeeDf['status'].str.contains('Unknown')])
employeeIn1 = int(employeeIn1)
employeeDf = employeeDf.reset_index()
employeeDf = employeeDf.drop(('index'), axis=1)
employeeIn = len(employeeDf[employeeDf['status'].str.contains('Valid Card Entry')])
employeeOut = len(employeeDf[employeeDf['status'].str.contains('Valid Card Exit')])
employeeIn = int(employeeIn)
employeeOut = int(employeeOut)
totalEmployee = employeeIn1 + employeeIn - employeeOut


os.remove('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv')

visitor = totalVisitor
employee = totalEmployee
contractor = totalContractor

if os.path.exists('C:\\Apache24\\htdocs\\counter\\index.html'):
    os.remove('c:\\Apache24\\htdocs\\counter\\index.html')
else:
    pass

f = open('C:\\Apache24\\htdocs\\counter\\index.html', 'w')

message = """
<html lang="en-US" class="hide-scroll">
    <head>
        <title>Emhart Counter</title>
        <meta charset="utf-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z" crossorigin="anonymous">
        <style>
        body {{
            background-color: lightblue;
        }}

        .verticalCenter {{
            margin: 0;
            top: 100%;
            -ms-transform: translateY(25%);
            transform: translateY(25%);
        }}
        </style>
    </head>
    <body>
        <center>
            <div class="verticalCenter">
                <h1 style=font-size:100px>VISITORS: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp; {visitor}</h1><br></br><br></br>
                <h1 style=font-size:100px>EMPLOYEES: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp; {employee}</h1><br></br><br></br>
                <h1 style=font-size:100px>CONTRACTORS: &emsp;&emsp;&emsp;&emsp;&emsp;&emsp; {contractor}</h1><br></br><br></br><br></br><br></br>
                <h3 style="font-size: 50px">THIS IS A TEST RUN</h3>
            </div>
        </center>
    </body>
</html>"""


new_message = message.format(visitor=visitor, employee=employee, contractor=contractor)
f.write(new_message)
f.close()


sys.exit()

The only problem left is how to get rid of exits for a cardno/type that occur before a corresponding entry in contractorDf. I would really appreciate any help on the matter.

Answer 1

For your example, startswith and endswith would work. For more complex regex patterns, use contains.

mask = df.date.str.startswith("**")
print(df[mask])

# or

mask = df.status.str.endswith("**")
print(df[mask])

Outputs:

         date timestamp   type     cardno             status
0  **20201006     55737  PC010  117016056  Valid_Card_Exit**
3  **20201006     64952  PC012  116054003  Valid_Card_Exit**

Setup:

columns = ['date','timestamp','type','cardno','status']
data = [el.split(",") for el in ['**20201006,55737,PC010,117016056,Valid_Card_Exit**',
'20201006,55907,PC010,117016056,Valid_Card_Entry',
'20201006,64834,PC011,117016074,Valid_Card_Entry',
'**20201006,64952,PC012,116054003,Valid_Card_Exit**']]
df = pd.DataFrame(data, columns=columns)
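
As a hedged illustration of the contains suggestion, the sketch below rebuilds the same four-row frame, matches the literal ** marker with an escaped regular expression, and inverts the mask with ~ to keep only the unmarked rows. The names `marked` and `cleaned` are made up for this example, and it assumes the ** markers are really present in the data, which may not hold for the asker's real data frame.

```python
import pandas as pd

columns = ['date', 'timestamp', 'type', 'cardno', 'status']
rows = [
    '**20201006,55737,PC010,117016056,Valid_Card_Exit**',
    '20201006,55907,PC010,117016056,Valid_Card_Entry',
    '20201006,64834,PC011,117016074,Valid_Card_Entry',
    '**20201006,64952,PC012,116054003,Valid_Card_Exit**',
]
df = pd.DataFrame([r.split(',') for r in rows], columns=columns)

# '*' is a regex metacharacter, so escape it to match a literal "**" prefix
marked = df['date'].str.contains(r'^\*\*', regex=True)

# ~ inverts the boolean mask: keep the rows that are NOT marked
cleaned = df[~marked].reset_index(drop=True)
print(cleaned)
```

Here `cleaned` holds only the two Valid_Card_Entry rows.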

Answer 2

The Cumsum Trick

The key to the problem is a common mathematical trick. Treat each entry as +1 and each exit as the cancellation of an entry, namely -1. An exit event is bad if it is the first to produce a negative cumulative sum (cumsum) up to that row, i.e. the exit cannot be interpreted as a proper cancellation of a previous entry. Note, however, that later negative cumsum values can be caused by earlier bad rows. Therefore, we flag ONLY the first negative cumsum value per card as bad.

Based on this observation, one can find the first bad exit for each card recursively, removing it and recomputing, until no negative cumsum value is produced.
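
To make the idea concrete, here is a small worked sketch for a single hypothetical card whose events are already in time order. The four sample events and the names `delta`, `running`, `first_bad` and `fixed` are assumptions for illustration only. Note that `idxmax` also returns the first label when no value is negative, so real code should check that a negative value exists first.

```python
import pandas as pd

# Events for one hypothetical card, already in time order
status = pd.Series([
    "Valid Card Exit",   # exit with nobody inside: bad
    "Valid Card Entry",
    "Valid Card Exit",
    "Valid Card Entry",
])

# Entry counts as +1, exit as -1
delta = status.eq("Valid Card Entry").map({True: 1, False: -1})
running = delta.cumsum()
print(running.tolist())  # [-1, 0, -1, 0]

# Label of the first negative running total = the first bad exit
first_bad = int((running < 0).idxmax())

# Drop that row and recompute: the total no longer goes below zero
fixed = delta.drop(index=first_bad).cumsum()
print(fixed.tolist())  # [1, 0, 1]
```

The second negative value in `running` disappears once the first bad exit is removed, which is why only the first negative row is treated as bad.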

Code

The implementation below demonstrates how to do this recursively. It is not optimized for large datasets, but the idea carries over.

# initialize
df["retain"] = True
df["delta"] = -1
df.loc[df["status"] == "Valid Card Entry", "delta"] = 1

def recurse(df):

    # sort for cumsum (bad values found were not retained)
    df_sorted = df[df["retain"]].sort_values(by=["cardno", "timestamp"]).reset_index(drop=True)

    # cumsum
    df_sorted["cumsum"] = df_sorted.groupby("cardno")["delta"].cumsum()

    # get the first occurrence of negative cumsum
    df_dup = df_sorted[df_sorted["cumsum"] < 0].groupby("cardno").first()

    # termination condition: no more bad values were found
    if len(df_dup) == 0:
        return

    # else, mark the bad rows as not retained
    for cardno, row in df_dup.iterrows():
        df.loc[(df["cardno"] == cardno) & (df["timestamp"] == row["timestamp"]), "retain"] = False

    # recurse until no negative cumsum remains
    recurse(df)

# execute    
recurse(df)

del df["delta"]  # optional cleanup

Output

See the "retain" column (False = bad exits).

df
Out[61]: 
        date  timestamp   type     cardno            status  retain
0   20201006      55737  PC010  117016056   Valid Card Exit   False
1   20201006      55907  PC010  117016056  Valid Card Entry    True
2   20201006      60312  PC006  100024021  Valid Card Entry    True
3   20201006      61311  PC006  100024021   Valid Card Exit    True
4   20201006      61445  PC006  100024021  Valid Card Entry    True
5   20201006      61538  PC006  100024021   Valid Card Exit    True
6   20201006      61646  PC010  117016056   Valid Card Exit    True
7   20201006      61933  PC006  100024021  Valid Card Entry    True
8   20201006      61938  PC010  117016056  Valid Card Entry    True
9   20201006      62025  PC006  100024021   Valid Card Exit    True
10  20201006      62041  PC010  117016056   Valid Card Exit    True
11  20201006      62042  PC006  100024021  Valid Card Entry    True
12  20201006      62225  PC010  117016056  Valid Card Entry    True
13  20201006      62527  PC006  100024021   Valid Card Exit    True
14  20201006      63018  PC006  100024021  Valid Card Entry    True
15  20201006      64832  PC007  116057383  Valid Card Entry    True
16  20201006      64834  PC011  117016074  Valid Card Entry    True
17  20201006      64952  PC012  116054003   Valid Card Exit   False
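
Once the "retain" column is filled in, the flagged rows still have to be dropped before counting. A minimal sketch, using a hand-made three-row frame in place of the real contractorDf (the sample rows and the names `cleaned`, `entries`, `exits` and `inside` are assumptions for illustration, and the 'Unknown' statuses counted in the question are left out):

```python
import pandas as pd

# Tiny stand-in for contractorDf after the flagging step
contractorDf = pd.DataFrame({
    "cardno": [117016056, 117016056, 116054003],
    "status": ["Valid Card Exit", "Valid Card Entry", "Valid Card Exit"],
    "retain": [False, True, False],
})

# Keep only rows marked as good, then discard the helper column
cleaned = contractorDf[contractorDf["retain"]].drop(columns="retain")

# Recount with exact matches instead of substring matching
entries = int((cleaned["status"] == "Valid Card Entry").sum())
exits = int((cleaned["status"] == "Valid Card Exit").sum())
inside = entries - exits
print(inside)  # 1
```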

For demonstration purposes, the cumsums before and after cleanup are shown below. The dataset is sorted by (cardno, timestamp), and the date column is omitted for clarity.

Before

df_sorted
Out[69]: 
    timestamp   type     cardno            status  retain  delta  cumsum
0       60312  PC006  100024021  Valid Card Entry    True      1       1
1       61311  PC006  100024021   Valid Card Exit    True     -1       0
2       61445  PC006  100024021  Valid Card Entry    True      1       1
3       61538  PC006  100024021   Valid Card Exit    True     -1       0
4       61933  PC006  100024021  Valid Card Entry    True      1       1
5       62025  PC006  100024021   Valid Card Exit    True     -1       0
6       62042  PC006  100024021  Valid Card Entry    True      1       1
7       62527  PC006  100024021   Valid Card Exit    True     -1       0
8       63018  PC006  100024021  Valid Card Entry    True      1       1
9       64952  PC012  116054003   Valid Card Exit    True     -1      -1
10      64832  PC007  116057383  Valid Card Entry    True      1       1
11      55737  PC010  117016056   Valid Card Exit    True     -1      -1
12      55907  PC010  117016056  Valid Card Entry    True      1       0
13      61646  PC010  117016056   Valid Card Exit    True     -1      -1
14      61938  PC010  117016056  Valid Card Entry    True      1       0
15      62041  PC010  117016056   Valid Card Exit    True     -1      -1
16      62225  PC010  117016056  Valid Card Entry    True      1       0
17      64834  PC011  117016074  Valid Card Entry    True      1       1

After

df_sorted
Out[73]: 
    timestamp   type     cardno            status  retain  delta  cumsum
0       60312  PC006  100024021  Valid Card Entry    True      1       1
1       61311  PC006  100024021   Valid Card Exit    True     -1       0
2       61445  PC006  100024021  Valid Card Entry    True      1       1
3       61538  PC006  100024021   Valid Card Exit    True     -1       0
4       61933  PC006  100024021  Valid Card Entry    True      1       1
5       62025  PC006  100024021   Valid Card Exit    True     -1       0
6       62042  PC006  100024021  Valid Card Entry    True      1       1
7       62527  PC006  100024021   Valid Card Exit    True     -1       0
8       63018  PC006  100024021  Valid Card Entry    True      1       1
9       64832  PC007  116057383  Valid Card Entry    True      1       1
10      55907  PC010  117016056  Valid Card Entry    True      1       1
11      61646  PC010  117016056   Valid Card Exit    True     -1       0
12      61938  PC010  117016056  Valid Card Entry    True      1       1
13      62041  PC010  117016056   Valid Card Exit    True     -1       0
14      62225  PC010  117016056  Valid Card Entry    True      1       1
15      64834  PC011  117016074  Valid Card Entry    True      1       1
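
For larger datasets, the same rule can probably be applied in a single pass instead of recomputing the cumsum after each removal: keep a per-card count of unmatched entries and flag any exit that arrives while that count is zero. This is a sketch of an alternative, not the method used in the answer above. The function name `flag_unmatched_exits` and the sample rows (including the extra exit at 61700) are hypothetical, timestamps are assumed to be numeric so that sorting is chronological, and rows with any other status are simply left alone.

```python
import pandas as pd

def flag_unmatched_exits(df):
    """Return a boolean Series: False for exits with no open entry."""
    ordered = df.sort_values(["cardno", "timestamp"])
    retain = pd.Series(True, index=df.index)
    inside = {}  # cardno -> number of currently unmatched entries
    for idx, cardno, status in zip(ordered.index, ordered["cardno"], ordered["status"]):
        open_entries = inside.get(cardno, 0)
        if status == "Valid Card Entry":
            inside[cardno] = open_entries + 1
        elif status == "Valid Card Exit":
            if open_entries == 0:
                retain.at[idx] = False  # exit without a prior entry
            else:
                inside[cardno] = open_entries - 1
    return retain

df = pd.DataFrame({
    "timestamp": [55737, 55907, 61646, 61700, 64952],
    "cardno": [117016056, 117016056, 117016056, 117016056, 116054003],
    "status": ["Valid Card Exit", "Valid Card Entry", "Valid Card Exit",
               "Valid Card Exit", "Valid Card Exit"],
})
df["retain"] = flag_unmatched_exits(df)
print(df["retain"].tolist())  # [False, True, True, False, False]
```

Each row is visited once after sorting, so the cost is dominated by the sort rather than by repeated group-wise cumsums.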
