How to remove duplicates in Pandas Data frame based on a Condition
I am writing a small counting program that counts the total number of people present inside a premises. The data frame I get from the microcontroller database (the controller that lets people in and out) contains a human error: sometimes a user registers an exit before an entry. So there are instances in the data frame where one entry is followed by multiple exits before the next entry. The df looks something like this:
date timestamp type cardno status
**20201006 55737 PC010 117016056 Valid Card Exit**
20201006 55907 PC010 117016056 Valid Card Entry
20201006 60312 PC006 100024021 Valid Card Entry
20201006 61311 PC006 100024021 Valid Card Exit
20201006 61445 PC006 100024021 Valid Card Entry
20201006 61538 PC006 100024021 Valid Card Exit
20201006 61646 PC010 117016056 Valid Card Exit
20201006 61933 PC006 100024021 Valid Card Entry
20201006 61938 PC010 117016056 Valid Card Entry
20201006 62025 PC006 100024021 Valid Card Exit
20201006 62041 PC010 117016056 Valid Card Exit
20201006 62042 PC006 100024021 Valid Card Entry
20201006 62225 PC010 117016056 Valid Card Entry
20201006 62527 PC006 100024021 Valid Card Exit
20201006 63018 PC006 100024021 Valid Card Entry
20201006 64832 PC007 116057383 Valid Card Entry
20201006 64834 PC011 117016074 Valid Card Entry
**20201006 64952 PC012 116054003 Valid Card Exit**
The rows marked with ** are employees hitting exit before an entry (for whatever reason), which throws off the count. I want to remove all such rows from the data frame, and I am having a really hard time doing this. The counting software I have written so far reads a Firebird database, builds several data frames from it, counts the rows in each, and displays the result as a simple HTML page on a big screen placed within the premises. The data frame with the issue described above is called 'contractorDf' in the program I am running in production (testing), shown below:
import subprocess
from datetime import datetime
from datetime import date
import pandas as pd
import re
import os
import sys
#------------------------------------------------------PRODUCTION-----------------------------------------#
# Generating a Temporary Date for Production Environment
tempDate = date(2020, 10, 6)
tempDate = str(tempDate)
tempDate = tempDate.replace('-', '')
#------------------------------------------------------PRODUCTION----------------------------------------#
################################################################################################################################
# Getting Current Day (This will be used in real environment)
currentDay = datetime.now().day
if currentDay < 10:
    currentDay = str(currentDay)
    currentDay = '0'+ currentDay
else:
    currentDay = str(currentDay)
# Getting Current Year & Month
currentYear = datetime.now().year
currentMonth = datetime.now().month
currentYear = str(currentYear)
currentMonth = str(currentMonth)
currentYearMonth = currentYear+currentMonth
currentYearMonthDay = currentYearMonth+currentDay
# Getting Variable for After FROM
currentTableName = 'ST'+currentYearMonth
# Getting Final Query (Commented Right now because Testing)
query = "SELECT * FROM " + currentTableName + " " + "WHERE TRDATE=" + currentYearMonthDay + ";"
finalQuery = bytes(query, 'ascii')
#############################################################################################################################
#-------------------------------------------------------PRODUCTION------------------------------------------------------#
# Making a temporary Table Name and Query for Production Environment
tempTableName = 'ST'+currentYearMonth
nonByteQuery = "SELECT * FROM " + tempTableName + " " + "WHERE TRDATE=" + tempDate + ";"
tempQuery = bytes(nonByteQuery, 'ascii')
#-------------------------------------------------------PRODUCTION------------------------------------------------------#
# Generating record.csv file from command prompt (Before initiating this, C:\\Program Files (x86)\\FireBird\\FireBird_2_1\\bin should be in the environment variables)
p = subprocess.Popen('isql', shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
p.stdin.write(b'CONNECT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\TRANS.fdb";') #The italicized b is because its a Byte size code and we can't
p.stdin.write(b'OUTPUT "C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv";')
p.stdin.write(tempQuery)
p.stdin.write(b'OUTPUT;')
p.communicate()
p.terminate()
# Terminating the Command Prompt Window
# Reading the record file that is just generated above
tempdf = pd.read_csv('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv', sep='delimeter', engine='python', header=None, skipinitialspace=True)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)
#tempdf = tempdf[0].astype(str)
columns = ["TRDATE", "TRTIME", "TRCODE", "TRDESC", "CTRLTAG", "CTRLNAME", "CTRLIP", "CARDNO", "STAFFNO", "STAFFNAME", "DEPTNAME", "JOBNAME", "SHIFTNAME", "DEVTYPE", "DEVNAME", "DEVNO", "TRID", "ISCAP", "RCGROUP", "POLLTIME", "SENDSEQ", "RECSEQ", "IOBNO", "IOBNAME", "ZONENO", "ZONENAME", "POINTNO", "POINTNAME", "ISSNAPRET", "PROTRAG"]
header = tempdf.iloc[0]
linespace = tempdf.iloc[1]
header = str(header)
header = header[5:]
header = header[:-24]
linespace = str(linespace)
linespace = linespace[7:]
linespace = linespace[:-23]
tempdf = tempdf[~tempdf[0].str.contains(header)]
tempdf = tempdf[~tempdf[0].str.contains(linespace)]
tempdf = tempdf[0].str.replace(' ', ',')
df = tempdf.str.split(",", n=400, expand=True)
df = df[[0,1,7,8,9,10,31,41,42,43,52,53,54]]
df[100] = df[7].map(str) + ' ' + df[8].map(str) + ' ' + df[9].map(str) + ' ' + df[10].map(str)
df = df.drop([7,8,9,10], axis=1)
df[101] = df[31].map(str) + df[41].map(str)
df = df.drop([31,41], axis=1)
df[102] = df[43].map(str) + df[52].map(str) + df[53].map(str) + df[54].map(str)
df = df.drop([43,52,53,54], axis=1)
def newblock(column):
    if column[42].startswith('VIS'):
        return column[42]
    else:
        pass
df = df.assign(newblock=df.apply(newblock, axis=1))
df[42] = df[42].str.replace(r'VIS_\d\d\d\d\d\d\d\d\d\d', '', regex=True)
df[105] = df[42].map(str) + df[101].map(str)
df = df.drop([42,101], axis=1)
df[106] = df[102].map(str) + df['newblock'].map(str)
df = df.drop(['newblock', 102], axis=1)
df[106] = df[106].str.replace('None', '')
df = df[[0,1,106,105,100]]
columns = ['date', 'timestamp', 'type', 'cardno', 'status']
df.columns = df.columns.map(str)
df.columns = columns
df = df.reset_index()
df = df.drop(['index'], axis=1)
#Making Visitor Counter
visitorDf = df[df['type'].str.startswith('VIS')]
#visitorDf = visitorDf[~visitorDf['status'].str.contains('Unknown')]
visitorIn1 = len(visitorDf[visitorDf['status'].str.contains('Unknown')])
visitorIn1 = int(visitorIn1)
visitorDf = visitorDf.reset_index()
visitorDf = visitorDf.drop(('index'), axis=1)
visitorIn = len(visitorDf[visitorDf['status'].str.contains('Valid Card Entry')])
visitorOut = len(visitorDf[visitorDf['status'].str.contains('Valid Card Exit')])
visitorIn = int(visitorIn)
visitorOut = int(visitorOut)
totalVisitor = visitorIn1 + visitorIn - visitorOut
#Making Contractor Counter
contractorDf = df[df['type'].str.startswith('PC')]
#contractorDf = contractorDf[~contractorDf['status'].str.contains('Unknown')]
contractorIn1 = len(contractorDf[contractorDf['status'].str.contains('Unknown')])
contractorIn1 = int(contractorIn1)
contractorDf = contractorDf.reset_index()
contractorDf = contractorDf.drop(('index'), axis=1)
contractorIn = len(contractorDf[contractorDf['status'].str.contains('Valid Card Entry')])
contractorOut = len(contractorDf[contractorDf['status'].str.contains('Valid Card Exit')])
contractorIn = int(contractorIn)
contractorOut = int(contractorOut)
totalContractor = contractorIn1 + contractorIn - contractorOut
#Making Employee Counter
employeeDf = df[df['type'].str.contains('^\d', regex=True)]
#employeeDf = employeeDf[~employeeDf['status'].str.contains('Unknown')]
employeeIn1 = len(employeeDf[employeeDf['status'].str.contains('Unknown')])
employeeIn1 = int(employeeIn1)
employeeDf = employeeDf.reset_index()
employeeDf = employeeDf.drop(('index'), axis=1)
employeeIn = len(employeeDf[employeeDf['status'].str.contains('Valid Card Entry')])
employeeOut = len(employeeDf[employeeDf['status'].str.contains('Valid Card Exit')])
employeeIn = int(employeeIn)
employeeOut = int(employeeOut)
totalEmployee = employeeIn1 + employeeIn - employeeOut
os.remove('C:\\Users\\JH\\OneDrive\\Desktop\\EntryPass\\P1_Server\\event\\record.csv')
visitor = totalVisitor
employee = totalEmployee
contractor = totalContractor
if os.path.exists('C:\\Apache24\\htdocs\\counter\\index.html'):
    os.remove('c:\\Apache24\\htdocs\\counter\\index.html')
else:
    pass
f = open('C:\\Apache24\\htdocs\\counter\\index.html', 'w')
message = """
<html lang="en-US" class="hide-scroll">
<head>
<title>Emhart Counter</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z" crossorigin="anonymous">
<style>
body {{
background-color: lightblue;
}}
.verticalCenter {{
margin: 0;
top: 100%;
-ms-transform: translateY(25%);
transform: translateY(25%);
}}
</style>
</head>
<body>
<center>
<div class="verticalCenter">
<h1 style=font-size:100px>VISITORS:        {visitor}</h1><br></br><br></br>
<h1 style=font-size:100px>EMPLOYEES:        {employee}</h1><br></br><br></br>
<h1 style=font-size:100px>CONTRACTORS:        {contractor}</h1><br></br><br></br><br></br><br></br>
<h3 style="font-size: 50px">THIS IS A TEST RUN</h3>
</div>
</center>
</body>
</html>"""
new_message = message.format(visitor=visitor, employee=employee, contractor=contractor)
f.write(new_message)
f.close()
sys.exit()
The only remaining problem is how to remove exits for a cardno/type that occur before a corresponding entry in contractorDf. I would really appreciate any help on the matter.
For your example, startswith and endswith would work. For more complex regex patterns, use contains.
mask = df.date.str.startswith("**")
print(df[mask])
# or
mask = df.status.str.endswith("**")
print(df[mask])
Outputs:
date timestamp type cardno status
0 **20201006 55737 PC010 117016056 Valid_Card_Exit**
3 **20201006 64952 PC012 116054003 Valid_Card_Exit**
Setup:
columns = ['date','timestamp','type','cardno','status']
data = [el.split(",") for el in ['**20201006,55737,PC010,117016056,Valid_Card_Exit**',
'20201006,55907,PC010,117016056,Valid_Card_Entry',
'20201006,64834,PC011,117016074,Valid_Card_Entry',
'**20201006,64952,PC012,116054003,Valid_Card_Exit**']]
df = pd.DataFrame(data, columns=columns)
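As a small illustration of the contains route mentioned above, here is a minimal sketch. It assumes the "**" markers are literally part of the status strings, and the sample frame is hypothetical; only Series.str.contains itself is standard pandas.

```python
import pandas as pd

# Hypothetical sample: two rows carry a literal "**" suffix, two do not.
df = pd.DataFrame(
    {
        "cardno": ["117016056", "117016056", "117016074", "116054003"],
        "status": [
            "Valid_Card_Exit**",
            "Valid_Card_Entry",
            "Valid_Card_Entry",
            "Valid_Card_Exit**",
        ],
    }
)

# Regex: the word "Exit" followed by a literal "**" at the end of the string.
# The asterisks are escaped because "*" is a regex quantifier.
mask = df["status"].str.contains(r"Exit\*\*$", regex=True)
marked = df[mask]
print(marked)
```

Negating the mask (df[~mask]) would keep only the unmarked rows.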
The key to the problem is a common mathematical trick. We first regard an entry as 1 and an exit as the cancellation of an entry, namely -1. An exit event is then bad if it is the first to produce a negative cumulative sum (cumsum) up to that row, i.e. the exit cannot be interpreted as a proper cancellation of a previous entry. Note, however, that subsequent negative cumsum values can be caused by earlier bad values. Therefore, we identify ONLY the first negative cumsum value as bad.
Based on this observation, one can find the first bad row for each card recursively until no negative cumsum value is produced.
The implementation below demonstrates how to do this recursively. It is not optimized for large datasets, but the idea should carry over.
# initialize
df["retain"] = True
df["delta"] = -1
df.loc[df["status"] == "Valid Card Entry", "delta"] = 1

def recurse(df):
    # sort for cumsum (bad values found so far are excluded)
    df_sorted = df[df["retain"]].sort_values(by=["cardno", "timestamp"]).reset_index(drop=True)
    # per-card running sum of +1 (entry) / -1 (exit)
    df_sorted["cumsum"] = df_sorted.groupby("cardno")["delta"].cumsum()
    # get the first occurrence of negative cumsum for each card
    df_dup = df_sorted[df_sorted["cumsum"] < 0].groupby("cardno").first()
    # termination condition: no more bad values were found
    if len(df_dup) == 0:
        return
    # else, mark the bad rows as not retained
    for cardno, row in df_dup.iterrows():
        df.loc[(df["cardno"] == cardno) & (df["timestamp"] == row["timestamp"]), "retain"] = False
    # repeat on the remaining rows until no negative cumsum is left
    recurse(df)

# execute
recurse(df)
del df["delta"] # optional cleanup
See the "retain" column (False = bad exits).
df
Out[61]:
date timestamp type cardno status retain
0 20201006 55737 PC010 117016056 Valid Card Exit False
1 20201006 55907 PC010 117016056 Valid Card Entry True
2 20201006 60312 PC006 100024021 Valid Card Entry True
3 20201006 61311 PC006 100024021 Valid Card Exit True
4 20201006 61445 PC006 100024021 Valid Card Entry True
5 20201006 61538 PC006 100024021 Valid Card Exit True
6 20201006 61646 PC010 117016056 Valid Card Exit True
7 20201006 61933 PC006 100024021 Valid Card Entry True
8 20201006 61938 PC010 117016056 Valid Card Entry True
9 20201006 62025 PC006 100024021 Valid Card Exit True
10 20201006 62041 PC010 117016056 Valid Card Exit True
11 20201006 62042 PC006 100024021 Valid Card Entry True
12 20201006 62225 PC010 117016056 Valid Card Entry True
13 20201006 62527 PC006 100024021 Valid Card Exit True
14 20201006 63018 PC006 100024021 Valid Card Entry True
15 20201006 64832 PC007 116057383 Valid Card Entry True
16 20201006 64834 PC011 117016074 Valid Card Entry True
17 20201006 64952 PC012 116054003 Valid Card Exit False
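For larger datasets, the same flagging can be sketched as a single ordered pass per card instead of repeated cumsum passes: keep a running count of unmatched entries and flag any exit that arrives when the count is zero. This is an alternative sketch, not the implementation above. The function name flag_unmatched_exits and the sample frame are hypothetical, and it assumes a unique index and timestamps that sort correctly (numeric, or zero-padded strings).

```python
import pandas as pd

def flag_unmatched_exits(df):
    """Return a boolean Series: False for exits with no earlier unmatched entry."""
    retain = pd.Series(True, index=df.index)
    # Walk each card's events in time order.
    ordered = df.sort_values(["cardno", "timestamp"])
    for _, events in ordered.groupby("cardno"):
        inside = 0  # unmatched entries seen so far for this card
        for idx, status in events["status"].items():
            if status == "Valid Card Entry":
                inside += 1
            elif inside > 0:
                # Any non-entry status is treated as an exit,
                # mirroring the delta = -1 default above.
                inside -= 1
            else:
                # Exit with nobody inside: flag it and leave the count at zero.
                retain.loc[idx] = False
    return retain

# Hypothetical sample data for illustration.
sample = pd.DataFrame(
    {
        "timestamp": [55737, 55907, 61646, 61650, 64952],
        "cardno": [117016056, 117016056, 117016056, 117016056, 116054003],
        "status": [
            "Valid Card Exit",
            "Valid Card Entry",
            "Valid Card Exit",
            "Valid Card Exit",
            "Valid Card Exit",
        ],
    }
)
sample["retain"] = flag_unmatched_exits(sample)
print(sample)
```

Leaving the counter at zero after a flagged exit plays the same role as removing the first negative cumsum value and recomputing.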
For demonstration purposes, the cumsum values before and after cleanup are shown below. The dataset is sorted by (cardno, timestamp), and the date column is dropped for clarity.
Before
df_sorted
Out[69]:
timestamp type cardno status retain delta cumsum
0 60312 PC006 100024021 Valid Card Entry True 1 1
1 61311 PC006 100024021 Valid Card Exit True -1 0
2 61445 PC006 100024021 Valid Card Entry True 1 1
3 61538 PC006 100024021 Valid Card Exit True -1 0
4 61933 PC006 100024021 Valid Card Entry True 1 1
5 62025 PC006 100024021 Valid Card Exit True -1 0
6 62042 PC006 100024021 Valid Card Entry True 1 1
7 62527 PC006 100024021 Valid Card Exit True -1 0
8 63018 PC006 100024021 Valid Card Entry True 1 1
9 64952 PC012 116054003 Valid Card Exit True -1 -1
10 64832 PC007 116057383 Valid Card Entry True 1 1
11 55737 PC010 117016056 Valid Card Exit True -1 -1
12 55907 PC010 117016056 Valid Card Entry True 1 0
13 61646 PC010 117016056 Valid Card Exit True -1 -1
14 61938 PC010 117016056 Valid Card Entry True 1 0
15 62041 PC010 117016056 Valid Card Exit True -1 -1
16 62225 PC010 117016056 Valid Card Entry True 1 0
17 64834 PC011 117016074 Valid Card Entry True 1 1
After
df_sorted
Out[73]:
timestamp type cardno status retain delta cumsum
0 60312 PC006 100024021 Valid Card Entry True 1 1
1 61311 PC006 100024021 Valid Card Exit True -1 0
2 61445 PC006 100024021 Valid Card Entry True 1 1
3 61538 PC006 100024021 Valid Card Exit True -1 0
4 61933 PC006 100024021 Valid Card Entry True 1 1
5 62025 PC006 100024021 Valid Card Exit True -1 0
6 62042 PC006 100024021 Valid Card Entry True 1 1
7 62527 PC006 100024021 Valid Card Exit True -1 0
8 63018 PC006 100024021 Valid Card Entry True 1 1
9 64832 PC007 116057383 Valid Card Entry True 1 1
10 55907 PC010 117016056 Valid Card Entry True 1 1
11 61646 PC010 117016056 Valid Card Exit True -1 0
12 61938 PC010 117016056 Valid Card Entry True 1 1
13 62041 PC010 117016056 Valid Card Exit True -1 0
14 62225 PC010 117016056 Valid Card Entry True 1 1
15 64834 PC011 117016074 Valid Card Entry True 1 1
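Once the bad exits are flagged, the head count can be recomputed from the retained rows only. The following is a minimal sketch under the assumption that the frame already carries the "retain" column from the step above; the sample rows and the names flagged, kept and present are hypothetical.

```python
import pandas as pd

# Hypothetical frame that already carries the "retain" flag.
flagged = pd.DataFrame(
    {
        "cardno": [117016056, 117016056, 117016056, 117016056, 116054003, 116057383],
        "status": [
            "Valid Card Exit",
            "Valid Card Entry",
            "Valid Card Exit",
            "Valid Card Entry",
            "Valid Card Exit",
            "Valid Card Entry",
        ],
        "retain": [False, True, True, True, False, True],
    }
)

# Count only the rows that survived the cleanup.
kept = flagged[flagged["retain"]]
entries = (kept["status"] == "Valid Card Entry").sum()
exits = (kept["status"] == "Valid Card Exit").sum()
present = int(entries - exits)
print(present)
```

Without the filter, the two unmatched exits would be subtracted as well and the count would come out too low.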