Calculate attendance from a CSV file generated by a biometric device

Question

Firstly, I am a complete beginner at Python and this is my first time writing a script for a personal project, so please be gentle in your answers.

The Input

I have an unsorted CSV file with the login times of all employees for a given month. It looks like this:

13,03/02/2020 09:43
12,03/02/2020 10:26
10,03/02/2020 12:12
13,03/02/2020 18:22
12,03/02/2020 18:23
13,03/03/2020 09:51
12,03/03/2020 10:38
10,03/03/2020 12:02
13,03/03/2020 18:28
12,03/03/2020 18:29

where the first column is the employee ID and the second column is the login/logout time.
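As a minimal sketch of reading this layout with the standard library only: the helper name `read_punches` is made up for illustration, and the timestamp format is inferred from the sample rows (month/day/year, 24-hour clock), so adjust it if the device exports something different.

```python
import csv
import io
from datetime import datetime

# Format inferred from the sample rows: month/day/year, 24-hour clock.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"

def read_punches(lines):
    """Return a list of (employee_id, datetime) tuples from CSV lines."""
    punches = []
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        employee_id, stamp = row
        moment = datetime.strptime(stamp.strip(), TIMESTAMP_FORMAT)
        punches.append((employee_id.strip(), moment))
    return punches

# In-memory stand-in for the real file; with a file on disk,
# pass open(path, newline="") instead.
sample = io.StringIO(
    "13,03/02/2020 09:43\n"
    "12,03/02/2020 10:26\n"
    "13,03/02/2020 18:22\n"
)
punches = read_punches(sample)
print(punches[0])
```

Parsing the text into `datetime` objects right away means later steps can sort and subtract the values directly instead of comparing strings.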

I want to know the best/most efficient way to read the login times from the file and calculate the following.

The Output

Basic:
1. How many days the employee was present at the office
2. The total working hours of an employee for each day

Employee ID - xxxx

Date        Duration  
DD/MM/YY    hh:mm:ss
DD/MM/YY    hh:mm:ss
DD/MM/YY    hh:mm:ss

Total No. of Working Days in this month:
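For the hh:mm:ss column, one possible helper is sketched below. The name `format_duration` is hypothetical, and the sketch assumes durations are never negative.

```python
from datetime import timedelta

def format_duration(delta):
    """Format a non-negative timedelta as hh:mm:ss (whole days count as 24 hours each)."""
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

print(format_duration(timedelta(hours=8, minutes=39)))
```

Using `total_seconds()` rather than the `seconds` attribute keeps the result correct when a duration exceeds 24 hours.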

Advanced:
Calculate which days were Sundays and add those days to the employee's attendance as present.
Even more advanced:
Compare with the online Google Calendar for a region to find the holidays in that month for that region, and add those holidays to the employee's attendance.
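The Sunday part needs nothing beyond the standard library. The sketch below uses the hypothetical helper names `sundays_in_month` and `credited_days`, and stands in a hand-written holiday list for the calendar lookup, which is not covered here.

```python
import calendar
from datetime import date

def sundays_in_month(year, month):
    """Return every Sunday of the given month as a list of date objects."""
    days_in_month = calendar.monthrange(year, month)[1]
    all_days = (date(year, month, day) for day in range(1, days_in_month + 1))
    return [day for day in all_days if day.weekday() == 6]  # Monday is 0, Sunday is 6

def credited_days(attended, year, month, holidays=()):
    """Union of days actually attended, Sundays, and any supplied holidays."""
    return sorted(set(attended) | set(sundays_in_month(year, month)) | set(holidays))

attended = [date(2020, 3, 2), date(2020, 3, 3)]
# Hypothetical holiday list; in practice it would come from a calendar source.
holidays = [date(2020, 3, 10)]
print([d.day for d in sundays_in_month(2020, 3)])
print(len(credited_days(attended, 2020, 3, holidays)))
```

Working with sets means a Sunday on which the employee also came in is counted only once.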

My logic:

Read the CSV file, extract the login times, and save them in a sorted list. This creates a list of lists like so:

[['10', '03/02/2020 12:12'],['10', '03/03/2020 12:02'], ['10', '03/06/2020 15:12'], ['10', '03/07/2020 16:18'], ['10', '03/08/2020 11:04'], ['10', '03/08/2020 11:05'], ['10', '03/09/2020 11:27'], ['10', '03/10/2020 17:06'], ['10', '03/11/2020 22:13'], ['10', '03/12/2020 11:13'], ['10', '03/13/2020 11:57'], ['10', '03/14/2020 11:29'], ['10', '03/16/2020 10:32'], ['10', '03/17/2020 17:37'], ['10', '03/18/2020 12:24'], ['10', '03/19/2020 15:38'], ['10', '03/19/2020 15:45'], ['10', '03/20/2020 15:26']]

Convert this list into a sorted dictionary so that all the login times of an employee are saved together in a list. It should look something like:

{'10':['03/02/2020 12:12','03/02/2020 15:38','03/08/2020 11:05'],  
'12':['03/03/2020 11:27','03/03/2020 12:02','03/03/2020 18:29'],  
'13':['03/16/2020 10:32','03/16/2020 11:57','03/16/2020 19:04']}

and so on...

...where the "key" of the dictionary is the employee ID and the "value" is a list of all the login/logout times sorted by date.

For each employee ID, for each day, calculate the time difference between the first login time and the last logout time (there will definitely be multiple entries) using the timedelta class of the datetime module.
Create an Excel file that looks like the expected output shown above.
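The first-to-last calculation described in these steps could be sketched roughly as follows. The nested dictionary layout and the variable names are illustrative choices, not the only way to do it. Subtracting one `datetime` from another yields a `timedelta` directly.

```python
from collections import defaultdict
from datetime import datetime

rows = [['13', '03/02/2020 09:43'], ['12', '03/02/2020 10:26'],
        ['13', '03/02/2020 18:22'], ['12', '03/02/2020 18:23'],
        ['13', '03/03/2020 09:51'], ['13', '03/03/2020 18:28']]

# employee id -> calendar date -> list of punch datetimes
punches = defaultdict(lambda: defaultdict(list))
for employee_id, stamp in rows:
    moment = datetime.strptime(stamp, "%m/%d/%Y %H:%M")
    punches[employee_id][moment.date()].append(moment)

# For each day, the duration is the last punch minus the first punch.
durations = {
    employee_id: {day: max(times) - min(times) for day, times in sorted(days.items())}
    for employee_id, days in punches.items()
}

for day, worked in durations['13'].items():
    print(day.strftime("%d/%m/%y"), worked)
print("Total working days:", len(durations['13']))
```

The number of keys in an employee's inner dictionary is the number of days they were present.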

The Question

This seems like a pretty straightforward and simple task, and yet...

I'm stuck trying to merge the list of lists into a proper dictionary with the employee ID as the key and a list of all their login times as the value. Googling for a possible solution led me to https://thispointer.com/python-how-to-convert-a-list-to-dictionary/. But this doesn't solve my problem, because I'm trying to extract very specific info from the same list.

I couldn't find anything similar on Stack Overflow, so I'm posting a new question.

Again, I'm new to programming, so please let me know whether my logic for solving this problem makes sense or whether I should try a different approach.

PS: I have looked at pandas, but it seems unnecessary to learn it from scratch at this point for such a simple task.
Also, the next step, calculating the time difference, might be more difficult than I imagine, so any help on that would be very welcome.
Also, I'm not asking anyone to write code for me. I want to learn this beautiful language so that I can get better and create scripts like this in a breeze.

If you made it this far, thanks for taking the time: you make the world a better place :)

Answer 1

I guess you're just looking for a way to convert the list of lists to a dict. Try this:

from collections import defaultdict
import pprint

# Sample rows: [employee_id, 'MM/DD/YYYY HH:MM']
rows = [['10', '03/02/2020 12:12'],['10', '03/03/2020 12:02'], ['10', '03/06/2020 15:12'], ['10', '03/07/2020 16:18'], ['10', '03/08/2020 11:04'], ['10', '03/08/2020 11:05'], ['10', '03/09/2020 11:27'], ['10', '03/10/2020 17:06'], ['10', '03/11/2020 22:13'], ['10', '03/12/2020 11:13'], ['10', '03/13/2020 11:57'], ['10', '03/14/2020 11:29'], ['10', '03/16/2020 10:32'], ['10', '03/17/2020 17:37'], ['10', '03/18/2020 12:24'], ['10', '03/19/2020 15:38'], ['10', '03/19/2020 15:45'], ['10', '03/20/2020 15:26'], ['11', '03/19/2020 15:45'], ['11', '03/20/2020 15:26'], ['12', '03/19/2020 15:45'], ['12', '03/20/2020 15:26']]
# defaultdict(list) creates an empty list the first time an employee ID is seen
datesByEmployee = defaultdict(list)
for employee_id, timestamp in rows:
    datesByEmployee[employee_id].append(timestamp)
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(dict(datesByEmployee))

That gives you:

{   '10': [   '03/02/2020 12:12',
              '03/03/2020 12:02',
              [...]],
    '11': ['03/19/2020 15:45', '03/20/2020 15:26'],
    '12': ['03/19/2020 15:45', '03/20/2020 15:26']}
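If the rows are not already in chronological order, each list can be sorted after grouping. The sketch below uses made-up sample values and assumes the employee IDs are numeric strings, as in the question. It sorts by the parsed `datetime` rather than by the raw text, because month-first strings compare incorrectly across a year boundary.

```python
from datetime import datetime

# Made-up sample in the same shape as the dictionary above.
dates_by_employee = {
    '10': ['01/05/2021 09:00', '12/30/2020 18:00', '12/30/2020 09:15'],
    '9': ['03/03/2020 12:02', '03/02/2020 12:12'],
}

def parse(stamp):
    """Turn 'MM/DD/YYYY HH:MM' text into a datetime usable as a sort key."""
    return datetime.strptime(stamp, "%m/%d/%Y %H:%M")

# Sort employees numerically by ID and each employee's stamps chronologically.
sorted_by_employee = {
    employee_id: sorted(stamps, key=parse)
    for employee_id, stamps in sorted(dates_by_employee.items(), key=lambda item: int(item[0]))
}
print(sorted_by_employee)
```

Regular dictionaries keep insertion order in Python 3.7 and later, so building the result in sorted order is enough to get a "sorted dictionary".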

Answer 2

Below is an example output for one employee (ID 13). The file created by my script is called Attendance of ID-13 2020-04-05.txt.

Be aware of two important limitations of my script so far:
1) It creates .txt files rather than .xlsx.
2) It only takes the earliest time of a day and subtracts it from the latest time of that same day.

Limitation 2 also means that when somebody logs in on one day, e.g. on 02 March, and logs out the next day, on 03 March, the duration column of the output file will show "No Logout for this day". Additionally, if a person logs in and out multiple times a day, e.g. for a break, those intermediate times are ignored.
However, these would be separate questions, and solving them is part of your task.
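One way to approach both cases is to pair consecutive punches as in/out intervals instead of taking the daily minimum and maximum. This is only a sketch under a strong assumption: the device records no direction, so the punches are taken to alternate strictly between login and logout. A duplicate punch would shift every later pair, so real data would need cleaning first. The helper names are hypothetical.

```python
from datetime import datetime, timedelta

def worked_intervals(punches):
    """Pair sorted punches as (in, out), (in, out), ...

    Returns the list of pairs and the trailing unpaired punch (or None).
    """
    ordered = sorted(punches)
    pairs = list(zip(ordered[0::2], ordered[1::2]))
    unpaired = ordered[-1] if len(ordered) % 2 else None
    return pairs, unpaired

def total_worked(punches):
    """Sum the length of all complete in/out pairs."""
    pairs, _ = worked_intervals(punches)
    return sum((out - start for start, out in pairs), timedelta())

# A day with a lunch break: 09:00-12:00 and 13:00-18:00.
with_break = [datetime(2020, 3, 2, 9), datetime(2020, 3, 2, 12),
              datetime(2020, 3, 2, 13), datetime(2020, 3, 2, 18)]
# A night shift that crosses midnight: 22:00 on 2 March to 06:00 on 3 March.
overnight = [datetime(2020, 3, 2, 22), datetime(2020, 3, 3, 6)]

print(total_worked(with_break))
print(total_worked(overnight))
```

Because the pairing works on full datetimes rather than on times grouped by date, a shift that ends after midnight is measured correctly, and a leftover unpaired punch can be reported as a missing logout.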

Example output file: Attendance of ID-13 2020-04-05.txt

Employee ID - 13

Date Duration
02/03/2020 8:39:0
03/03/2020 8:37:0

My code / pandas solution:

#!/usr/bin/env python3
import pandas as pd
from pathlib import Path
import numpy as np
import datetime

def time_to_delat(t):
    """Convert datetime.time object with hour and minute to datetime.timedelta object"""
    dt = datetime.timedelta(hours=t.hour, minutes=t.minute)
    return dt
def trans_form_tostring(dt):
    hours = dt.seconds//3600
    minutes = (dt.seconds//60)%60
    seconds = dt.seconds%60
    return f"{hours}:{minutes}:{seconds}"

def main():
    # set path to csv
    path_to_csv = Path("C:/Test/tmp_csv.csv")
    # set names for the columns
    header = ['ID','Datetime']
    # read the csv as a pandas dataframe (the file has no header row)
    df = pd.read_csv(path_to_csv, names=header)
    # Convert the column 'Datetime' to datetime objects (month/day/year, as in the sample data)
    df['Datetime'] = pd.to_datetime(df['Datetime'], format="%m/%d/%Y %H:%M")
    df['Date'] = df['Datetime'].dt.date
    df['Time'] = df['Datetime'].dt.time

    for ID in df.ID.unique():
        # Iterate over every unique ID of employee and Filter for a single ID
        one_employee = df[df['ID']==ID].sort_values(by='Date')
        # Get the earliest start time of a day and the latest time of a day
        start_per_day = one_employee.groupby('Date')['Time'].min()
        end_per_day = one_employee.groupby('Date')['Time'].max()
        # Convert array of datetime.time objects to array of datetime.timedelta objects
        start_per_day_dt = np.array([time_to_delat(x) for x in start_per_day])
        end_per_day_dt = np.array([time_to_delat(x) for x in end_per_day])
        # get the duration for a single day
        delta_per_day = [trans_form_tostring(x) for x in (end_per_day_dt - start_per_day_dt)]
        # Create an empty list dates for the attendance
        attended_days = []
        for i,working_day in enumerate(one_employee.Date.unique()):
            if delta_per_day[i] == "0:0:0":
                delta_per_day[i] = "No Logout for this day"
            day = working_day.strftime("%d/%m/%Y")
            attended_days.append(f"{day}\t{delta_per_day[i]}")
        create_excel_output(ID,attended_days,Path("C:/Test"))

def create_excel_output(ID, dates,outpath=None):
    protocol_file = f"Attendance of ID-{ID} {datetime.date.today()}.txt"
    if outpath is not None:
        protocol_file = outpath / f"Attendance of ID-{ID} {datetime.date.today()}.txt"
    employee = f"Employee ID - {ID}"
    with open(protocol_file,'w') as txt:
        txt.write(employee+"\n\n")
        txt.write("Date\tDuration\n")
        for line in dates:
            txt.write(line)
            txt.write("\n")

if __name__ == '__main__':
    main()
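Regarding limitation 1, a CSV file is a simple intermediate step because Excel can open it directly. The sketch below uses only the standard library. The function name, file name, and row layout are illustrative assumptions rather than part of the script above.

```python
import csv
import tempfile
from pathlib import Path

def write_attendance_csv(path, employee_id, rows):
    """Write one employee's (date, duration) rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Employee ID", employee_id])
        writer.writerow(["Date", "Duration"])
        writer.writerows(rows)
        writer.writerow(["Total working days", len(rows)])

# Temporary directory used here only so the example runs anywhere.
out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / "attendance_13.csv"
write_attendance_csv(out_file, 13, [("02/03/2020", "8:39:0"), ("03/03/2020", "8:37:0")])
print(out_file.read_text(encoding="utf-8"))
```

A true .xlsx file would need an additional library, which is beyond what this sketch covers.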

2 answers

Answer 1
0 2020-04-04 16:23:14

Answer 2
0 2020-04-04 16:38:00
