Calculate the attendance from a CSV file generated using a biometric device

Question

Firstly, I am a complete beginner at Python and this is my first time writing a script for a personal project so please be gentle in your answers.

The Input

I have an unsorted CSV file with the login times of all employees for a given month that looks like:

13,03/02/2020 09:43
12,03/02/2020 10:26
10,03/02/2020 12:12
13,03/02/2020 18:22
12,03/02/2020 18:23
13,03/03/2020 09:51
12,03/03/2020 10:38
10,03/03/2020 12:02
13,03/03/2020 18:28
12,03/03/2020 18:29

where the first column is the employee ID and the second column is the login/logout time.

I want to know the best/most efficient way to read the login times from the file and calculate:

The Output

Basic:
1. How many days the employee was present at the office
2. The total working hours of an employee for each day

Employee ID - xxxx

Date        Duration  
DD/MM/YY    hh:mm:ss
DD/MM/YY    hh:mm:ss
DD/MM/YY    hh:mm:ss

Total No. of Working Days in this month:

Advanced:
Calculate which days were Sundays and add those days to their attendance as present
Even more Advanced:
Compare with the online Google Calendar for a region to find the holidays in that month for that region and add those holidays to their attendance
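For the "Advanced" item, a minimal sketch using only the standard library might look like the following. The function name sundays_in_month and the attended set are made up for illustration. A set of holiday dates from any source could be merged in the same way with a set union, but fetching them from Google Calendar is not shown here.

```python
import calendar
from datetime import date


def sundays_in_month(year, month):
    """Return every Sunday in the given month as a set of date objects."""
    days_in_month = calendar.monthrange(year, month)[1]
    return {
        date(year, month, day)
        for day in range(1, days_in_month + 1)
        if date(year, month, day).weekday() == 6  # Monday is 0, Sunday is 6
    }


# Hypothetical days on which the employee actually punched in
attended = {date(2020, 3, 2), date(2020, 3, 3)}

# Count Sundays as present by taking the union of both sets
present = attended | sundays_in_month(2020, 3)
print(sorted(present))
```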

My logic:

Read the CSV file and extract the login times and save them in a sorted list. This creates a list of lists like so:

[['10', '03/02/2020 12:12'],['10', '03/03/2020 12:02'], ['10', '03/06/2020 15:12'], ['10', '03/07/2020 16:18'], ['10', '03/08/2020 11:04'], ['10', '03/08/2020 11:05'], ['10', '03/09/2020 11:27'], ['10', '03/10/2020 17:06'], ['10', '03/11/2020 22:13'], ['10', '03/12/2020 11:13'], ['10', '03/13/2020 11:57'], ['10', '03/14/2020 11:29'], ['10', '03/16/2020 10:32'], ['10', '03/17/2020 17:37'], ['10', '03/18/2020 12:24'], ['10', '03/19/2020 15:38'], ['10', '03/19/2020 15:45'], ['10', '03/20/2020 15:26']]
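As a rough sketch of this first step, the csv module can read the rows and datetime.strptime can parse the timestamps, so that sorting works on real dates instead of strings. The name read_punches and the TIME_FORMAT constant are assumptions for illustration. The format string assumes month/day/year, which is what the sample rows suggest.

```python
import csv
import io
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %H:%M"  # assumed: month first, as in the sample rows


def read_punches(csv_file):
    """Return [(employee_id, datetime), ...] sorted by employee, then time."""
    punches = []
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        emp_id, stamp = row[0].strip(), row[1].strip()
        punches.append((emp_id, datetime.strptime(stamp, TIME_FORMAT)))
    # Tuples sort by employee ID first, then by the parsed datetime
    punches.sort()
    return punches


# A small in-memory stand-in for the real file
sample = io.StringIO(
    "13,03/02/2020 09:43\n"
    "12,03/02/2020 10:26\n"
    "10,03/02/2020 12:12\n"
    "13,03/02/2020 18:22\n"
)
for emp_id, when in read_punches(sample):
    print(emp_id, when)
```

With a real file, the same function could be called inside `with open(path, newline="") as f:`, where path is wherever the device export is saved.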

Convert this list into a sorted dictionary so that all the login times of an employee are saved together in a list. To look something like:

{'10':['03/02/2020 12:12','03/02/2020 15:38','03/08/2020 11:05'],  
'12':['03/03/2020 11:27','03/03/2020 12:02','03/03/2020 18:29'],  
'13':['03/16/2020 10:32','03/16/2020 11:57','03/16/2020 19:04']}

and so on...

...where, the "key" of the dictionary is the employee ID and the "value" is a list of all the login/logout times sorted by date

For each employee ID, for each day, calculate the time difference between the first login time and the last logout time (there will definitely be multiple entries) using timedelta from the datetime module
Create an Excel file that looks like the expected output shown above
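The time-difference step could be sketched roughly as below, assuming the dictionary described above already exists. Subtracting two datetime objects yields a timedelta directly. The names daily_durations and logins are hypothetical, and the sketch prints to the console instead of writing an Excel file.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %H:%M"  # assumed: month/day/year, as in the sample data


def daily_durations(stamps):
    """Map each date to (last punch - first punch) for one employee."""
    by_day = defaultdict(list)
    for text in stamps:
        when = datetime.strptime(text, TIME_FORMAT)
        by_day[when.date()].append(when)
    # datetime - datetime gives a timedelta
    return {day: max(times) - min(times) for day, times in sorted(by_day.items())}


# Hypothetical dictionary in the shape described in the question
logins = {"13": ["03/02/2020 09:43", "03/02/2020 18:22",
                 "03/03/2020 09:51", "03/03/2020 18:28"]}

for emp_id, stamps in logins.items():
    print(f"Employee ID - {emp_id}")
    durations = daily_durations(stamps)
    for day, worked in durations.items():
        print(day.strftime("%d/%m/%y"), worked)
    print("Total No. of Working Days in this month:", len(durations))
```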

The Question

Seems like a pretty straightforward and simple task and yet...

I'm stuck at trying to merge the list of lists into a proper dictionary with the employee id as the key and a list of all their login times as the value. Trying to google a possible solution led me to https://thispointer.com/python-how-to-convert-a-list-to-dictionary/ . But this doesn't help my problem because I'm trying to extract very specific info from the same list.

Couldn't find anything similar on stackoverflow so I'm posting a new question.

Again, I'm new to programming so please let me know if my logic of going about solving this problem makes sense or should I try a different approach.

PS: I have looked at pandas but it seems unnecessary to learn from scratch at this point for such a simple task.
Also, the next step, calculating the time difference might be more difficult than I imagine, so any help on that would be very welcome.
Also, I'm not asking to write code for me. I want to learn this beautiful language so that I can get better and create scripts like this in a breeze.

If you made it this far, thanks for taking the time: You make the world a better place :)

Answer 1

I guess you're just searching for a way to convert the list of lists to a dict; try this:

from collections import defaultdict
import pprint
l = [['10', '03/02/2020 12:12'],['10', '03/03/2020 12:02'], ['10', '03/06/2020 15:12'], ['10', '03/07/2020 16:18'], ['10', '03/08/2020 11:04'], ['10', '03/08/2020 11:05'], ['10', '03/09/2020 11:27'], ['10', '03/10/2020 17:06'], ['10', '03/11/2020 22:13'], ['10', '03/12/2020 11:13'], ['10', '03/13/2020 11:57'], ['10', '03/14/2020 11:29'], ['10', '03/16/2020 10:32'], ['10', '03/17/2020 17:37'], ['10', '03/18/2020 12:24'], ['10', '03/19/2020 15:38'], ['10', '03/19/2020 15:45'], ['10', '03/20/2020 15:26'], ['11', '03/19/2020 15:45'], ['11', '03/20/2020 15:26'], ['12', '03/19/2020 15:45'], ['12', '03/20/2020 15:26']]
datesByEmployee = defaultdict(list)
for ll in l:
    datesByEmployee[ll[0]].append(ll[1])
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(dict(datesByEmployee))

That gives you:

{   '10': [   '03/02/2020 12:12',
              '03/03/2020 12:02',
              [...]],
    '11': ['03/19/2020 15:45', '03/20/2020 15:26'],
    '12': ['03/19/2020 15:45', '03/20/2020 15:26']}

Answer 2

Below is an example of the output for one employee (ID 13). The file created by my script is called Attendance of ID-13 2020-04-05.txt.

Be aware of two important limitations of my script so far:
1) It creates .txt files rather than .xlsx.
2) It only takes the earliest time of a day and subtracts it from the latest time of that same day.

Limitation 2 also means that when somebody logs in on one day, e.g. on 02 March, and logs out the next day, on 03 March, the duration column of the output file will show "No Logout for this day". Additionally, if a person logs in and out multiple times a day, e.g. for a break, those intermediate times are ignored.
However, these would be separate questions, which are part of your task to solve.
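One possible way around limitation 2, sketched here under a strong assumption: the punches of one employee strictly alternate between login and logout. In that case the sorted timestamps can be paired two at a time, which also covers shifts that cross midnight. The function name paired_duration is made up for this sketch and is not part of the script below.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%m/%d/%Y %H:%M"  # assumed: month/day/year, as in the sample data


def paired_duration(stamps):
    """Sum (out - in) over consecutive punch pairs; return (total, unmatched)."""
    times = sorted(datetime.strptime(s, TIME_FORMAT) for s in stamps)
    total = timedelta()
    # Walk the list two at a time: times[0] -> times[1], times[2] -> times[3], ...
    for start, end in zip(times[0::2], times[1::2]):
        total += end - start
    # An odd number of punches leaves the last one without a partner
    unmatched = times[-1] if len(times) % 2 else None
    return total, unmatched


total, unmatched = paired_duration([
    "03/02/2020 22:00", "03/03/2020 06:00",  # overnight shift
    "03/03/2020 09:00", "03/03/2020 12:30",  # morning block
    "03/03/2020 13:15",                      # login with no logout
])
print(total, unmatched)
```

If a punch is ever missed, every later pair shifts by one, so real data would probably need a sanity check on the length of each interval.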

Example Outputfile: Attendance of ID-13 2020-04-05.txt

Employee ID - 13

Date Duration
02/03/2020 8:39:0
03/03/2020 8:37:0

My code / pandas solution:

#!/usr/bin/env python3
import pandas as pd
from pathlib import Path
import numpy as np
import datetime

def time_to_delat(t):
    """Convert datetime.time object with hour and minute to datetime.timedelta object"""
    dt = datetime.timedelta(hours=t.hour, minutes=t.minute)
    return dt
def trans_form_tostring(dt):
    hours = dt.seconds//3600
    minutes = (dt.seconds//60)%60
    seconds = dt.seconds%60
    return f"{hours}:{minutes}:{seconds}"

def main():
    # set path to csv
    path_to_csv = Path("C:/Test/tmp_csv.csv")
    # set names for the columns
    header = ['ID','Datetime']
    # read the csv as pandas dataframe (the file has no header row)
    df = pd.read_csv(path_to_csv, names=header)
    # Convert the column 'Datetime' to datetime objects (month/day/year, as in the sample data)
    df['Datetime'] = pd.to_datetime(df['Datetime'], format='%m/%d/%Y %H:%M')
    df['Date'] = df['Datetime'].dt.date
    df['Time'] = df['Datetime'].dt.time

    for ID in df.ID.unique():
        # Iterate over every unique ID of employee and Filter for a single ID
        one_employee = df[df['ID']==ID].sort_values(by='Date')
        # Get the earliest start time of a day and the latest time of a day
        start_per_day = one_employee.groupby('Date')['Time'].min()
        end_per_day = one_employee.groupby('Date')['Time'].max()
        # Convert array of datetime.time objects to array of datetime.timedelta objects
        start_per_day_dt = np.array([time_to_delat(x) for x in start_per_day])
        end_per_day_dt = np.array([time_to_delat(x) for x in end_per_day])
        # get the duration for a single day
        delta_per_day = [trans_form_tostring(x) for x in (end_per_day_dt - start_per_day_dt)]
        # Create an empty list dates for the attendance
        attended_days = []
        for i,working_day in enumerate(one_employee.Date.unique()):
            if delta_per_day[i] == "0:0:0":
                delta_per_day[i] = "No Logout for this day"
            day = working_day.strftime("%d/%m/%Y")
            attended_days.append(f"{day}\t{delta_per_day[i]}")
        create_excel_output(ID,attended_days,Path("C:/Test"))

def create_excel_output(ID, dates,outpath=None):
    protocol_file = f"Attendance of ID-{ID} {datetime.date.today()}.txt"
    if outpath is not None:
        protocol_file = outpath / f"Attendance of ID-{ID} {datetime.date.today()}.txt"
    employee = f"Employee ID - {ID}"
    with open(protocol_file,'w') as txt:
        txt.write(employee+"\n\n")
        txt.write("Date\tDuration\n")
        for line in dates:
            txt.write(line)
            txt.write("\n")

if __name__ == '__main__':
    main()
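The durations in the example output file are not zero-padded (8:39:0), while the question asks for hh:mm:ss. A small helper along these lines could produce the padded form. The name format_duration is hypothetical, and the sketch assumes a non-negative timedelta.

```python
from datetime import timedelta


def format_duration(delta):
    """Format a non-negative timedelta as zero-padded hh:mm:ss."""
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(format_duration(timedelta(hours=8, minutes=39)))
```

Using total_seconds() instead of the seconds attribute means a duration longer than 24 hours is reported as total hours, not silently wrapped.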

2 answers

solution1
0 2020-04-04 16:23:14

solution2
0 2020-04-04 16:38:00
