I have a csv file which contains data like that
Sample csv
Name | Start | End |
---|---|---|
John | 12:00 | 13:00 |
John | 12:10 | 13:00 |
John | 12:20 | 13:20 |
Tom | 12:00 | 13:10 |
John | 13:50 | 14:00 |
Jerry | 14:00 | 14:30 |
Alice | 15:00 | 16:00 |
Jerry | 11:00 | 15:00 |
Sample output
Avg time taken by different people are:
John (60+50+60+10)/4 min Tom (70)/1 min Jerry (30+240)/2 min Alice (60)/1 min
I tried parsing the csv file by python csv
import datetime
import csv
with open('people.csv', 'r') as file:
reader = csv.DictReader(file)
for row in reader:
print(row['Start'],row['End'])
But i am unable to parse the column with the particular row name belongs to Jerry and find the difference in their time.
Here in case Jerry took maximum time
ex - john [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]
output - [12:00,13:20],[13:50,14:00]
Any help will be appreciated.
If you are willing to use pandas for this, the code below can do the job -
import pandas as pd
df = pd.read_csv("data.csv")
df = df.apply(pd.to_datetime, errors = "ignore")
time_df = df.iloc[:, 1:].diff(axis = 1).drop(columns = "Start").rename(columns = {"End" : "Time Diff"})
time_df["Name"] = df["Name"]
time_df.groupby("Name").mean()
Output -
Time Diff | |
---|---|
Alice | 0 days 01:00:00 |
Jerry | 0 days 02:15:00 |
John | 0 days 00:45:00 |
Tom | 0 days 01:10:00 |
Code Explanation -
The 3rd line in the code reads the csv file you have and converts it into a pandas dataframe. A dataframe is just a table and can be manipulated in the same way.
The 4th line in the code coverts all the values in valid columns to datetime which can help you in finding the time difference. I have passed the parameter errors = "ignore"
as the first column in the dataframe, the column Name
, has string values that cannot be converted to datetime. Passing in the errors
parameter as ignore
would allow me to retain all the original values of that column.
The 5th line of code selects the columns from index 1 onwards and substracts them. Once that's done the drop
function gets implemented and the redundant column with null values. Once that's done the rename
function kicks in and renames the End
column to Time Diff
. All this is stored in a new variable by the name time_df
.
Because time_df
doens't have the name column in it, the 6th line adds it.
Once I have the required dataframe, I just group the data based on the Name
column, meaning all the data belonging to particular person would be worked on separately. This is ideal for us as we want to find the mean time taken by every person. To do that we just apply the mean
function on the grouped data and voila we have the desired output.
without Pandas:
times = {}
with open('people.csv', 'r') as file:
reader = csv.DictReader(file)
for row in reader:
if row['name'] not in times.keys():
times[row['name']] = []
times[row['name']].append(row['End'] - row['Start'])
for person in times.keys():
print(person + ": " + str(sum(times[person]) / len(times[person])))
A simplified code below. You can write it in fewer lines and further in fewer lines by using pandas.
import csv
from datetime import datetime
avg = {}
with open('people.csv', mode='r') as csv_file:
csv_reader = csv.DictReader(csv_file)
for row in csv_reader:
name = row["Name"]
start_time = datetime.strptime(row["Start"], "%H:%M")
end_time = datetime.strptime(row["End"], "%H:%M")
time_taken = (end_time - start_time).total_seconds() / 60.0
if name not in avg.keys():
avg[name] = [time_taken,1]
else:
prev_time_total = avg[name][0]
prev_count = avg[name][1]
new_time_total = prev_time_total + time_taken
new_count = prev_count + 1
avg[name] = [new_time_total,new_count]
for entry in avg:
print(entry,avg[entry][0]/avg[entry][1])
Here's another method without using pandas
-
from datetime import datetime, timedelta
with open("data.csv", "r") as f:
f = csv.DictReader(f)
data = [row for row in f]
diffs = {list(row.values())[0]: [] for row in data}
for row in data:
vals = list(row.values())
diffs[vals[0]].append(datetime.strptime(vals[2], "%H:%M") - datetime.strptime(vals[1], "%H:%M"))
diffs_avg = [str(timedelta(seconds = sum(map(timedelta.total_seconds, times)) / len(times))) for times in diffs.values()]
dict(zip(diffs.keys(), diffs_avg))
Output -
{'Alice': '1:00:00', 'Jerry': '2:15:00', 'John': '0:45:00', 'Tom': '1:10:00'}
calculating with pandas mean and max time in minutes for each person:
import pandas as pd
df = (pd.read_csv('file_01.csv',parse_dates=['Start','End']).
assign(diff=(df1.End-df1.Start).dt.total_seconds()//60).
groupby('Name')['diff'].
agg(['mean','max']))
print(df)
'''
mean max
Name
Alice 60.0 60.0
Jerry 135.0 240.0
John 45.0 60.0
Tom 70.0 70.0
'''
df.to_dict()
>>> out
'''
{'mean': {'Alice': 60.0, 'Jerry': 135.0, 'John': 45.0, 'Tom': 70.0},
'max': {'Alice': 60.0, 'Jerry': 240.0, 'John': 60.0, 'Tom': 70.0}}
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.