
How to parse a CSV file in Python to get this output?

I have a CSV file that contains data like this:

Sample csv

Name Start End
John 12:00 13:00
John 12:10 13:00
John 12:20 13:20
Tom 12:00 13:10
John 13:50 14:00
Jerry 14:00 14:30
Alice 15:00 16:00
Jerry 11:00 15:00
  1. I need to find the average time taken by each person in Python. How do I do that?

Sample output

The average time taken by each person is:

John (60+50+60+10)/4 min
Tom (70)/1 min
Jerry (30+240)/2 min
Alice (60)/1 min

I tried parsing the CSV file with Python's csv module:

import datetime
import csv


with open('people.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row['Start'],row['End'])

But I am unable to pick out the rows that belong to a particular name, such as Jerry, and find the difference between their Start and End times.

  2. I also need to find which person took the maximum time.

In this case, Jerry took the maximum time.

  3. I also need to perform a merge operation on each person's overlapping intervals.

Example - John [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]

Output - [12:00,13:20],[13:50,14:00]

Any help will be appreciated.

If you are willing to use pandas for this, the code below can do the job -

import pandas as pd

df = pd.read_csv("data.csv")
df = df.apply(pd.to_datetime, errors = "ignore")
time_df = df.iloc[:, 1:].diff(axis = 1).drop(columns = "Start").rename(columns = {"End" : "Time Diff"})
time_df["Name"] = df["Name"]
time_df.groupby("Name").mean()

Output -

Time Diff
Alice 0 days 01:00:00
Jerry 0 days 02:15:00
John 0 days 00:45:00
Tom 0 days 01:10:00

Code Explanation -

  • The 3rd line in the code reads your csv file and converts it into a pandas dataframe. A dataframe is just a table and can be manipulated in the same way.

  • The 4th line in the code converts the values in all valid columns to datetime, which makes it possible to find the time difference. I have passed the parameter errors = "ignore" because the first column in the dataframe, the column Name , has string values that cannot be converted to datetime. Passing ignore as the errors parameter retains the original values of that column. Note that recent pandas releases (2.2 and later) deprecate errors = "ignore" , so on those versions it is safer to convert only the Start and End columns.

  • The 5th line of code selects the columns from index 1 onwards and takes the difference between them along each row, which leaves End minus Start in the End column and null values in the Start column. The drop function then removes the redundant Start column, and the rename function renames the End column to Time Diff . All this is stored in a new variable named time_df .

  • Because time_df doesn't have the Name column in it, the 6th line adds it.

  • Once I have the required dataframe, I group the data on the Name column, meaning all the rows belonging to a particular person are worked on separately. This is ideal here, as we want the mean time taken by each person. To get it, we apply the mean function to the grouped data, which gives the desired output.
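
The same steps can be sketched without errors = "ignore" by converting only the two time columns with an explicit format. This is a minimal sketch rather than the answer's original code: it assumes the file is comma-separated, it inlines the question's sample rows through io.StringIO so that it runs without a file, and the helper name average_time is made up for illustration.

```python
import io

import pandas as pd

# Sample rows from the question, assumed to be comma-separated in the real file.
SAMPLE_CSV = """Name,Start,End
John,12:00,13:00
John,12:10,13:00
John,12:20,13:20
Tom,12:00,13:10
John,13:50,14:00
Jerry,14:00,14:30
Alice,15:00,16:00
Jerry,11:00,15:00
"""


def average_time(csv_source):
    """Return the mean duration per person as a Series of Timedelta values."""
    df = pd.read_csv(csv_source)
    # Parse only the time columns; the Name column is left untouched.
    start = pd.to_datetime(df["Start"], format="%H:%M")
    end = pd.to_datetime(df["End"], format="%H:%M")
    df["Time Diff"] = end - start
    # Group the rows by person and average the durations in each group.
    return df.groupby("Name")["Time Diff"].mean()


result = average_time(io.StringIO(SAMPLE_CSV))
print(result)
```

With a real file, average_time("data.csv") would be called with the path instead of the StringIO object.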

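Without pandas:

import csv
from datetime import datetime

times = {}

with open('people.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        # Column names are case-sensitive: the header is 'Name', not 'name'.
        if row['Name'] not in times:
            times[row['Name']] = []

        # Parse the HH:MM strings before subtracting; store the result in minutes.
        start = datetime.strptime(row['Start'], '%H:%M')
        end = datetime.strptime(row['End'], '%H:%M')
        times[row['Name']].append((end - start).total_seconds() / 60)

# Print each person's average time in minutes.
for person in times:
    print(person + ": " + str(sum(times[person]) / len(times[person])))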

Simplified code is below. It could be written in fewer lines, and in fewer still by using pandas.

import csv
from datetime import datetime

avg = {}
with open('people.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        name = row["Name"]
        start_time = datetime.strptime(row["Start"], "%H:%M")
        end_time = datetime.strptime(row["End"], "%H:%M")
        time_taken = (end_time - start_time).total_seconds() / 60.0
        if name not in avg.keys():
            avg[name] = [time_taken,1]
        else:
            prev_time_total = avg[name][0]
            prev_count = avg[name][1]
            new_time_total = prev_time_total + time_taken
            new_count = prev_count + 1
            avg[name] = [new_time_total,new_count]

for entry in avg:
    print(entry,avg[entry][0]/avg[entry][1])
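
The question also asks which person took the maximum time. A minimal standard-library sketch follows, assuming "maximum" means either the largest average or the largest total per person (the question does not say which; both point to Jerry for the sample data). It assumes a comma-separated file and inlines the sample rows so it runs on its own; the function name minutes_per_person is hypothetical.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Sample rows from the question, assumed to be comma-separated in the real file.
SAMPLE_CSV = """Name,Start,End
John,12:00,13:00
John,12:10,13:00
John,12:20,13:20
Tom,12:00,13:10
John,13:50,14:00
Jerry,14:00,14:30
Alice,15:00,16:00
Jerry,11:00,15:00
"""


def minutes_per_person(csv_file):
    """Map each name to the list of durations (in minutes) of that person's rows."""
    durations = defaultdict(list)
    for row in csv.DictReader(csv_file):
        start = datetime.strptime(row["Start"], "%H:%M")
        end = datetime.strptime(row["End"], "%H:%M")
        durations[row["Name"]].append((end - start).total_seconds() / 60)
    return durations


durations = minutes_per_person(io.StringIO(SAMPLE_CSV))

# Person with the largest average duration.
largest_average = max(durations, key=lambda n: sum(durations[n]) / len(durations[n]))
# Person with the largest total duration.
largest_total = max(durations, key=lambda n: sum(durations[n]))
print(largest_average, largest_total)
```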

Here's another method without using pandas -

import csv
from datetime import datetime, timedelta

with open("data.csv", "r") as f:
    reader = csv.DictReader(f)
    data = [row for row in reader]

diffs = {list(row.values())[0]: [] for row in data}
for row in data:
    vals = list(row.values())
    diffs[vals[0]].append(datetime.strptime(vals[2], "%H:%M") - datetime.strptime(vals[1], "%H:%M"))

diffs_avg = [str(timedelta(seconds = sum(map(timedelta.total_seconds, times)) / len(times))) for times in diffs.values()]
dict(zip(diffs.keys(), diffs_avg))

Output -

{'Alice': '1:00:00', 'Jerry': '2:15:00', 'John': '0:45:00', 'Tom': '1:10:00'}

Calculating the mean and max time in minutes for each person with pandas:

import pandas as pd

# assign() takes a lambda so that Start and End come from the dataframe just read
df = (pd.read_csv('file_01.csv',parse_dates=['Start','End']).
      assign(diff=lambda d: (d.End-d.Start).dt.total_seconds()//60).
      groupby('Name')['diff'].
      agg(['mean','max']))

print(df)
'''
        mean    max
Name               
Alice   60.0   60.0
Jerry  135.0  240.0
John    45.0   60.0
Tom     70.0   70.0
'''

df.to_dict()

>>> out
'''
{'mean': {'Alice': 60.0, 'Jerry': 135.0, 'John': 45.0, 'Tom': 70.0},
 'max': {'Alice': 60.0, 'Jerry': 240.0, 'John': 60.0, 'Tom': 70.0}}
'''
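
The merge operation from the question is not covered by the answers above. A minimal sketch of one common approach follows: sort the intervals by start time, then extend the previous interval whenever the next one begins before it ends. The function name merge_intervals is hypothetical, and treating intervals that merely touch as overlapping is an assumption.

```python
from datetime import datetime


def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs given as "HH:MM" strings."""
    parsed = sorted(
        (datetime.strptime(start, "%H:%M"), datetime.strptime(end, "%H:%M"))
        for start, end in intervals
    )
    merged = []
    for start, end in parsed:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend its end.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # No overlap: start a new merged interval.
            merged.append([start, end])
    return [(s.strftime("%H:%M"), e.strftime("%H:%M")) for s, e in merged]


john = [("12:00", "13:00"), ("12:10", "13:00"), ("12:20", "13:20"), ("13:50", "14:00")]
print(merge_intervals(john))
# [('12:00', '13:20'), ('13:50', '14:00')]
```

To apply this per person, the rows can first be grouped by Name (for example in a dict of lists) and merge_intervals called on each list.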


 