How to parse a CSV file in Python to get this output?

I have a CSV file that contains data like this:

Sample CSV

Name    Start   End
John    12:00   13:00
John    12:10   13:00
John    12:20   13:20
Tom     12:00   13:10
John    13:50   14:00
Jerry   14:00   14:30
Alice   15:00   16:00
Jerry   11:00   15:00

  1. I need to find the average time taken by each person in Python. How do I do that?

Sample output

Average time taken by each person:

John   (60+50+60+10)/4 min
Tom    (70)/1 min
Jerry  (30+240)/2 min
Alice  (60)/1 min

I tried parsing the CSV file with Python's csv module:

import datetime
import csv


with open('people.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row['Start'],row['End'])

But I am unable to pick out the rows that belong to a particular name, such as Jerry, and find the difference between their start and end times.

  2. I also need to find which person took the maximum time.

In this case, Jerry took the maximum time.

  3. I also need to perform a merge operation.

Example - John: [12:00,13:00],[12:10,13:00],[12:20,13:20],[13:50,14:00]

Output - [12:00,13:20],[13:50,14:00]

Any help will be appreciated.

If you are willing to use pandas for this, the code below can do the job -

import pandas as pd

df = pd.read_csv("data.csv")
df = df.apply(pd.to_datetime, errors = "ignore")
time_df = df.iloc[:, 1:].diff(axis = 1).drop(columns = "Start").rename(columns = {"End" : "Time Diff"})
time_df["Name"] = df["Name"]
time_df.groupby("Name").mean()

Output -

        Time Diff
Alice   0 days 01:00:00
Jerry   0 days 02:15:00
John    0 days 00:45:00
Tom     0 days 01:10:00

Code Explanation -

  • The 3rd line in the code reads the CSV file you have and converts it into a pandas dataframe. A dataframe is just a table and can be manipulated in the same way.

  • The 4th line in the code converts all the values in valid columns to datetime, which helps in finding the time difference. I passed the parameter errors = "ignore" because the first column in the dataframe, the column Name, has string values that cannot be converted to datetime. Passing the errors parameter as ignore lets me retain all the original values of that column.

  • The 5th line of code selects the columns from index 1 onwards and subtracts each column from the one after it. The drop function then removes the redundant Start column, which holds only null values. After that, the rename function renames the End column to Time Diff. All this is stored in a new variable named time_df.

  • Because time_df doesn't have the Name column in it, the 6th line adds it.

  • Once I have the required dataframe, I just group the data based on the Name column, meaning all the data belonging to a particular person is worked on separately. This is ideal for us, as we want to find the mean time taken by every person. To do that we just apply the mean function on the grouped data, and voila, we have the desired output.
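The steps explained above can also be sketched with explicit parsing of the two time columns, which avoids errors = "ignore" (recent pandas releases deprecate that option). This is only a minimal sketch: the inline SAMPLE text, the comma-separated layout with a Name,Start,End header, and the function name average_time_per_person are assumptions made for illustration, not part of the original answer.

```python
import io

import pandas as pd

# Inline copy of the question's sample data (assumed to be comma-separated),
# so the sketch runs without a file on disk.
SAMPLE = """Name,Start,End
John,12:00,13:00
John,12:10,13:00
John,12:20,13:20
Tom,12:00,13:10
John,13:50,14:00
Jerry,14:00,14:30
Alice,15:00,16:00
Jerry,11:00,15:00
"""


def average_time_per_person(csv_source):
    """Return a Series of mean durations (Timedelta) indexed by Name."""
    df = pd.read_csv(csv_source)
    # Parse only the two time columns, with an explicit HH:MM format.
    start = pd.to_datetime(df["Start"], format="%H:%M")
    end = pd.to_datetime(df["End"], format="%H:%M")
    df["Time Diff"] = end - start
    # Group the rows by person and average each person's durations.
    return df.groupby("Name")["Time Diff"].mean()


averages = average_time_per_person(io.StringIO(SAMPLE))
print(averages)
```

To read from a file instead, pass its path (for example "data.csv") in place of the io.StringIO object.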

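Without pandas:

import csv
from datetime import datetime, timedelta

times = {}

with open('people.csv', 'r') as file:
    reader = csv.DictReader(file)
    for row in reader:
        # Parse the HH:MM strings so that they can be subtracted.
        start = datetime.strptime(row['Start'], '%H:%M')
        end = datetime.strptime(row['End'], '%H:%M')
        # Collect each person's durations under their name.
        times.setdefault(row['Name'], []).append(end - start)

for person, durations in times.items():
    # sum() needs a timedelta start value to add timedeltas together.
    print(person + ": " + str(sum(durations, timedelta()) / len(durations)))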

Here is a simple version. It could be written in fewer lines, and in fewer still by using pandas.

import csv
from datetime import datetime

avg = {}
with open('people.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        name = row["Name"]
        start_time = datetime.strptime(row["Start"], "%H:%M")
        end_time = datetime.strptime(row["End"], "%H:%M")
        time_taken = (end_time - start_time).total_seconds() / 60.0
        if name not in avg.keys():
            avg[name] = [time_taken,1]
        else:
            prev_time_total = avg[name][0]
            prev_count = avg[name][1]
            new_time_total = prev_time_total + time_taken
            new_count = prev_count + 1
            avg[name] = [new_time_total,new_count]

for entry in avg:
    print(entry,avg[entry][0]/avg[entry][1])

Here's another method without using pandas -

import csv
from datetime import datetime, timedelta

with open("data.csv", "r") as f:
    f = csv.DictReader(f)
    data = [row for row in f]

diffs = {list(row.values())[0]: [] for row in data}
for row in data:
    vals = list(row.values())
    diffs[vals[0]].append(datetime.strptime(vals[2], "%H:%M") - datetime.strptime(vals[1], "%H:%M"))

diffs_avg = [str(timedelta(seconds = sum(map(timedelta.total_seconds, times)) / len(times))) for times in diffs.values()]
dict(zip(diffs.keys(), diffs_avg))

Output -

{'Alice': '1:00:00', 'Jerry': '2:15:00', 'John': '0:45:00', 'Tom': '1:10:00'}
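For the second part of the question (which person took the maximum time), a standard-library sketch might look like the following. It reads "maximum time" as the longest single entry, which is an assumption; the inline SAMPLE text, its comma-separated layout, and the helper names durations_in_minutes and person_with_longest_entry are likewise hypothetical.

```python
import csv
import io
from datetime import datetime

# Inline copy of the question's sample data (assumed to be comma-separated).
SAMPLE = """Name,Start,End
John,12:00,13:00
John,12:10,13:00
John,12:20,13:20
Tom,12:00,13:10
John,13:50,14:00
Jerry,14:00,14:30
Alice,15:00,16:00
Jerry,11:00,15:00
"""


def durations_in_minutes(csv_file):
    """Yield a (name, minutes) pair for every row of the CSV."""
    for row in csv.DictReader(csv_file):
        start = datetime.strptime(row["Start"], "%H:%M")
        end = datetime.strptime(row["End"], "%H:%M")
        yield row["Name"], (end - start).total_seconds() / 60


def person_with_longest_entry(csv_file):
    """Return the (name, minutes) pair with the longest single duration."""
    return max(durations_in_minutes(csv_file), key=lambda pair: pair[1])


name, minutes = person_with_longest_entry(io.StringIO(SAMPLE))
print(name, minutes)  # Jerry 240.0
```

If "maximum time" should instead mean the largest total or the largest average per person, the same generator can feed a dictionary of per-person sums before taking the maximum.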

Calculating the mean and max time in minutes for each person with pandas:

import pandas as pd

df = (pd.read_csv('file_01.csv',parse_dates=['Start','End']).
      assign(diff=lambda d: (d.End - d.Start).dt.total_seconds() // 60).
      groupby('Name')['diff'].
      agg(['mean','max']))

print(df)
'''
        mean    max
Name               
Alice   60.0   60.0
Jerry  135.0  240.0
John    45.0   60.0
Tom     70.0   70.0
'''

df.to_dict()

>>> out
'''
{'mean': {'Alice': 60.0, 'Jerry': 135.0, 'John': 45.0, 'Tom': 70.0},
 'max': {'Alice': 60.0, 'Jerry': 240.0, 'John': 60.0, 'Tom': 70.0}}
'''
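The merge operation from the question (combining one person's overlapping intervals) can be sketched with the usual sort-then-sweep approach. This is a minimal sketch under assumptions: times are HH:MM strings within a single day, and the helper names to_minutes, to_hhmm and merge_intervals are made up for illustration.

```python
def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes):
    """Convert minutes since midnight back to an 'HH:MM' string."""
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}"


def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs given as 'HH:MM' strings."""
    merged = []
    # Sorting by start time means each interval can only overlap the last merged one.
    for start, end in sorted((to_minutes(s), to_minutes(e)) for s, e in intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend its end.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [[to_hhmm(s), to_hhmm(e)] for s, e in merged]


john = [("12:00", "13:00"), ("12:10", "13:00"), ("12:20", "13:20"), ("13:50", "14:00")]
merged_john = merge_intervals(john)
print(merged_john)  # [['12:00', '13:20'], ['13:50', '14:00']]
```

Note that intervals that merely touch (one ends exactly when the next starts) are merged here as well; change `<=` to `<` if they should stay separate.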
