
Parse CSV file and modify columns

I'd like to change a CSV file in a specific way. This is my sample CSV file:

name,time,Operations
Cassandra,2015-10-06T15:07:22.333662984Z,INSERT
Cassandra,2015-10-06T15:07:24.334536781Z,INSERT
Cassandra,2015-10-06T15:07:27.339662984Z,READ
Cassandra,2015-10-06T15:07:28.344493608Z,READ
Cassandra,2015-10-06T15:07:28.345221189Z,READ
Cassandra,2015-10-06T15:07:29.345623750Z,READ
Cassandra,2015-10-06T15:07:31.352725607Z,UPDATE
Cassandra,2015-10-06T15:07:33.360272493Z,UPDATE
Cassandra,2015-10-06T15:07:38.366408708Z,UPDATE

I know how to read from a CSV file using the Python parser, but I'm a complete beginner with it. I need to get output like this:

start_time,end_time,operation
2015-10-06T15:07:22.333662984Z,2015-10-06T15:07:24.334536781Z,INSERT
2015-10-06T15:07:27.339662984Z,2015-10-06T15:07:29.345623750Z,READ
2015-10-06T15:07:31.352725607Z,2015-10-06T15:07:38.366408708Z,UPDATE

Comment: The start time is the timestamp given at the beginning of a specific query (insert, read, update), and the end time is the timestamp at which that query completes.

Thanks.

It appears from your sample that you can (presumably) guarantee that the first and last entries of a given kind in the Operations column are the start and stop times. If you can't guarantee this, it's slightly more complicated, but let's assume you can't, to be more robust.

One thing we do have to assume is that the CSV contains the complete data. If entries for a particular operation are missing, there's little we can do. We also want to read the timestamps, which we can do using the dateutil.parser module (from the third-party python-dateutil package).
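If installing dateutil is not an option, the timestamps in the sample can also be parsed with the standard library alone. The following is only a sketch: the function name parse_utc_timestamp is made up for illustration, and it assumes every timestamp ends in Z (UTC), as in the sample. Python's datetime stores microseconds, so the last three of the nine fractional digits are dropped.

```python
from datetime import datetime, timezone

def parse_utc_timestamp(text):
    """Parse e.g. '2015-10-06T15:07:22.333662984Z' into an aware datetime.

    Hypothetical helper: assumes a trailing 'Z' and keeps only
    microsecond precision, since datetime cannot hold nanoseconds.
    """
    if not text.endswith("Z"):
        raise ValueError("expected a UTC timestamp ending in 'Z': %r" % text)
    main, _, fraction = text[:-1].partition(".")
    micros = (fraction + "000000")[:6]  # pad or truncate to 6 digits
    parsed = datetime.strptime(main + "." + micros, "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)

earlier = parse_utc_timestamp("2015-10-06T15:07:22.333662984Z")
later = parse_utc_timestamp("2015-10-06T15:07:24.334536781Z")
print(earlier < later)
```

The resulting datetime objects compare with < and > just like the ones dateutil returns.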

So we can start by setting up a short dictionary for keeping track of our values, and a function for populating the dictionary, which accepts one row at a time.

import dateutil.parser

ops = dict()

def update_ops(opsdict, row):

    # first get the timestamp and op name in a useable format
    timestamp = dateutil.parser.parse(row[1])
    op_name = row[2]

    ## now populate, or update the dictionary
    if op_name not in opsdict:
        # sets a new dict entry with the operation's timestamp.
        # since we don't know what the start time and end time 
        # is yet, for the moment set them both.
        opsdict[op_name] = { 'start_time': timestamp,
                            'end_time': timestamp }
    else:
        # now evaluate the current timestamp against each start_time
        # and end_time value. Update as needed.
        if opsdict[op_name]['start_time'] > timestamp:
            opsdict[op_name]['start_time'] = timestamp
        if opsdict[op_name]['end_time'] < timestamp:
            opsdict[op_name]['end_time'] = timestamp
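
To see why comparing against both the start and end values makes the approach robust, here is a small self-contained sketch of the same bookkeeping, fed rows that are deliberately out of order. The name track_span and the whole-second timestamps are made up for illustration; the logic mirrors the function above but uses min and max.

```python
from datetime import datetime

def track_span(spans, op_name, timestamp):
    """Widen the start/end window recorded for op_name to include timestamp."""
    if op_name not in spans:
        # First sighting: start and end are both this timestamp.
        spans[op_name] = {"start_time": timestamp, "end_time": timestamp}
    else:
        entry = spans[op_name]
        entry["start_time"] = min(entry["start_time"], timestamp)
        entry["end_time"] = max(entry["end_time"], timestamp)

# Illustrative events, shuffled so the first row per operation
# is neither the earliest nor the latest.
events = [
    ("READ", datetime(2015, 10, 6, 15, 7, 28)),
    ("INSERT", datetime(2015, 10, 6, 15, 7, 24)),
    ("READ", datetime(2015, 10, 6, 15, 7, 27)),
    ("INSERT", datetime(2015, 10, 6, 15, 7, 22)),
    ("READ", datetime(2015, 10, 6, 15, 7, 29)),
]

spans = {}
for op_name, timestamp in events:
    track_span(spans, op_name, timestamp)

for op_name, window in spans.items():
    print(op_name, window["start_time"], window["end_time"])
```

Because every row is compared against both ends of the window, the order in which rows arrive does not matter.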

Now that we have a function that tracks the earliest and latest time per operation, run through the CSV file reader and populate ops. When we're done, we can generate a new CSV file with the contents from our dictionary.

import csv

with open('/path/to/your/file.csv') as oldcsv:
    cr = csv.reader(oldcsv)
    next(cr)    # throw away the header row

    for row in cr:
        update_ops(ops, row)

# Now write a new csv file – csv.writer is your friend :)
with open('new_operation_times.csv', 'w') as newcsv:
    cw = csv.writer(newcsv)

    # first write your header. csv.writer accepts lists for each row.
    header = 'start_time,end_time,operation'.split(',')
    cw.writerow(header)

    # now write out your dict values. You may want them sorted, 
    # but how to do that has been answered elsewhere on SE.
    for opname, timesdict in ops.items():
        # keep the same column order as the header
        row = [ timesdict['start_time'], timesdict['end_time'], opname ]
        cw.writerow(row)

And you're done! I've tried to make this as explicit as possible so it's clear what's going on. You can probably collapse a lot of this into fewer, more clever steps (such as reading from one CSV and writing it out directly). But if you follow the KISS principle, you'll have an easier time reading this later on, and learning from it again.
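
For comparison, here is one possible collapsed version, offered as a sketch rather than a replacement. It keeps the timestamps as the original strings, so the nanosecond digits and the exact format from the input survive in the output. This relies on an assumption: all timestamps are UTC and have the same fixed-width ISO 8601 layout, as in the sample, so comparing them as strings gives the same order as comparing them as times. The function name summarize_operations is made up, and io.StringIO stands in for real files.

```python
import csv
import io

def summarize_operations(src, dst):
    """Read name,time,Operations rows from src; write one start/end row per operation to dst."""
    spans = {}  # operation -> [start, end]; dicts keep insertion order (Python 3.7+)
    for row in csv.DictReader(src):
        stamp, op = row["time"], row["Operations"]
        if op not in spans:
            spans[op] = [stamp, stamp]
        else:
            # String comparison is only valid under the fixed-width UTC assumption.
            spans[op][0] = min(spans[op][0], stamp)
            spans[op][1] = max(spans[op][1], stamp)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(["start_time", "end_time", "operation"])
    for op, (start, end) in spans.items():
        writer.writerow([start, end, op])

sample = """name,time,Operations
Cassandra,2015-10-06T15:07:22.333662984Z,INSERT
Cassandra,2015-10-06T15:07:24.334536781Z,INSERT
Cassandra,2015-10-06T15:07:27.339662984Z,READ
Cassandra,2015-10-06T15:07:28.344493608Z,READ
Cassandra,2015-10-06T15:07:28.345221189Z,READ
Cassandra,2015-10-06T15:07:29.345623750Z,READ
Cassandra,2015-10-06T15:07:31.352725607Z,UPDATE
Cassandra,2015-10-06T15:07:33.360272493Z,UPDATE
Cassandra,2015-10-06T15:07:38.366408708Z,UPDATE
"""

out = io.StringIO()
summarize_operations(io.StringIO(sample), out)
print(out.getvalue())
```

With real files on Python 3, open both with newline='' as the csv module documentation recommends, and pass the file objects in place of the StringIO instances.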
