
How to compare lines in a data (.csv) file based on two values, then roll up the data using Python?

I have a .csv file with 25 columns. In this data, column 18 is a People_ID and column 19 is a Donation Date. I have pre-sorted the data using Linux so that all rows for a given People_ID appear together, sorted by donation date in descending order.

Here is where I'm not sure how to proceed. I need to find all lines that have the same People_ID and Donation Date, sum various values, and then write a single line to the output. So essentially, every line in the output would be either a different customer, or a different donation date by the same customer. Would it be best to use a dictionary with the People_ID as the key? How would this look syntactically?

I was thinking something like this:

import csv

data_dict = {}
with open("file.csv", newline="") as csv_file:
    for row in csv.reader(csv_file, delimiter=','):
        if row[18] in data_dict:
            pass  # something something

Since you have pre-sorted the data, the code can be organized to call a function once per person, passing it the rows for that person each time.

Since the data is pre-sorted, we assume the rows for person 1 are together, then the rows for person 2 (or 39, or some other number), and so on. So we need to detect when the person ID in field 18 changes. To do this, we use a variable last_person to track which person we are processing. The variable row_cache collects the rows for a single person.

import csv

def process_person(rows):
    if not rows:
        return
    # do something with the rows for this person
    # and print the result somewhere useful

last_person = None  # People_ID of the rows currently in row_cache
row_cache = []
with open("file.csv", newline="") as csv_file:
    for row in csv.reader(csv_file, delimiter=','):
        if row[18] == last_person:
            row_cache.append(row)
        else:
            # the person changed: flush the previous person's rows
            process_person(row_cache)
            row_cache = [row]
            last_person = row[18]
process_person(row_cache)  # flush the rows of the final person
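
Because the file is already sorted, the same change-detection idea can also be written with itertools.groupby from the standard library, which yields one group per run of consecutive rows that share a key. What follows is only a minimal sketch, not the answerer's code: it groups on both People_ID and Donation Date (indexes 18 and 19, as used in the question) and sums a single numeric column. The AMOUNT_COL index and the roll_up and make_row names are hypothetical; substitute whichever columns actually need summing.

```python
from itertools import groupby
from operator import itemgetter

AMOUNT_COL = 20  # hypothetical index of the column to sum


def roll_up(rows, key_cols=(18, 19), amount_col=AMOUNT_COL):
    """Yield (people_id, donation_date, total) for each run of equal keys.

    Assumes the rows are already sorted so that equal keys are adjacent.
    """
    key = itemgetter(*key_cols)
    for (people_id, date), group in groupby(rows, key=key):
        yield people_id, date, sum(float(r[amount_col]) for r in group)


def make_row(people_id, date, amount):
    """Build a 25-column row of strings, as csv.reader would return."""
    row = [""] * 25
    row[18], row[19], row[AMOUNT_COL] = people_id, date, amount
    return row


# Small in-memory sample standing in for the sorted file.
sample = [
    make_row("7", "2020-03-01", "10.0"),
    make_row("7", "2020-03-01", "5.5"),
    make_row("7", "2020-01-15", "2.0"),
    make_row("9", "2020-02-02", "1.0"),
]
totals = list(roll_up(sample))
```

With a real file, the csv.reader object can be passed directly as rows, since roll_up only iterates over it once.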

I'd recommend an object-oriented approach.

import csv

AMOUNT_COL = 20  # placeholder: index of the column to sum; adjust to your data

class Transaction:
    def __init__(self, fields):
        # keep whichever fields you need; csv yields strings,
        # so convert them before analysis
        self.ident = fields[18]
        self.date = fields[19]
        self.amount = float(fields[AMOUNT_COL])

transactions = {}
with open('csv_file.csv', newline='') as f:
    for row in csv.reader(f):
        bucket = tuple(row[18:20])  # (People_ID, Donation Date)
        if bucket in transactions:
            transactions[bucket].append(Transaction(row))
        else:
            transactions[bucket] = [Transaction(row)]

for bucket, items in transactions.items():
    print(bucket, sum(item.amount for item in items))

This defines a Transaction class whose instances hold the fields read from the CSV file. It then creates a dictionary of transactions and reads through the CSV file, adding each new Transaction object to a new bucket (if the given ID and date haven't been seen before) or to an existing bucket (if they have).

Then it iterates over this dictionary and performs a calculation for each bucket, printing the bucket and the result of the calculation.
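
If only the totals are needed, the bucket idea can be reduced to a dictionary keyed by the (People_ID, Donation Date) pair, which is roughly what the question asks about. The sketch below is an illustration under assumptions, not part of the answer: the amount column index, the function names sum_by_key and write_totals, and the three-column output layout are all hypothetical. Unlike the pre-sorted approach, it does not depend on the row order.

```python
import csv
import io
from collections import defaultdict


def sum_by_key(rows, amount_col, key_cols=(18, 19)):
    """Return {(people_id, donation_date): total} for the given rows."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in key_cols)
        totals[key] += float(row[amount_col])  # csv fields are strings
    return dict(totals)


def write_totals(totals, out_file):
    """Write one 'people_id,date,total' line per key."""
    writer = csv.writer(out_file)
    for (people_id, date), total in totals.items():
        writer.writerow([people_id, date, total])


def make_row(people_id, date, amount, amount_col=20):
    """Build a 25-column row of strings for the example (hypothetical layout)."""
    row = [""] * 25
    row[18], row[19], row[amount_col] = people_id, date, amount
    return row


rows = [
    make_row("7", "2020-03-01", "10.0"),
    make_row("9", "2020-02-02", "1.0"),
    make_row("7", "2020-03-01", "5.5"),  # same key as the first row, not adjacent
]
totals = sum_by_key(rows, amount_col=20)

out = io.StringIO()  # stands in for a file opened with newline=''
write_totals(totals, out)
```

The trade-off against the pre-sorted approach is memory: the dictionary holds one entry per distinct ID and date pair, whereas processing sorted runs only keeps one group at a time.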
