Compare multiple CSV files with Python

I am looking to compare multiple CSV files with Python and output a report. The number of CSV files to compare will vary, so I am having it pull a list from a directory. Each CSV has 2 columns: the first being an area code and exchange, the second being a price, e.g.

1201007,0.006
1201032,0.0119
1201040,0.0106
1201200,0.0052
1201201,0.0345

The files will not all contain the same area codes and exchanges, so rather than a line-by-line comparison, I need to use the first field as the key. I then need to generate a report that says: file1 had 200 mismatches to file2, 371 lower prices than file2, and 562 higher prices than file2. I need to generate this to compare each file to each other, so this step would be repeated against file3, file4, ..., and then file2 against file3, etc. I would consider myself a relative noob to Python. Below is the code I have so far, which just grabs the files in the directory and prints prices from all files with a total tally.

import csv
import os

count = 0
#dir containing CSV files
csvdir="tariff_compare"
dirList=os.listdir(csvdir)
#index all files for later use
for idx, fname in enumerate(dirList):
    print fname
    # os.listdir returns bare file names, so join them with the directory
    dic_read = csv.reader(open(os.path.join(csvdir, fname)))
    for row in dic_read:
        key = row[0]
        price = row[1]
        print price
        count += 1
print count

This assumes that all your data can fit in memory; if not, you will have to try loading only some sets of files at a time, or even just two files at a time.

It does the comparison and writes the output to a summary.csv file, one row per pair of files.

import csv
import glob
import os
import itertools

def get_data(fname):
    """
    Load a .csv file
    Returns a dict of {'exchange':float(price)}
    """
    with open(fname, 'rb') as inf:
        # csv.reader already splits each line into a list of fields,
        # so no further split() is needed; blank lines are skipped
        return {row[0]: float(row[1]) for row in csv.reader(inf) if row}

def do_compare(a_name, a_data, b_name, b_data):
    """
    Compare two data files of {'key': float(value)}

    Returns a list of
      - the name of the first file
      - the name of the second file
      - the number of keys in A which are not in B
      - the number of keys in B which are not in A
      - the number of values in A less than the corresponding value in B
      - the number of values in A equal to the corresponding value in B
      - the number of values in A greater than the corresponding value in B
    """
    a_keys = set(a_data)
    b_keys = set(b_data)

    unique_to_a = len(a_keys - b_keys)
    unique_to_b = len(b_keys - a_keys)

    lt,eq,gt = 0,0,0
    pairs = ((a_data[key], b_data[key]) for key in a_keys & b_keys)
    for ai,bi in pairs:
        if ai < bi:
            lt += 1
        elif ai == bi:
            eq += 1
        else:
            gt += 1

    return [a_name, b_name, unique_to_a, unique_to_b, lt, eq, gt]

def main():
    os.chdir('d:/tariff_compare')

    # load data from csv files
    data = {}
    for fname in glob.glob("*.csv"):
        data[fname] = get_data(fname)

    # do comparison
    files = sorted(data)
    with open('summary.csv', 'wb') as outf:
        outcsv = csv.writer(outf)
        outcsv.writerow(["File A", "File B", "Unique to A", "Unique to B", "A<B", "A==B", "A>B"])
        for a,b in itertools.combinations(files, 2):
            outcsv.writerow(do_compare(a, data[a], b, data[b]))

if __name__=="__main__":
    main()

Edit: user1277476 makes a good point; if you pre-sort your files by exchange (or if they are already in sorted order), you could iterate simultaneously through all your files, keeping nothing but the current line for each in memory.

This would let you do a more in-depth comparison for each exchange entry: number of files containing a value, top or bottom N values, etc.
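
As a rough sketch of that streaming idea (written for Python 3, unlike the Python 2 code above): heapq.merge can lazily merge several already-sorted row streams, and itertools.groupby then collects the rows that share an exchange. The helper names read_rows and merged_by_exchange are made up for this illustration. The sketch assumes that every file is sorted by its first column in numeric order and that no file repeats a key.

```python
import csv
import heapq
import itertools


def read_rows(fname):
    """Lazily yield (exchange, price, fname) tuples from one sorted CSV file."""
    with open(fname, newline='') as inf:
        for row in csv.reader(inf):
            if row:  # skip blank lines
                yield int(row[0]), float(row[1]), fname


def merged_by_exchange(fnames):
    """Yield (exchange, {fname: price}) in ascending exchange order.

    Assumption: each file is already sorted by exchange (as integers)
    and contains each exchange at most once.
    """
    streams = [read_rows(fname) for fname in fnames]
    # heapq.merge holds only the current row of each stream in memory
    merged = heapq.merge(*streams)
    for exchange, group in itertools.groupby(merged, key=lambda item: item[0]):
        yield exchange, {fname: price for _, price, fname in group}
```

For each exchange yielded, len(prices) is the number of files that list it, and min(prices.values()) or max(prices.values()) gives the lowest or highest price across files.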

If your files are small, you could do something basic like this:

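import csv
import os

data = dict()
for fname in os.listdir(csvDir):
    # os.listdir returns bare names, so join them with the directory
    with open(os.path.join(csvDir, fname), 'rb') as fin:
        # csv.reader splits each line into [key, value]
        data[fname] = dict((int(key), float(value)) for key, value in csv.reader(fin))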
# All the data is now loaded into your data dictionary
# data -> {'file1.csv': {1201007: 0.006, 1201032: 0.0119, 1201040: 0.0106}, 'file2.csv': ...}

Now everything is readily accessible for you to compare keys and their corresponding values in your data dictionary.
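
As a small illustration of what such a comparison could look like (Python 3 syntax; the function name price_differences and the sample prices for file2.csv are invented for this sketch), dictionary key views support set operations directly:

```python
def price_differences(data, a_name, b_name):
    """Return {key: b_price - a_price} for keys present in both files."""
    a_data, b_data = data[a_name], data[b_name]
    shared = a_data.keys() & b_data.keys()  # key views behave like sets
    return {key: b_data[key] - a_data[key] for key in shared}


# Hypothetical sample data in the shape described above
data = {
    'file1.csv': {1201007: 0.006, 1201032: 0.0119},
    'file2.csv': {1201007: 0.008, 1201040: 0.0106},
}
diffs = price_differences(data, 'file1.csv', 'file2.csv')
only_in_file1 = data['file1.csv'].keys() - data['file2.csv'].keys()
```

A positive value in diffs means the second file is more expensive for that key; only_in_file1 holds the keys that have no counterpart in file2.csv.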

Otherwise, if you have much larger datasets that might not fit in memory, you might want to consider working with just 2 files at a time, with one being stored in memory. You can create a list of filename combinations with itertools.combinations: calling combinations(filenames, 2) yields every unique pair of filenames.
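
One possible shape for that approach, as a sketch in Python 3 (the names load_prices, compare_streaming and compare_all are hypothetical): the first file of each pair is loaded into a dict, and the second is read row by row. It assumes the streamed file does not repeat a key.

```python
import csv
import itertools


def load_prices(fname):
    """Load one CSV file into a {exchange: price} dict."""
    with open(fname, newline='') as inf:
        return {row[0]: float(row[1]) for row in csv.reader(inf) if row}


def compare_streaming(a_data, b_name):
    """Compare the in-memory dict a_data against file b_name, row by row.

    Returns (only_in_a, only_in_b, lower, equal, higher), where lower
    counts keys whose price in A is less than the price in B.
    Assumption: b_name contains each key at most once.
    """
    only_in_b = lower = equal = higher = matched = 0
    with open(b_name, newline='') as inf:
        for row in csv.reader(inf):
            if not row:
                continue
            key, b_price = row[0], float(row[1])
            if key not in a_data:
                only_in_b += 1
                continue
            matched += 1
            a_price = a_data[key]
            if a_price < b_price:
                lower += 1
            elif a_price == b_price:
                equal += 1
            else:
                higher += 1
    return len(a_data) - matched, only_in_b, lower, equal, higher


def compare_all(fnames):
    """Yield (a_name, b_name, counts) for every unordered pair of files."""
    pairs = itertools.combinations(sorted(fnames), 2)
    # combinations() emits all pairs for one first file consecutively,
    # so grouping by the first name loads each "left" file only once
    for a_name, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        a_data = load_prices(a_name)
        for _, b_name in group:
            yield a_name, b_name, compare_streaming(a_data, b_name)
```

Only one file's dict is held in memory at any moment, at the cost of re-reading each file once for every pair it appears in.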

From there you can still optimize further, but that should get you going.

I'd probably sort the files before comparing them. Then use an algorithm similar to the merge step of mergesort to do the comparisons.

You still need to think about what to do with duplicate records. E.g., what if file1 has 1234567,0.1 twice, and so does file2? And what if file1 has 3 of them and file2 has 5, or vice versa?
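
A minimal sketch of such a merge-style comparison, in Python 3, might look like the following. The function name merge_compare is hypothetical. It takes two iterables of (key, price) pairs that are each sorted by key, and it sidesteps the duplicate question by assuming keys are unique within each input, so duplicates would have to be resolved before calling it.

```python
def merge_compare(a_rows, b_rows):
    """Compare two iterables of (key, price) pairs, each sorted by key.

    Returns (only_in_a, only_in_b, lower, equal, higher).
    Assumption: keys are unique within each input.
    """
    only_in_a = only_in_b = lower = equal = higher = 0
    a_iter, b_iter = iter(a_rows), iter(b_rows)
    a, b = next(a_iter, None), next(b_iter, None)
    while a is not None and b is not None:
        if a[0] < b[0]:        # key only in A: advance A
            only_in_a += 1
            a = next(a_iter, None)
        elif a[0] > b[0]:      # key only in B: advance B
            only_in_b += 1
            b = next(b_iter, None)
        else:                  # same key: compare prices, advance both
            if a[1] < b[1]:
                lower += 1
            elif a[1] == b[1]:
                equal += 1
            else:
                higher += 1
            a, b = next(a_iter, None), next(b_iter, None)
    # Whatever is left in either input has no counterpart in the other
    while a is not None:
        only_in_a += 1
        a = next(a_iter, None)
    while b is not None:
        only_in_b += 1
        b = next(b_iter, None)
    return only_in_a, only_in_b, lower, equal, higher
```

Because both inputs are consumed through iterators, they can be generators reading from sorted files, so neither file has to be loaded in full.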

http://en.literateprograms.org/Merge_sort_%28Python%29
http://stromberg.dnsalias.org/~strombrg/sort-comparison/
http://en.wikipedia.org/wiki/Merge_sort
