
Combining sets of CSV files in Python

I have a list of 2000 files that I would like to combine:

01-0628-11A-01D-0356-01_hg19
01-0628-11A-01D-0356-01_hg20
01-0628-11A-01D-0356-01_hg21
01-1372-11A-01D-0356-01_hg16
01-1372-11A-01D-0356-01_hg17
...


I have already collected the files with glob and used regular expressions to reduce each filename to its common identifier (the six-digit code shown below); however, the number of original files varies from identifier to identifier.

01-0628
01-0628
01-0628
01-1372
01-1372
...
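
A minimal sketch of that grouping step, assuming the identifier is always the two-digit, hyphen, four-digit prefix seen in the sample names. The pattern and the helper name `group_by_identifier` are illustrative and not taken from the original script:

```python
import re
from collections import defaultdict

# Assumed pattern: two digits, a hyphen, four digits at the start of the name.
IDENTIFIER = re.compile(r"^(\d{2}-\d{4})")


def group_by_identifier(filenames):
    """Map each common identifier to the list of filenames that share it."""
    groups = defaultdict(list)
    for name in filenames:
        match = IDENTIFIER.match(name)
        if match:  # names that do not fit the pattern are skipped
            groups[match.group(1)].append(name)
    return dict(groups)


names = [
    "01-0628-11A-01D-0356-01_hg19",
    "01-0628-11A-01D-0356-01_hg20",
    "01-0628-11A-01D-0356-01_hg21",
    "01-1372-11A-01D-0356-01_hg16",
    "01-1372-11A-01D-0356-01_hg17",
]
groups = group_by_identifier(names)
print(sorted(groups))  # ['01-0628', '01-1372']
```

Grouping the names up front means the varying number of files per identifier needs no special handling later: each identifier simply maps to a list of whatever length.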

I was originally going to use a reader and open each file that shares a common name, but I was wondering whether there is a more efficient way to do this.

The final output I would like is the following, with all files that share a common identifier combined into one:

01-0628
01-1372
...

All of the files contain similarly formatted data, so simply appending the existing files to a new file would not be an issue.
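
For reference, a plain append could be sketched as follows. This assumes every file starts with the same header row in the same column order, which the question does not state; `append_csvs` is a hypothetical helper name:

```python
import csv


def append_csvs(in_paths, out_path):
    """Concatenate csv files that share a header, writing the header once."""
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        header_written = False
        for path in in_paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:  # empty file, nothing to copy
                    continue
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                # Copy the remaining (data) rows unchanged.
                writer.writerows(reader)
```

If the files have no header row at all, the header handling can be dropped and the rows copied as they are.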

Assuming these csvs have similar or identical fields, this code should work. It uses the DictReader and DictWriter classes of the csv module, which convert csv rows to and from Python dictionaries.

1) It opens and reads the globbed csv files from in_csv_dir into a (filename, rows) dictionary.

2) It groups the csv rows into a (prefix, rows) dictionary based on the filenames and the prefix_length variable.

3) It combines the fields of each prefix grouping and creates a combined csv in out_csv_dir.

4) The combined field names are collected without any guaranteed order, but your csvs may need a specific field order. That order can be entered into field_order. Fields listed there are sorted accordingly, and fields not defined in field_order do not cause a failure; they are simply placed last.

# Import system libraries (written for Python 3)
import glob
import os
import sys
from csv import DictReader, DictWriter

in_csv_dir = os.path.join(".", "csvs")
out_csv_dir = os.path.join(".", "combined_csvs")
# Number of leading filename characters shared by a group
# (e.g. 7 for identifiers such as "01-0628").
prefix_length = 2

field_order = ["NAME", "TITLE", "COMPANY", "LOCATION"]


def field_check(name):
    # Fields listed in field_order sort by position; all others sort last.
    return field_order.index(name) if name in field_order else sys.maxsize


# 1) Read each globbed csv into a (filename, rows) dictionary.
csvs = {}
for glob_filename in glob.glob(os.path.join(in_csv_dir, "*.csv")):
    print("%-11s%s" % ("Opening:", glob_filename))
    with open(glob_filename, newline="") as file_obj:
        cur_record = list(DictReader(file_obj))
    if cur_record:
        filename_ext = os.path.basename(glob_filename)
        filename, ext = os.path.splitext(filename_ext)
        csvs[filename] = cur_record

# 2) Group the rows into a (prefix, rows) dictionary.
csv_groups = {}
for filename, rows in csvs.items():
    csv_groups.setdefault(filename[:prefix_length], []).extend(rows)

# 3) and 4) Write one combined csv per prefix, with ordered fields.
os.makedirs(out_csv_dir, exist_ok=True)
for key, sub_csvs in csv_groups.items():
    # Union of the field names of every row in the group.
    com_keys = sorted(set().union(*sub_csvs), key=field_check)

    filename = os.path.join(out_csv_dir, "%s.csv" % key)
    print("%-11s%s" % ("Combining:", filename))
    with open(filename, "w", newline="") as file_obj:
        temp_csv = DictWriter(file_obj, com_keys)
        temp_csv.writeheader()
        temp_csv.writerows(sub_csvs)
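
To see steps 3 and 4 in isolation, here is a small Python 3 sketch that merges rows with different fields in memory. The sample rows and the `field_rank` helper are made up for illustration. It relies on `DictWriter` filling fields that are missing from a row with its `restval`, which defaults to an empty string:

```python
import io
import sys
from csv import DictWriter

field_order = ["NAME", "TITLE", "COMPANY", "LOCATION"]


def field_rank(name):
    """Position in field_order, or a very large number for unlisted fields."""
    return field_order.index(name) if name in field_order else sys.maxsize


# Hypothetical rows, as DictReader would return them from two different files.
rows = [
    {"TITLE": "Engineer", "NAME": "Ann"},
    {"NAME": "Bob", "EMAIL": "bob@example.com", "LOCATION": "Oslo"},
]

# Union of all field names, with the fields known to field_order first.
fields = sorted(set().union(*rows), key=field_rank)

buffer = io.StringIO()
writer = DictWriter(buffer, fields)
writer.writeheader()
writer.writerows(rows)  # missing fields are written as empty strings
print(buffer.getvalue())
```

Here the header comes out as NAME, TITLE, LOCATION, EMAIL: the three listed fields keep their field_order positions and the unlisted EMAIL column goes last.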

