简体   繁体   中英

Combining Sets of CSV files python

I have a list of 2000 files, which I would like to combine:

01-0628-11A-01D-0356-01_hg19
01-0628-11A-01D-0356-01_hg20
01-0628-11A-01D-0356-01_hg21
01-1372-11A-01D-0356-01_hg16
01-1372-11A-01D-0356-01_hg17
...


I have already gotten the files into a glob, and used regular expressions to rename the files into the common identifier (the six digit code shown below), however there is a varying number of original files for each identifier.

01-0628
01-0628
01-0628
01-1372
01-1372
...

I was originally going to just use a reader and open each file with common names, but I was wondering if there is a more efficient way to do this.

The final output I would like is the following, each of the files with common identifiers combined into one:

01-0628
01-1372
...

All of the files contain similarly formatted data, so simply apending the existing files to a new file would not be an issue.

Assuming these csvs have similar or identical fields. This code should work. It uses the DictReader and DictWriter classes of the csv module which convert csv rows to Python dictionaries.

1) It opens and reads in the globbed csv files, from in_csv_dir, into a (filename, rows) dictionary.

2) It groups the csv rows into a (prefix, rows) dictionary based on filenames and the prefix_length variable.

3) It combines the fields of each prefix grouping and creates a combined csv in the out_csv_dir.

4) As dictionary keys are unordered, your csvs may have specific field orders. This can be entered into field_order. This will sort csv fields but won't fail on fields not defined in field_order.

import os
import sys

# Import System libraries
from csv import DictReader, DictWriter
import glob

in_csv_dir = ".\\csvs"
out_csv_dir = ".\\combined_csvs"
prefix_length = 2

field_order = ["NAME", "TITLE", "COMPANY", "LOCATION"]
field_check = lambda q: field_order.index(q) if(field_order.count(q)) else sys.maxint
csvs = {}
gotten_files = glob.glob(os.path.join(in_csv_dir, "*.csv"))
for glob_filename in gotten_files:
    print "%-11s%s" % ("Opening:", glob_filename)
    file_obj = open(glob_filename, "rb")
    cur_reader = DictReader(file_obj)
    cur_record = [q for q in cur_reader.__iter__()]
    file_obj.close()
    if(cur_record):
        (path, filename_ext) = os.path.split(glob_filename)
        (filename, ext) = os.path.splitext(filename_ext)
        csvs[filename] = cur_record

csv_prefixes = list(set([x[:prefix_length] for x in csvs.keys()]))
csv_groups = dict([(prefix, []) for prefix in csv_prefixes])
map(lambda (key, value): csv_groups[key[:prefix_length]].extend(value), csvs.items())

for (key, sub_csvs) in csv_groups.items():
    com_keys = list(reduce(lambda x, y: x|set(y.keys()), sub_csvs, set([])))
    com_keys.sort(cmp=lambda x, y: field_check(x) - field_check(y))

    filename = os.path.join(out_csv_dir, "%s.csv" % key)
    print "%-11s%s" % ("Combining:", filename)
    file_obj = open(filename, "wb")
    temp_csv = DictWriter(file_obj, com_keys)

    temp_csv.writerow(dict(zip(com_keys, com_keys)))
    map(lambda x: temp_csv.writerow(x), sub_csvs)
    file_obj.close()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM