
Concatenating Multiple Large .CSV Files with Varying Structures

I have approximately 40 .csv files (roughly 100 MB to 600 MB each) that I would like to concatenate into one single .csv file. The data I'm using was produced by a prebuilt R script from FNMA that aggregates larger raw data files into a more manageable format. However, when attempting to concatenate the files, I receive the following error indicating a mismatch in field count between files:

ParserError: Error tokenizing data. C error: Expected 75 fields in line 10, saw 76

I have attempted to essentially inner-join the data vertically to resolve this, since the majority of the fields are homogeneous between files, but I have not found a reasonable solution. I am thinking it may make sense to simply trim the .R code down to write only the fields I know exist, but I would prefer a solution in Python, as this data takes a significant amount of time to process and the prebuilt code is rather hefty. I will include the Python code I have been attempting below (I am happy to share the .R code on request, but it is 600+ lines):

import os
import glob
import pandas as pd
#set working directory
os.chdir("folder directory")

#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)

#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ], ignore_index=True, sort=False)
#export to csv
combined_csv.to_csv( "FNMA Data Aggregated.csv", index=False, encoding='utf-8-sig')
combined_csv
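One thing worth noting: the ParserError above is raised by pd.read_csv while tokenizing a single file whose rows have inconsistent field counts, not by pd.concat (which aligns differing column sets by name on its own). A small loop can identify which files fail to parse so they can be inspected first (a sketch; the helper name and glob pattern are assumptions):

```python
import glob

import pandas as pd

def find_bad_csvs(pattern='*.csv'):
    """Return (filename, error message) for every CSV that fails to parse."""
    bad = []
    for fname in glob.glob(pattern):
        try:
            pd.read_csv(fname)
        except pd.errors.ParserError as e:
            # Ragged rows (wrong field count) land here
            bad.append((fname, str(e)))
    return bad
```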

If I understand your setup...

Let's say you have the following two files, file1.csv:

c1,c2
1,a
2,b
3,c

and file2.csv:

c1,c3
4,iv
5,v
6,vi

Do you expect stacked.csv to look like the following?

c1,c2,c3
1,a,
2,b,
3,c,
4,,iv
5,,v
6,,vi

If so, you need to break this down into three steps:

  1. Read the columns of all your CSVs, and aggregate those columns into a superset of all individual headers
  2. Create stacked.csv and write the superset header
  3. Read each CSV and write its values in their respective column names under the superset header in stacked.csv
import csv

file_names = [
    'file1.csv',
    'file2.csv',
]


def print_debug(e, row, fname, super_list):
    if 'dict contains fields' not in str(e):
        raise

    print(f'''While trying to write row for file {fname}:
{row}

received error: "{e}"

Superset of headers is:
{super_list}
''')


# Aggregate headers
super_set = set()
for fname in file_names:
    with open(fname, newline='') as f:
        reader = csv.DictReader(f)
        super_set.update(set(reader.fieldnames))

# Track which file each row came from
super_set.add('file')
# DictWriter needs fieldnames as a sequence, not a set
# (set order is arbitrary; use sorted(super_set) for a stable column order)
super_list = list(super_set)

with open('stacked.csv', 'w', newline='') as f_out:
    writer = csv.DictWriter(f_out, fieldnames=super_list)
    writer.writeheader()

    for fname in file_names:
        with open(fname, newline='') as f_in:
            reader = csv.DictReader(f_in)
            for row in reader:
                row['file'] = fname
                try:
                    writer.writerow(row)
                except ValueError as e:
                    print_debug(e, row, fname, super_list)

When I run that against my two sample inputs, stacked.csv contains (column order may vary, since it comes from a set):

c1,c2,c3,file
1,a,,file1.csv
2,b,,file1.csv
3,c,,file1.csv
4,,iv,file2.csv
5,,v,file2.csv
6,,vi,file2.csv
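For completeness: once every file parses cleanly on its own, pandas can produce the same column-union stacking, since pd.concat aligns frames by column name and fills missing columns with NaN. This is a sketch, not the original poster's code; the function name and glob pattern are assumptions, and the 'file' column mirrors the csv.DictWriter version above:

```python
import glob

import pandas as pd

def stack_csvs(pattern='*.csv'):
    """Vertically stack all matching CSVs, unioning their columns."""
    frames = []
    for fname in sorted(glob.glob(pattern)):
        df = pd.read_csv(fname)
        df['file'] = fname  # record the source file, as in the csv version
        frames.append(df)
    # sort=False keeps column order as encountered; missing cells become NaN
    return pd.concat(frames, ignore_index=True, sort=False)
```

Note that this loads everything into memory at once, which may matter with 40 files of several hundred MB each; the row-by-row csv.DictWriter approach above streams instead.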
