I have approximately 40 .csv files (each roughly 100 MB to 600 MB) that I would like to concatenate into a single .csv file. The data was produced by a prebuilt R script from FNMA that aggregates larger raw data files into a more manageable format. However, when I attempt to concatenate the files, I receive the following error indicating a mismatch in field count between files: "ParserError: Error tokenizing data. C error: Expected 75 fields in line 10, saw 76". Since the majority of the fields are shared between files, I have tried to essentially inner-join the data vertically, but I have not found a reasonable solution. I could trim down the .R code to write only the fields I know exist, but I would prefer a Python solution, as this data takes a significant amount of time to process and the prebuilt code is rather hefty. I will include the Python code I have been attempting below (I am happy to share the .R code if requested, but it is 600+ lines):
import os
import glob
import pandas as pd
#set working directory
os.chdir("folder directory")
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], ignore_index=True, sort=False)
#export to csv
combined_csv.to_csv("FNMA Data Aggregated.csv", index=False, encoding='utf-8-sig')
combined_csv
If I understand your setup...
Let's say you have the following two files, file1.csv :
c1,c2
1,a
2,b
3,c
and file2.csv :
c1,c3
4,iv
5,v
6,vi
Do you expect stacked.csv to look like the following?
c1,c2,c3
1,a,
2,b,
3,c,
4,,iv
5,,v
6,,vi
If so, you need to break this down into three steps:
import csv

file_names = [
    'file1.csv',
    'file2.csv',
]

def print_debug(e, row, fname, super_list):
    # Only swallow the "extra fields" error; re-raise anything else
    if 'dict contains fields' not in str(e):
        raise
    print(f'''While trying to write row for file {fname}:
{row}
received error: "{e}"
Superset of headers is:
{super_list}
''')
# Aggregate headers across all files
super_set = set()
for fname in file_names:
    with open(fname, newline='') as f:
        reader = csv.DictReader(f)
        super_set.update(set(reader.fieldnames))

# Add a column to track which file each row came from
super_set.add('file')
# DictWriter needs this to be a list, not a set
super_list = list(super_set)

with open('stacked.csv', 'w', newline='') as f_out:
    writer = csv.DictWriter(f_out, fieldnames=super_list)
    writer.writeheader()
    for fname in file_names:
        with open(fname, newline='') as f_in:
            reader = csv.DictReader(f_in)
            for row in reader:
                row['file'] = fname
                try:
                    writer.writerow(row)
                except ValueError as e:
                    print_debug(e, row, fname, super_list)
When I run that against my two sample inputs, I get:
c1 | c2 | c3 | file
---|----|----|----------
1  | a  |    | file1.csv
2  | b  |    | file1.csv
3  | c  |    | file1.csv
4  |    | iv | file2.csv
5  |    | v  | file2.csv
6  |    | vi | file2.csv
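Since you were already working in pandas, the same column-union stacking can also be done there, provided each file parses on its own. This is a sketch under that assumption: it recreates the two sample files from above, and uses `on_bad_lines='warn'` (available in pandas >= 1.3) so ragged rows are reported rather than raising a `ParserError`. If your files are too malformed even for that, the csv-module approach above gives you row-level control.

```python
import pandas as pd

# Recreate the two sample inputs from above
with open('file1.csv', 'w') as f:
    f.write('c1,c2\n1,a\n2,b\n3,c\n')
with open('file2.csv', 'w') as f:
    f.write('c1,c3\n4,iv\n5,v\n6,vi\n')

frames = []
for fname in ['file1.csv', 'file2.csv']:
    # on_bad_lines='warn' (pandas >= 1.3) reports ragged rows
    # instead of aborting with ParserError
    df = pd.read_csv(fname, on_bad_lines='warn')
    df['file'] = fname  # track provenance, as in the csv version
    frames.append(df)

# concat takes the union of columns; cells missing from a file become NaN
combined = pd.concat(frames, ignore_index=True, sort=False)
combined.to_csv('stacked.csv', index=False)
```

Note that the column order here follows order of first appearance across the input frames; sort the columns explicitly if you need a deterministic layout.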