
How to concatenate multiple csv files into one based on column names without having to type every column header in code

I am relatively new to Python (about a week's experience) and I can't seem to find the answer to my problem.

I am trying to merge hundreds of CSV files in my folder Data into a single CSV file, matching the data up by column name.

The solutions I have found require me to type out either each file name or the column headers, which would take days.

I used this code to create one CSV file, but the column names move around, so the data does not stay in the same columns across the whole DataFrame:

import pandas as pd
import glob
import os
def concatenate(indir=r"C:\\Users\ge\Documents\d\de",
                outfile=r"C:\Users\ge\Documents\d"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        print(filename)
        # header=None reads each file's header row as data and
        # numbers the columns 0, 1, ..., so values are matched
        # by position rather than by column name
        df = pd.read_csv(filename, header=None)
        dfList.append(df)
        concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
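
Here is a small made-up example of what I mean (not my real data): with header=None, files that store the same columns in different orders just get stacked by position:

import pandas as pd
from io import StringIO

# two made-up files with the same columns in different orders
file1 = "name,age\nann,3\n"
file2 = "age,name\n4,bob\n"

# header=None numbers the columns 0, 1, ... and keeps each
# header row as data, so concat can only match by position
df1 = pd.read_csv(StringIO(file1), header=None)
df2 = pd.read_csv(StringIO(file2), header=None)
print(pd.concat([df1, df2], axis=0))
#       0     1
# 0  name   age
# 1   ann     3
# 0   age  name
# 1     4   bob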

Is there a quick way to do this, as I have less than a week to run statistics on the dataset?

Any help would be appreciated.

I am not sure if I understand your problem correctly, but this is one way you can merge your files without specifying any column names:

import pandas as pd
import glob
import os


def concatenate(indir):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    # read_csv picks up each file's own header row, and concat
    # then aligns the frames by column name
    output_file = pd.concat([pd.read_csv(filename) for filename in fileList])
    output_file.to_csv("_output.csv", index=False)


concatenate(indir=r"C:\\Users\gerardchurch\Documents\Data\dev_en")
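
As a quick sanity check (with hypothetical column names), pd.concat matches columns by name rather than by position, so files whose columns appear in different orders still line up, and columns missing from a file are filled with NaN:

import pandas as pd

# two frames standing in for two files, same columns, different order
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5], "a": [6]})

# concat aligns on column names, not positions
print(pd.concat([df1, df2], ignore_index=True))
#    a  b
# 0  1  3
# 1  2  4
# 2  6  5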

Here is one memory-efficient way to do that.

from pathlib import Path
import csv

indir = Path(r'C:\\Users\gerardchurch\Documents\Data\dev_en')
outfile = Path(r"C:\\Users\gerardchurch\Documents\Data\output.csv")


def find_header_from_all_files(indir):
    columns = set()
    print("Looking for column names in", indir)
    for f in indir.glob('*.csv'):
        with f.open() as sample_csv:
            sample_reader = csv.DictReader(sample_csv)
            try:
                first_row = next(sample_reader)
            except StopIteration:
                print("File {} doesn't contain any data. Double check this".format(f))
                continue
            else:
                columns.update(first_row.keys())
    return columns


columns = find_header_from_all_files(indir)
print("The columns are:", sorted(columns))

# newline='' avoids csv's doubled line endings on Windows
with outfile.open('w', newline='') as outf:
    # write the columns in the same sorted order printed above
    wr = csv.DictWriter(outf, fieldnames=sorted(columns))
    wr.writeheader()
    for inpath in indir.glob('*.csv'):
        print("Parsing", inpath)
        with inpath.open() as infile:
            reader = csv.DictReader(infile)
            wr.writerows(reader)
print("Done, find the output at", outfile)

This should handle the case where one of the input CSVs doesn't contain all of the columns.
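
That works because csv.DictWriter fills fields missing from a row with its restval value, which defaults to an empty string. A minimal sketch with made-up column names:

import csv
import io

fieldnames = ["a", "b", "c"]  # union of headers from all files

buf = io.StringIO()
wr = csv.DictWriter(buf, fieldnames=fieldnames)  # restval defaults to ''
wr.writeheader()
# a row from a file that only has columns "a" and "c"
wr.writerow({"a": "1", "c": "3"})
print(buf.getvalue())
# a,b,c
# 1,,3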
