
How to concatenate multiple csv files into one based on column names without having to type every column header in code

I am relatively new to Python (about a week's experience) and I can't seem to find the answer to my problem.

I am trying to merge hundreds of CSV files in my folder Data into a single CSV file, matching the data up by column name.

The solutions I have found require me to type out either each file name or the column headers, which would take days.

I used this code to create one CSV file, but the column names move around, so the data does not stay in the same columns across the whole DataFrame:

import pandas as pd
import glob
import os
def concatenate(indir=r"C:\\Users\ge\Documents\d\de",
                outfile=r"C:\Users\ge\Documents\d"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        print(filename)
        # header=None reads each file's header row as data and
        # numbers the columns 0, 1, ..., so values are matched
        # by position rather than by column name
        df = pd.read_csv(filename, header=None)
        dfList.append(df)
        concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
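
Here is a small made-up example of what I mean (not my real data): with header=None, files that store the same columns in different orders just get stacked by position:

import pandas as pd
from io import StringIO

# two made-up files with the same columns in different orders
file1 = "name,age\nann,3\n"
file2 = "age,name\n4,bob\n"

# header=None numbers the columns 0, 1, ... and keeps each
# header row as data, so concat can only match by position
df1 = pd.read_csv(StringIO(file1), header=None)
df2 = pd.read_csv(StringIO(file2), header=None)
print(pd.concat([df1, df2], axis=0))
#       0     1
# 0  name   age
# 1   ann     3
# 0   age  name
# 1     4   bob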

Is there a quick way to do this, as I have less than a week to run statistics on the dataset?

Any help would be appreciated.

I am not sure if I understand your problem correctly, but this is one way you can merge your files without specifying any column names:

import pandas as pd
import glob
import os


def concatenate(indir):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    # read_csv picks up each file's own header row, and concat
    # then aligns the frames by column name
    output_file = pd.concat([pd.read_csv(filename) for filename in fileList])
    output_file.to_csv("_output.csv", index=False)


concatenate(indir=r"C:\\Users\gerardchurch\Documents\Data\dev_en")
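
As a quick sanity check (with hypothetical column names), pd.concat matches columns by name rather than by position, so files whose columns appear in different orders still line up, and columns missing from a file are filled with NaN:

import pandas as pd

# two frames standing in for two files, same columns, different order
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5], "a": [6]})

# concat aligns on column names, not positions
print(pd.concat([df1, df2], ignore_index=True))
#    a  b
# 0  1  3
# 1  2  4
# 2  6  5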

Here is one memory-efficient way to do that.

from pathlib import Path
import csv

indir = Path(r'C:\\Users\gerardchurch\Documents\Data\dev_en')
outfile = Path(r"C:\\Users\gerardchurch\Documents\Data\output.csv")


def find_header_from_all_files(indir):
    columns = set()
    print("Looking for column names in", indir)
    for f in indir.glob('*.csv'):
        with f.open() as sample_csv:
            sample_reader = csv.DictReader(sample_csv)
            try:
                first_row = next(sample_reader)
            except StopIteration:
                print("File {} doesn't contain any data. Double check this".format(f))
                continue
            else:
                columns.update(first_row.keys())
    return columns


columns = find_header_from_all_files(indir)
print("The columns are:", sorted(columns))

# newline='' avoids csv's doubled line endings on Windows
with outfile.open('w', newline='') as outf:
    # write the columns in the same sorted order printed above
    wr = csv.DictWriter(outf, fieldnames=sorted(columns))
    wr.writeheader()
    for inpath in indir.glob('*.csv'):
        print("Parsing", inpath)
        with inpath.open() as infile:
            reader = csv.DictReader(infile)
            wr.writerows(reader)
print("Done, find the output at", outfile)

This should handle the case where one of the input CSVs doesn't contain all of the columns.
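
That works because csv.DictWriter fills fields missing from a row with its restval value, which defaults to an empty string. A minimal sketch with made-up column names:

import csv
import io

fieldnames = ["a", "b", "c"]  # union of headers from all files

buf = io.StringIO()
wr = csv.DictWriter(buf, fieldnames=fieldnames)  # restval defaults to ''
wr.writeheader()
# a row from a file that only has columns "a" and "c"
wr.writerow({"a": "1", "c": "3"})
print(buf.getvalue())
# a,b,c
# 1,,3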
