How to concatenate multiple CSV files into one based on column names without having to type every column header in code
I am relatively new to Python (about a week's experience) and I can't seem to find the answer to my problem.
I am trying to merge hundreds of CSV files in my Data folder into a single CSV file, matching on column name. The solutions I have found require me to type out either each file name or the column headers, which would take days.
I used this code to create one CSV file, but the column names move around, so the data does not stay in the same columns across the whole DataFrame:
import pandas as pd
import glob
import os

def concatenate(indir=r"C:\\Users\ge\Documents\d\de",
                outfile=r"C:\Users\ge\Documents\d"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=None)
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
Is there a quick-fire method to do this, as I have less than a week to run statistics on the dataset?
Any help would be appreciated.
I am not sure if I understand your problem correctly, but this is one of the ways you can merge your files without giving any column names:
import pandas as pd
import glob
import os

def concatenate(indir):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    output_file = pd.concat([pd.read_csv(filename) for filename in fileList])
    output_file.to_csv("_output.csv", index=False)

concatenate(indir=r"C:\\Users\gerardchurch\Documents\Data\dev_en")
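This works because pd.concat aligns frames on their column names rather than on position, so it doesn't matter if the headers appear in different orders across files (which is also why dropping `header=None` matters). A minimal sketch with two hypothetical in-memory frames standing in for two CSV files:

```python
import pandas as pd

# Two made-up frames whose columns appear in different orders,
# as if read from two CSV files with shuffled headers.
df_a = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})
df_b = pd.DataFrame({"score": [3], "name": ["c"]})

# concat matches columns by name, not by position, so the values
# line up even though df_b lists "score" first.
combined = pd.concat([df_a, df_b], ignore_index=True)
print(list(combined.columns))   # → ['name', 'score']
print(combined["score"].tolist())  # → [1, 2, 3]
```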
Here is one memory-efficient way to do it.
from pathlib import Path
import csv

indir = Path(r'C:\\Users\gerardchurch\Documents\Data\dev_en')
outfile = Path(r"C:\\Users\gerardchurch\Documents\Data\output.csv")

def find_header_from_all_files(indir):
    columns = set()
    print("Looking for column names in", indir)
    for f in indir.glob('*.csv'):
        with f.open() as sample_csv:
            sample_reader = csv.DictReader(sample_csv)
            try:
                first_row = next(sample_reader)
            except StopIteration:
                print("File {} doesn't contain any data. Double check this".format(f))
                continue
            else:
                columns.update(first_row.keys())
    return columns

columns = find_header_from_all_files(indir)
print("The columns are:", sorted(columns))

with outfile.open('w') as outf:
    wr = csv.DictWriter(outf, fieldnames=list(columns))
    wr.writeheader()
    for inpath in indir.glob('*.csv'):
        print("Parsing", inpath)
        with inpath.open() as infile:
            reader = csv.DictReader(infile)
            wr.writerows(reader)

print("Done, find the output at", outfile)
This should handle the case when one of the input CSVs doesn't contain all of the columns.
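That tolerance comes from csv.DictWriter itself: any field named in `fieldnames` but absent from a row is written as `restval`, which defaults to the empty string. A small self-contained sketch with made-up rows (written to an in-memory buffer rather than a file):

```python
import csv
import io

# Rows as a DictReader would yield them; the second row is missing
# the "age" field (hypothetical data for illustration).
rows = [{"name": "a", "age": "30"}, {"name": "b"}]

buf = io.StringIO()
wr = csv.DictWriter(buf, fieldnames=["name", "age"])  # union of all headers
wr.writeheader()
wr.writerows(rows)  # missing "age" is filled with restval, '' by default

print(buf.getvalue().splitlines())  # → ['name,age', 'a,30', 'b,']
```

Note the converse case raises: a row containing a key not in `fieldnames` triggers a ValueError unless `extrasaction='ignore'` is passed, which is why collecting the header union up front matters.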