How to concatenate multiple CSV files into one based on column names without having to type every column header in code
I am relatively new to Python (about a week's experience) and I can't seem to find the answer to my problem.
I am trying to merge hundreds of CSV files in my Data folder into a single CSV file, matching on column name. The solutions I have found require me to type out either each file name or the column headers, which would take days.
I used this code to create one CSV file, but the column names move around, so the data does not stay in the same columns across the whole DataFrame:
import pandas as pd
import glob
import os

def concatenate(indir=r"C:\\Users\ge\Documents\d\de",
                outfile=r"C:\Users\ge\Documents\d"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=None)
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
Is there a quick-fire method to do this, as I have less than a week to run statistics on the dataset?
Any help would be appreciated.
I am not sure if I understand your problem correctly, but this is one of the ways you can merge your files without giving any column names:
import pandas as pd
import glob
import os

def concatenate(indir):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    output_file = pd.concat([pd.read_csv(filename) for filename in fileList])
    output_file.to_csv("_output.csv", index=False)

concatenate(indir=r"C:\\Users\gerardchurch\Documents\Data\dev_en")
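This works because pd.concat aligns frames on their column names rather than on position, so it doesn't matter if the headers appear in different orders across files (which is also why dropping `header=None` matters). A minimal sketch with two hypothetical in-memory frames standing in for two CSV files:

```python
import pandas as pd

# Two made-up frames whose columns appear in different orders,
# as if read from two CSV files with shuffled headers.
df_a = pd.DataFrame({"name": ["a", "b"], "score": [1, 2]})
df_b = pd.DataFrame({"score": [3], "name": ["c"]})

# concat matches columns by name, not by position, so the values
# line up even though df_b lists "score" first.
combined = pd.concat([df_a, df_b], ignore_index=True)
print(list(combined.columns))   # → ['name', 'score']
print(combined["score"].tolist())  # → [1, 2, 3]
```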
Here is one memory-efficient way to do it.
from pathlib import Path
import csv

indir = Path(r'C:\\Users\gerardchurch\Documents\Data\dev_en')
outfile = Path(r"C:\\Users\gerardchurch\Documents\Data\output.csv")

def find_header_from_all_files(indir):
    columns = set()
    print("Looking for column names in", indir)
    for f in indir.glob('*.csv'):
        with f.open() as sample_csv:
            sample_reader = csv.DictReader(sample_csv)
            try:
                first_row = next(sample_reader)
            except StopIteration:
                print("File {} doesn't contain any data. Double check this".format(f))
                continue
            else:
                columns.update(first_row.keys())
    return columns

columns = find_header_from_all_files(indir)
print("The columns are:", sorted(columns))

with outfile.open('w') as outf:
    wr = csv.DictWriter(outf, fieldnames=list(columns))
    wr.writeheader()
    for inpath in indir.glob('*.csv'):
        print("Parsing", inpath)
        with inpath.open() as infile:
            reader = csv.DictReader(infile)
            wr.writerows(reader)

print("Done, find the output at", outfile)
This should handle the case when one of the input CSVs doesn't contain all of the columns.
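That tolerance comes from csv.DictWriter itself: any field named in `fieldnames` but absent from a row is written as `restval`, which defaults to the empty string. A small self-contained sketch with made-up rows (written to an in-memory buffer rather than a file):

```python
import csv
import io

# Rows as a DictReader would yield them; the second row is missing
# the "age" field (hypothetical data for illustration).
rows = [{"name": "a", "age": "30"}, {"name": "b"}]

buf = io.StringIO()
wr = csv.DictWriter(buf, fieldnames=["name", "age"])  # union of all headers
wr.writeheader()
wr.writerows(rows)  # missing "age" is filled with restval, '' by default

print(buf.getvalue().splitlines())  # → ['name,age', 'a,30', 'b,']
```

Note the converse case raises: a row containing a key not in `fieldnames` triggers a ValueError unless `extrasaction='ignore'` is passed, which is why collecting the header union up front matters.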