
Concatenating multiple csv files into a single csv with the same header - Python

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).

# import csv files from folder
import glob

import pandas as pd

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)  # re-concatenates everything read so far, every iteration
    print(file_ + " has been imported.")

This code works fine, but it is slow. It can take up to 2 days to process.

I was given a single-line script for the Terminal command line that does the same (but with no headers). This script takes 20 seconds.

 for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 

Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing the files into a DataFrame and just concatenating the CSVs, but I cannot figure it out.

Thanks.

If you don't need the CSV in memory and are just copying from input to output, it's a lot cheaper to avoid parsing entirely and copy the bytes without building anything up in memory:

import shutil
import glob


#import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

That's it; shutil.copyfileobj handles efficiently copying the data, dramatically reducing the Python-level work needed to parse and reserialize.

This assumes all the CSV files have the same format, encoding, line endings, etc., and that the header doesn't contain embedded newlines; if that's the case, it's a lot faster than the alternatives.

Are you required to do this in Python? If you are open to doing this entirely in shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:

head -n1 a-randomly-selected-csv-file.csv > merged.csv
for f in *.csv; do [ "$f" = merged.csv ] && continue; tail -n +2 "$f" >> merged.csv; done

You don't need pandas for this; the simple csv module would work fine.

import csv
import glob

allFiles = sorted(glob.glob(r'data/US/market/merged_data/*.csv'))

df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)  # reader.next() is Python 2 only
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.

Here's a simpler approach - you can use pandas (though I am not sure how it will help with RAM usage) -

import glob

import pandas as pd

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")

list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_)
    list_.append(df)

# Concatenate once after the loop; concatenating inside the loop
# re-copies all accumulated rows on every iteration.
stockstats_data = pd.concat(list_, axis=0, ignore_index=True)

