Concatenating multiple csv files into a single csv with the same header - Python
I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).
import glob
import pandas as pd

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")
This code works fine, but it is slow. It can take up to 2 days to process.
I was given a single-line script for the Terminal command line that does the same (but with no headers). This script takes 20 seconds.
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.
Thanks.
If you don't need the CSV in memory, just copying from input to output, it'll be a lot cheaper to avoid parsing at all, and copy without building up in memory:
import shutil
import glob

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
That's it; shutil.copyfileobj handles efficiently copying the data, dramatically reducing the Python-level work to parse and reserialize.
This assumes all the CSV files have the same format, encoding, line endings, etc., and the header doesn't contain embedded newlines, but if that's the case, it's a lot faster than the alternatives.
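Since the approach leans on that assumption, one defensive variant is to compare each file's header against the first one before blindly copying. This is a minimal sketch, not part of the original answer; the merge_csvs name is made up for illustration:

```python
import shutil

def merge_csvs(paths, out_path):
    """Concatenate CSV files byte-for-byte, keeping one header and
    raising if any file's header differs from the first file's."""
    expected_header = None
    with open(out_path, 'wb') as outfile:
        for fname in paths:
            with open(fname, 'rb') as infile:
                header = infile.readline()
                if expected_header is None:
                    expected_header = header
                    outfile.write(header)  # keep the header from the first file only
                elif header != expected_header:
                    raise ValueError(f"{fname}: header differs from the first file")
                # Block-copy the remaining rows without parsing them
                shutil.copyfileobj(infile, outfile)
```

You would call it with a sorted glob result, e.g. `merge_csvs(sorted(glob.glob(path + "/*.csv")), 'merged.csv')`. The header check costs one extra comparison per file, which is negligible next to the copy itself.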
Are you required to do this in Python? If you are open to doing this entirely in shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:
cat a-randomly-selected-csv-file.csv | head -n1 > merged.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done
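One caveat with the loop as written: once merged.csv exists, the `*.csv` glob matches it too, so re-running the one-liner would append the output file to itself. A sketch of a safer variant that skips the output file (the merge_csvs function name is made up for illustration; the same single-line-header assumption as above applies):

```shell
# merge_csvs OUTFILE: concatenate every .csv in the current directory
# into OUTFILE, keeping the header from the first input file only.
merge_csvs() {
    out=$1
    first=1
    for f in *.csv; do
        [ "$f" = "$out" ] && continue      # never read the output file itself
        if [ "$first" -eq 1 ]; then
            cat "$f" > "$out"              # first file: keep its header
            first=0
        else
            tail -n +2 "$f" >> "$out"      # later files: drop the header line
        fi
    done
}
```

Because the output is truncated (`>`) on the first input, running it twice produces the same result instead of doubling the file.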
You don't need pandas for this; the simple csv module would work fine.
import csv
import glob

allFiles = glob.glob(r'data/US/market/merged_data/*.csv')

df_out_filename = 'df_out.csv'
write_headers = True
with open(df_out_filename, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.
Here's a simpler approach - you can use pandas (though I am not sure how it will help with RAM usage):
import pandas as pd
import glob

path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")

list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_)
    list_.append(df)
stockstats_data = pd.concat(list_, axis=0, ignore_index=True)
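If RAM is the real concern, pandas can also stream each input to the output in chunks, so only one chunk is ever resident. This is a sketch under assumptions, not part of the answer above; the merge_csvs_chunked name and the chunksize value are illustrative:

```python
import pandas as pd

def merge_csvs_chunked(paths, out_path, chunksize=100_000):
    """Append each input CSV to out_path chunk by chunk.
    The header is written once, with the very first chunk."""
    first = True
    for path in paths:
        for chunk in pd.read_csv(path, chunksize=chunksize):
            # 'w' truncates on the first chunk, 'a' appends thereafter
            chunk.to_csv(out_path, mode='w' if first else 'a',
                         header=first, index=False)
            first = False
```

Note this still round-trips every value through the pandas parser and serializer, so it trades the memory footprint of the single-concat version for more CPU work; the shutil.copyfileobj answer avoids both.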