Concatenating multiple CSV files into a DataFrame and outputting to Master CSV
I'm looking for someone to help me with the script below. I'm trying to concatenate a month's worth of CSV files into a 'master file'. The files are really big, so I was hoping to do a few things in the script to shorten them. Here is what I'm having trouble with:
Here's my script so far. I'm using the file's timestamp to find the correct month ('201510' = October):
import csv
import os
import sys
import pandas as pd

Source = r'F:\backup\finalized 2'
Output = r'F:\Tom\Python'

for root, dirs, files in os.walk(os.path.normpath(Source), topdown=False):
    for name in files:
        if name.startswith('201510') and name.endswith('client.csv'):
            print("Found", name)
            SourceFolder = os.path.join(root, name)
            with open(SourceFolder, 'r') as f:
                # skip the header row of each file
                next(f)
                for line in csv.reader(f, delimiter=','):
                    with open('output1.csv', 'a', newline='') as fout:
                        wr = csv.writer(fout)
                        wr.writerow(line)
Here are the calculations I would like to add to the end of the dataframe/CSV:
df['ten_avg'] = df.iloc[:, 30:50].sum(axis=1).astype('int64') / 20
df['twenty_avg'] = df.iloc[:, 30:70].sum(axis=1).astype('int64') / 40
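To make the arithmetic concrete, here is a minimal sketch on a toy frame. The column names, the slice positions, and the names small_avg and large_avg are made up for illustration; the real data would use the 30:50 and 30:70 slices from the lines above. It relies on the fact that a row-wise sum over N columns divided by N is the row-wise mean of those columns.

```python
import pandas as pd

# Toy frame with 6 numeric columns standing in for the real, much wider CSV data.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6],
     [10, 20, 30, 40, 50, 60]],
    columns=list("abcdef"),
)

# Average over two columns (positions 2 and 3): row-wise sum divided by the slice width.
width_small = 2
df["small_avg"] = df.iloc[:, 2:2 + width_small].sum(axis=1) / width_small

# Average over four columns (positions 2 through 5).
# The column added above sits at position 6, so it is not part of this slice.
width_large = 4
df["large_avg"] = df.iloc[:, 2:2 + width_large].sum(axis=1) / width_large

# The same result via mean(), which avoids hard-coding the divisor.
small_avg_via_mean = df.iloc[:, 2:4].mean(axis=1)
```

Deriving the divisor from the slice width (or using mean) keeps the two in step if the column range changes later.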
I think you can do all of the processing with pandas alone.
Every file needs a header row, because the CSVs are concatenated into one big file by their column names.
I think it is better to define the path and name of the output file together: OutputCSV = r'F:\Tom\Python\output.csv'.
The best way to read a CSV is to read only the columns that the next processing step actually needs. You can use the function read_csv with the parameter usecols. It filters the columns and needs the column names. You can get them by reading one file that contains the header (all of its data rows can be deleted). The list of column names is then processed: the item at index 3 is deleted (the first item has index 0), the list is sliced with [2:89], and the resulting variable cols is used for reading all the CSVs.
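The column-filter step can be sketched on a small in-memory CSV. The CSV text and its column names are hypothetical, and the slice is shortened to [2:5] to fit the toy data. Passing nrows=0 to read_csv is one way to read only the header row, so the header file does not have to be emptied by hand.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for one of the real files.
csv_text = "id,stamp,a,skip,b,c,extra\n1,x,10,0,20,30,9\n2,y,11,0,21,31,9\n"

# nrows=0 parses only the header row, so no data rows are loaded.
header = pd.read_csv(io.StringIO(csv_text), nrows=0)
cols = header.columns.tolist()

# Delete the item at index 3 ('skip'), then keep a contiguous slice of what remains.
del cols[3]
cols = cols[2:5]

# Read only the selected columns from the file.
df = pd.read_csv(io.StringIO(csv_text), usecols=cols)
```

Because the deletion happens before the slice, every item after index 3 shifts down by one position, which is worth keeping in mind when choosing the slice bounds.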
All the files are found in a loop. For each one, read_csv with usecols=cols returns a DataFrame, which is appended to a list of DataFrames. This list is then concatenated into one big output DataFrame, df.
After processing, the output df is written to a file with the function to_csv.
import pandas as pd
import os

Source = r'F:\backup\finalized 2'
OutputCSV = r'F:\Tom\Python\output.csv'
normSource = os.path.normpath(Source)

# find the column names to keep
# read one csv that contains the header
names = pd.read_csv(os.path.join(normSource, 'header.csv'), sep=",")
# column names to list
cols = names.columns.tolist()
print(cols)
# the first item has index 0; the two steps below keep the original
# items 2, 4, 5, ..., 89 and drop items 0, 1, 3, 90, 91, ...
# delete the item at index 3
del cols[3]
# slice what remains
cols = cols[2:89]
print(cols)

dfs = []
for root, dirs, files in os.walk(normSource, topdown=False):
    for name in files:
        print(root)
        print(name)
        if name.startswith('201510') and name.endswith('client.csv'):
            # only read columns from list cols
            dfs.append(pd.read_csv(os.path.join(root, name), sep=',', index_col=False, usecols=cols))

# all files in one dataframe
df = pd.concat(dfs, ignore_index=True)
print(df.head())
df['ten_avg'] = df.iloc[:, 30:50].sum(axis=1).astype('int64') / 20
df['twenty_avg'] = df.iloc[:, 30:70].sum(axis=1).astype('int64') / 40
print(df.head())
# output to csv, remove index
df.to_csv(OutputCSV, sep=",", index=False)
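For trying the approach without touching the real folders, the walk-filter-concatenate steps can be wrapped in a function and run against throwaway files. This is only a sketch: the function name collect_month, the file names, and the two-column data are invented for the demonstration.

```python
import os
import tempfile
import pandas as pd


def collect_month(source_dir, prefix, suffix, usecols=None):
    """Read every matching CSV under source_dir and return one DataFrame."""
    frames = []
    for root, _dirs, files in os.walk(source_dir):
        # Sort the names so the row order of the result is predictable.
        for name in sorted(files):
            if name.startswith(prefix) and name.endswith(suffix):
                frames.append(pd.read_csv(os.path.join(root, name), usecols=usecols))
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame(columns=usecols)
    return pd.concat(frames, ignore_index=True)


# Small demonstration with temporary files (all names are made up).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(
        os.path.join(tmp, "20151001_client.csv"), index=False)
    pd.DataFrame({"a": [5], "b": [6]}).to_csv(
        os.path.join(tmp, "20151002_client.csv"), index=False)
    # This file belongs to a different month and should be skipped.
    pd.DataFrame({"a": [9], "b": [9]}).to_csv(
        os.path.join(tmp, "20150930_client.csv"), index=False)

    master = collect_month(tmp, "201510", "client.csv", usecols=["a"])
    out_path = os.path.join(tmp, "output.csv")
    master.to_csv(out_path, index=False)
    written = pd.read_csv(out_path)
```

Guarding the empty case matters here because a wrong prefix would otherwise surface as an error from pd.concat rather than as an obviously empty result.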
A pandas DataFrame should come in handy here.
Import each file into a DataFrame using:
import pandas as pd
df = pd.read_csv('<csvfilename>', index_col=False, parse_dates=False)
And then append it to the master DataFrame with:
master = pd.concat([master, df], ignore_index=True)
Perform your operations on the DataFrame and then export using:
master.to_csv('<csv_file_name>')