
Concatenating multiple CSV files into a DataFrame and outputting to a master CSV

I'm looking for someone to help me with the script below. I'm trying to concatenate a month's worth of CSV files into a 'master file'. The files are really big, so I was hoping to do a few things in the script to shorten them. Here is what I'm having trouble with:

  1. The files are different, but the headers are the same. I'm not sure how to keep the header from only the first file; I used next(f) to skip it in the others.
  2. How can I add the 'Output' directory as the target folder for output1.csv? (A sketch addressing points 1 and 2 follows this list.)
  3. Lastly, I've been trying to work with pandas. How can I use it to delete columns 1, 2, 4 and everything after column 90? I would also like to know how to make this a DataFrame before writing it to CSV, since I'd like to add a few calculations to the end of the output file before writing it.
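
For points 1 and 2, here is a minimal sketch (reusing the Source and Output paths and the 201510...client.csv filename filter from the script below) that writes the header only once and sends output1.csv to the Output directory:

import csv
import os

Source = r'F:\backup\finalized 2'
Output = r'F:\Tom\Python'

out_path = os.path.join(Output, 'output1.csv')  # point 2: target the Output directory
header_written = False

with open(out_path, 'w', newline='') as fout:
    wr = csv.writer(fout)
    for root, dirs, files in os.walk(os.path.normpath(Source)):
        for name in files:
            if name.startswith('201510') and name.endswith('client.csv'):
                with open(os.path.join(root, name), 'r', newline='') as f:
                    reader = csv.reader(f)
                    header = next(reader, None)
                    if header is None:
                        continue  # skip empty files
                    if not header_written:  # point 1: keep the header from the first file only
                        wr.writerow(header)
                        header_written = True
                    wr.writerows(reader)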

Here's my script so far. I'm using the file's timestamp to find the correct month ('201510' = October):

import csv
import os
import sys
import pandas as pd

Source = r'F:\backup\finalized 2'
Output = r'F:\Tom\Python'

for root, dirs, files in os.walk(os.path.normpath(Source), topdown=False):
    for name in files:
        if name.startswith('201510') and name.endswith('client.csv'):
            print("Found", name)
            SourceFolder = os.path.join(root, name)
            with open(SourceFolder, 'r') as f:
                next(f)  # skip the header row
                for line in csv.reader(f, delimiter=','):
                    with open('output1.csv', 'a', newline='') as fout:
                        wr = csv.writer(fout)
                        wr.writerow(line)

Here are the calculations I would like to add to the end of the DataFrame/CSV:

df['ten_avg'] = df.iloc[:, 30:50].sum(axis=1).astype('int64') / 20      
df['twenty_avg'] = df.iloc[:, 30:70].sum(axis=1).astype('int64') / 40

I think you can do all of the processing with pandas alone.

You need the column headers, because the CSVs are concatenated into one big file by their headers.

I think it is better to define the path and name of the output file together: OutputCSV = r'F:\Tom\Python\output.csv'.

The best way to read the CSVs is to read only the columns you actually need for further processing. You can use read_csv with the usecols parameter, which filters columns by name. You can get the names by reading one file that has the header (all of its data rows can be deleted). The list of column names is then processed: delete the item at index 3 (the first item has index 0), slice the list with [2:89], and then use the resulting variable cols when reading all the CSVs.
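
If you don't want to maintain a separate header-only file, a minimal sketch is to read zero data rows from any one of the monthly files (sample.csv here is a hypothetical stand-in) and take its columns:

import pandas as pd

# Hypothetical sample file; any one of the monthly CSVs with the shared header works.
sample = r'F:\backup\finalized 2\sample.csv'

# nrows=0 reads only the header row, so no data is loaded.
cols = pd.read_csv(sample, nrows=0).columns.tolist()

del cols[3]        # drop original column 4 (index 3)
cols = cols[2:89]  # keep original columns 3 and 5..90
print(cols)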

In the loop you find all matching files, get a DataFrame from read_csv with usecols=cols, and append it to a list of DataFrames. That list is then concatenated into one big output DataFrame df.

After processing, the output df is written to a file with to_csv.

import pandas as pd
import os

Source = r'F:\backup\finalized 2'
OutputCSV = r'F:\Tom\Python\output.csv'

normSource = os.path.normpath(Source)

#find the column names, then drop columns 1, 2, 4 and everything after column 90
#read one csv that contains the shared header
names = pd.read_csv(os.path.join(normSource, 'header.csv'), sep=",")
#column names to a list
cols = names.columns.tolist()
print(cols)
#the first item has index 0, so you need to delete items 0, 1, 3, and 90 onward

#delete item 3 (original column 4)
del cols[3]

#keep items 2, 4, 5, ..., 89 (original columns 3 and 5..90)
cols = cols[2:89]
print(cols)

dfs = []

for root, dirs, files in os.walk(normSource, topdown=False):
    for name in files:
        print(root)
        print(name)
        if name.startswith('201510') and name.endswith('client.csv'):
            #only read the columns in the list cols
            dfs.append(pd.read_csv(os.path.join(root, name), sep=',', index_col=False, usecols=cols))

#concatenate once, after the loop - all files in one dataframe
df = pd.concat(dfs, ignore_index=True)
print(df.head())

df['ten_avg'] = df.iloc[:, 30:50].sum(axis=1).astype('int64') / 20
df['twenty_avg'] = df.iloc[:, 30:70].sum(axis=1).astype('int64') / 40
print(df.head())

#output to csv, dropping the index
df.to_csv(OutputCSV, sep=",", index=False)

A pandas DataFrame should come in handy here.

Import each file into a DataFrame using:

import pandas as pd
df = pd.read_csv('<csvfilename>', index_col=False, parse_dates=False)

And then append it to the master DataFrame with:

master = pd.concat([master, df], ignore_index=True)

Perform your operations on the DataFrame and then export it using:

master.to_csv('<csv_file_name>')
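
Putting the pieces together, here is a minimal sketch of this approach (the glob pattern and the master_2015_10.csv output name are hypothetical stand-ins; adjust them to your layout):

import glob
import pandas as pd

# Recursively match the October files under the source tree.
paths = glob.glob(r'F:\backup\finalized 2\**\201510*client.csv', recursive=True)

frames = [pd.read_csv(p, index_col=False, parse_dates=False) for p in paths]
master = pd.concat(frames, ignore_index=True)

# Perform your operations here, then export without the index.
master.to_csv(r'F:\Tom\Python\master_2015_10.csv', index=False)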
