
Concatenating multiple CSV files into a DataFrame and outputting to a master CSV

I'm looking for someone to help me with the script below. I'm trying to concatenate a month's worth of CSV files into a 'master file'. The files are really big, so I was hoping to do a few things in the script to shorten them. Here is what I'm having trouble with:

  1. The files are different, but the headers are the same. I'm not sure how to keep the header from only the first file; I used next(f) to skip it in the others.
  2. How can I add the 'Output' directory as the target folder for output1.csv? (A sketch addressing points 1 and 2 follows this list.)
  3. Lastly, I've been trying to work with pandas. How can I use it to delete columns 1, 2, 4 and everything after column 90? I would also like to know how to make this a DataFrame before writing it to CSV, since I'd like to add a few calculations to the end of the output file before writing it.
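
For points 1 and 2, here is a minimal sketch (reusing the Source and Output paths and the 201510...client.csv filename filter from the script below) that writes the header only once and sends output1.csv to the Output directory:

import csv
import os

Source = r'F:\backup\finalized 2'
Output = r'F:\Tom\Python'

out_path = os.path.join(Output, 'output1.csv')  # point 2: target the Output directory
header_written = False

with open(out_path, 'w', newline='') as fout:
    wr = csv.writer(fout)
    for root, dirs, files in os.walk(os.path.normpath(Source)):
        for name in files:
            if name.startswith('201510') and name.endswith('client.csv'):
                with open(os.path.join(root, name), 'r', newline='') as f:
                    reader = csv.reader(f)
                    header = next(reader, None)
                    if header is None:
                        continue  # skip empty files
                    if not header_written:  # point 1: keep the header from the first file only
                        wr.writerow(header)
                        header_written = True
                    wr.writerows(reader)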

Here's my script so far. I'm using the file's timestamp to find the correct month ('201510' = October):

import csv
import os
import sys
import pandas as pd

Source = r'F:\backup\finalized 2'
Output = r'F:\Tom\Python'

for root, dirs, files in os.walk(os.path.normpath(Source), topdown=False):
    for name in files:
        if name.startswith('201510') and name.endswith('client.csv'):
            print("Found", name)
            SourceFolder = os.path.join(root, name)
            with open(SourceFolder, 'r') as f:
                next(f)  # skip the header row
                for line in csv.reader(f, delimiter=','):
                    with open('output1.csv', 'a', newline='') as fout:
                        wr = csv.writer(fout)
                        wr.writerow(line)

Here are the calculations I would like to add to the end of the DataFrame/CSV:

df['ten_avg'] = df.iloc[:, 30:50].sum(axis=1).astype('int64') / 20      
df['twenty_avg'] = df.iloc[:, 30:70].sum(axis=1).astype('int64') / 40

I think you can do all of the processing with pandas alone.

You need the column headers, because the CSVs are concatenated into one big file by their headers.

I think it is better to define the path and name of the output file together: OutputCSV = r'F:\Tom\Python\output.csv'.

The best way to read the CSVs is to read only the columns you actually need for further processing. You can use read_csv with the usecols parameter, which filters columns by name. You can get the names by reading one file that has the header (all of its data rows can be deleted). The list of column names is then processed: delete the item at index 3 (the first item has index 0), slice the list with [2:89], and then use the resulting variable cols when reading all the CSVs.
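
If you don't want to maintain a separate header-only file, a minimal sketch is to read zero data rows from any one of the monthly files (sample.csv here is a hypothetical stand-in) and take its columns:

import pandas as pd

# Hypothetical sample file; any one of the monthly CSVs with the shared header works.
sample = r'F:\backup\finalized 2\sample.csv'

# nrows=0 reads only the header row, so no data is loaded.
cols = pd.read_csv(sample, nrows=0).columns.tolist()

del cols[3]        # drop original column 4 (index 3)
cols = cols[2:89]  # keep original columns 3 and 5..90
print(cols)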

In the loop you find all matching files, get a DataFrame from read_csv with usecols=cols, and append it to a list of DataFrames. That list is then concatenated into one big output DataFrame df.

After processing, the output df is written to a file with to_csv.

import pandas as pd
import os

Source = r'F:\backup\finalized 2'
OutputCSV = r'F:\Tom\Python\output.csv'

normSource = os.path.normpath(Source)

#find the column names, then drop columns 1, 2, 4 and everything after column 90
#read one csv that contains the shared header
names = pd.read_csv(os.path.join(normSource, 'header.csv'), sep=",")
#column names to a list
cols = names.columns.tolist()
print(cols)
#the first item has index 0, so you need to delete items 0, 1, 3, and 90 onward

#delete item 3 (original column 4)
del cols[3]

#keep items 2, 4, 5, ..., 89 (original columns 3 and 5..90)
cols = cols[2:89]
print(cols)

dfs = []

for root, dirs, files in os.walk(normSource, topdown=False):
    for name in files:
        print(root)
        print(name)
        if name.startswith('201510') and name.endswith('client.csv'):
            #only read the columns in the list cols
            dfs.append(pd.read_csv(os.path.join(root, name), sep=',', index_col=False, usecols=cols))

#concatenate once, after the loop - all files in one dataframe
df = pd.concat(dfs, ignore_index=True)
print(df.head())

df['ten_avg'] = df.iloc[:, 30:50].sum(axis=1).astype('int64') / 20
df['twenty_avg'] = df.iloc[:, 30:70].sum(axis=1).astype('int64') / 40
print(df.head())

#output to csv, dropping the index
df.to_csv(OutputCSV, sep=",", index=False)

A pandas DataFrame should come in handy here.

Import each file into a DataFrame using:

import pandas as pd
df = pd.read_csv('<csvfilename>', index_col=False, parse_dates=False)

And then append it to the master DataFrame with:

master = pd.concat([master, df], ignore_index=True)

Perform your operations on the DataFrame and then export it using:

master.to_csv('<csv_file_name>')
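
Putting the pieces together, here is a minimal sketch of this approach (the glob pattern and the master_2015_10.csv output name are hypothetical stand-ins; adjust them to your layout):

import glob
import pandas as pd

# Recursively match the October files under the source tree.
paths = glob.glob(r'F:\backup\finalized 2\**\201510*client.csv', recursive=True)

frames = [pd.read_csv(p, index_col=False, parse_dates=False) for p in paths]
master = pd.concat(frames, ignore_index=True)

# Perform your operations here, then export without the index.
master.to_csv(r'F:\Tom\Python\master_2015_10.csv', index=False)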
