
Merging multiple CSV files without headers being repeated (using Python)

I am a beginner with Python. I have multiple CSV files (more than 10), and all of them have the same number of columns. I would like to merge all of them into a single CSV file in which the headers are not repeated.

So essentially I need just the first row to contain the headers, followed by all the rows from all the CSV files. How do I do this?

Here's what I tried so far.

import glob
import csv



with open('output.csv','wb') as fout:
    wout = csv.writer(fout,delimiter=',') 
    interesting_files = glob.glob("*.csv") 
    for filename in interesting_files: 
        print 'Processing',filename 
    # Open and process file
        h = True
        with open(filename,'rb') as fin:
                fin.next()#skip header
        for line in csv.reader(fin,delimiter=','):
                wout.writerow(line)

If you are on a Linux system:

head -n 1 directory/one_file.csv > output.csv   ## write the header to the final file
tail -q -n +2 directory/*.csv >> output.csv     ## append each file from its second line on; -q suppresses the "==> file <==" banners

While I think that the best answer is the one from @valentin, you can do this without using the csv module at all:

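import glob

output_file = 'output.csv'
# Skip the output file so a previous run's result is not merged into itself
interesting_files = [f for f in glob.glob("*.csv") if f != output_file]

header_saved = False
# Text mode on both sides (Python 3); newline='' leaves line endings unchanged
with open(output_file, 'w', newline='') as fout:
    for filename in interesting_files:
        with open(filename, newline='') as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)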
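
For large files, the per-line loop can be replaced by a bulk copy once the header has been consumed. The following is only a minimal sketch, assuming Python 3, single-line headers, and inputs that each end with a newline; the function name `merge_csv_raw` and its parameters are hypothetical and not part of the original answer.

```python
import glob
import os
import shutil


def merge_csv_raw(pattern, output_path):
    """Concatenate CSV files matching pattern, keeping only the first header.

    Assumes single-line headers and that each file ends with a newline.
    """
    out_abs = os.path.abspath(output_path)
    # Sort for a deterministic order and skip the output file itself.
    inputs = sorted(
        f for f in glob.glob(pattern) if os.path.abspath(f) != out_abs
    )
    header_saved = False
    with open(output_path, "wb") as fout:
        for filename in inputs:
            with open(filename, "rb") as fin:
                header = fin.readline()
                if not header_saved:
                    fout.write(header)
                    header_saved = True
                # Copy the remaining bytes in chunks instead of line by line.
                shutil.copyfileobj(fin, fout)
    return inputs
```

Calling `merge_csv_raw("*.csv", "output.csv")` would mirror the loop above. Working in binary mode avoids any decoding, so the bytes of each data row are copied unchanged.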

If you don't mind the overhead, you could use pandas, which is shipped with common Python distributions. If you plan to do more with spreadsheet tables, I recommend using pandas rather than trying to write your own libraries.

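import glob

import pandas as pd

output_file = 'output.csv'
# Skip the output file in case it is left over from a previous run
interesting_files = [f for f in glob.glob("*.csv") if f != output_file]
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list, ignore_index=True)

# index=False keeps the row index out of the output file
full_df.to_csv(output_file, index=False)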

Just a little more on pandas. Because it is made to deal with spreadsheet-like data, it knows the first line is a header. When reading a CSV, it separates the data table from the header, which is kept as metadata of the DataFrame, the standard data type in pandas. If you concatenate several of these DataFrames, only the data parts are concatenated when their headers are the same. If the headers are not the same, pd.concat with its default settings does not raise an error: it keeps the union of all columns and fills the gaps with NaN. So if your directory might be polluted with CSV files from another source, compare the columns before concatenating.
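
One way to guard against a polluted directory, whichever merge approach is used, is to compare the headers explicitly before merging. This is a minimal sketch using only the standard library, assuming Python 3 and non-empty files; the helper names `read_header` and `check_same_headers` are hypothetical.

```python
import csv


def read_header(path):
    """Return the first CSV record of a file as a list of column names."""
    with open(path, newline="") as fin:
        # Raises StopIteration for an empty file (assumed not to occur here).
        return next(csv.reader(fin))


def check_same_headers(paths):
    """Raise ValueError if any file's header differs from the first file's."""
    expected = None
    for path in paths:
        header = read_header(path)
        if expected is None:
            expected = header
        elif header != expected:
            raise ValueError(
                "{} has header {!r}, expected {!r}".format(path, header, expected)
            )
    return expected
```

Running `check_same_headers(sorted(interesting_files))` before the merge would stop early on a stray file instead of silently producing a table with extra columns.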

Another thing: I just added sorted() around interesting_files. I assume your files are named in order and this order should be kept. glob does not guarantee any particular order, and neither do the os functions it relies on, so it is safer to sort explicitly.
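
One caveat with sorted(): it compares names character by character, so a name containing 10 sorts before one containing 2. If the files are numbered without zero padding, a numeric-aware key may be what you want. This is a small sketch; the helper `natural_key` and the file names are made up for illustration.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group alternates text (even) and digits (odd).
    return [int(part) if i % 2 else part.lower()
            for i, part in enumerate(parts)]


names = ["data10.csv", "data2.csv", "data1.csv"]
print(sorted(names))                   # ['data1.csv', 'data10.csv', 'data2.csv']
print(sorted(names, key=natural_key))  # ['data1.csv', 'data2.csv', 'data10.csv']
```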

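Your indentation is wrong: you need to put the loop inside the with block. You can also pass the csv.reader object straight to writer.writerows (passing the raw file object would split every line into single characters).

import csv
import glob

with open('output.csv', 'w', newline='') as fout:  # 'wb' in Python 2
    wout = csv.writer(fout)
    interesting_files = [f for f in glob.glob("*.csv") if f != 'output.csv']
    for i, filename in enumerate(interesting_files):
        print('Processing', filename)
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            header = next(reader)
            if i == 0:
                wout.writerow(header)  # keep the header from the first file only
            wout.writerows(reader)     # runs while fin is still open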

Your attempt is almost working, but the issues are:

  • you're opening the file for reading but closing it before writing the rows.
  • you're never writing the header. You have to write it once.
  • you also have to exclude output.csv from the glob, otherwise the output is also part of the input!

Here's the corrected code, passing the csv reader object directly to the writer's writerows method for shorter and faster code. It also writes the header from the first file to the output file.

import glob
import csv

output_file = 'output.csv'
header_written = False

with open(output_file, 'w', newline="") as fout:  # just "wb" in Python 2
    wout = csv.writer(fout, delimiter=',')
    # filter out the output file
    interesting_files = [x for x in glob.glob("*.csv") if x != output_file]
    for filename in interesting_files:
        print('Processing {}'.format(filename))
        with open(filename, newline="") as fin:
            cr = csv.reader(fin, delimiter=",")
            header = next(cr)  # read the header; next() works in Python 2 and 3
            if not header_written:
                wout.writerow(header)
                header_written = True
            wout.writerows(cr)

Note that solutions using raw line-by-line processing miss an important point: if the header spans multiple lines (a quoted field containing a line break), they fail badly, botching the header line and repeating part of it several times, effectively corrupting the file.

The csv module (and pandas, too) handles those cases gracefully.
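
To make the multi-line header case concrete, here is a small self-contained sketch. The sample data is invented for illustration; it compares what raw line reading and csv.reader each see as the "first row".

```python
import csv
import io

# A CSV whose second column name contains a line break inside quotes.
raw = 'id,"unit\nprice"\n1,9.5\n2,3.0\n'

# Raw line-based reading stops at the first newline, mid-header.
first_line = next(io.StringIO(raw))
print(repr(first_line))  # 'id,"unit\n'

# csv.reader parses the quoted field and returns the header as one record.
reader = csv.reader(io.StringIO(raw, newline=""))
header = next(reader)
rows = list(reader)
print(header)  # ['id', 'unit\nprice']
print(rows)    # [['1', '9.5'], ['2', '3.0']]
```

With the line-based approach, the leftover `price"` fragment would be copied into the output as if it were a data row for every file after the first.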


 