Read in csv files in a folder and output to one single csv

I have a folder with many csv files. I want to read them in and, depending on certain criteria, write the records to specific output files. In my case there are 3 different output files.

So I have many csv files. Let's consider one file looking like:


And another looking like:


Each csv file has a header. The header is always the same. In the final csv files I would like to have the header once at the beginning, but not in between the data.

I want to have one file where every record is stored. In another I would like to have only those records where Column1 begins with '80B'. In the third file I would like to have those records where Column1 does not begin with '80B' and the fourth character is not equal to 'D'.

So the output should be:

file 'all.csv'


file 'subset_1'


file 'subset_2'


I tried the following code:

import glob
import csv
import os

path = r'C:\myfolder\test'

all_files=glob.glob(os.path.join(path, "*.csv"))

with open(r'C:\myfolder\all.csv', "w", newline='') as dall, \
open(r'C:\myfolder\subset_1.csv', "w", newline='') as \
subset_1, open(r'C:\myfolder\subset_2.csv', "w", newline='') as subset_2:
    cw_all = csv.writer(dall, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw_subset_1 = csv.writer(subset_1, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    cw_subset_2 = csv.writer(subset_2, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    for filename in all_files:
        with open(filename) as infile:
            cr = csv.reader(infile, delimiter=";")
            for line in cr:
            if (
                (line[0][:3] !="80B")
                ): cw_subset_1.writerow(line)
            if (
                (line[0][:3] =="80B") and
                (line[0][3:4] =="D")
                ): cw_subset_2.writerow(line)

For the first try I also ignored the problem with the header and commented out the next(cr). But it is not working. Somehow the records are not properly stored into the corresponding files. The line pointer is not putting each record properly into the files. Where is my mistake?

I would like to do it at the csv level, without pandas.

(I want to write it "on the fly" while reading the files, so I do not want to first create a large file with everything, then read this once to create the first subset and then read the large file a second time to create the second subset. This is quite inefficient as I have to read the large file several times.)

There are three problems I see:

  1. Uncomment next(cr) so the headers aren't copied into the new files.
  2. The if statements should be indented under the for line in cr: line.
  3. line[0][3:4] == "D" should be line[0][3:4] != "D".

Note that line[0][3:4] != "D" can be just line[0][3] != "D" when checking a single character in a string.
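
One caveat, shown here as a small sketch: the two forms only behave the same when the value has at least four characters. The sample values below are made up for illustration, and whether Column1 can be that short in the real data is not stated in the question.

```python
short_value = "80B"   # hypothetical Column1 value with only three characters
long_value = "80BD"   # hypothetical Column1 value with a fourth character

# Slicing past the end of a string returns an empty string.
fourth_by_slice = short_value[3:4]

# Indexing past the end of a string raises IndexError.
try:
    fourth_by_index = short_value[3]
except IndexError:
    fourth_by_index = None

# With four or more characters, both forms give the same single character.
same_when_long = long_value[3] == long_value[3:4]
```

If short values are possible, the slice form is the safer of the two.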

Your description of the 3rd file does not match the desired output. I went with the description below. The comments are taken from the OP's requirements.

for filename in all_files:
    with open(filename) as infile:
        cr = csv.reader(infile, delimiter=';')
        next(cr)  # skip the header in each input file
        for line in cr:
            # one file where every record is stored.
            cw_all.writerow(line)
            # only those records where Column1 begins with '80B'.
            if line[0][:3] == '80B':
                cw_subset_1.writerow(line)
            # those records where Column1 does not begin with '80B'
            # and the fourth character is not equal to 'D'.
            if line[0][:3] != '80B' and line[0][3] != 'D':
                cw_subset_2.writerow(line)
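
For completeness, here is one way the whole task could be put together, including writing the header once at the top of each output file. This is only a sketch under a few assumptions: the function name split_csv_folder and its in_dir/out_dir parameters are made up for illustration, the output folder is assumed to differ from the input folder (otherwise glob would pick up the output files as input), and a Column1 value shorter than four characters is treated as "fourth character is not 'D'".

```python
import contextlib
import csv
import glob
import os


def split_csv_folder(in_dir, out_dir):
    """Write all records to all.csv and route matching ones to the two subsets."""
    names = ("all.csv", "subset_1.csv", "subset_2.csv")
    # sorted() only makes the output order predictable; glob order is arbitrary.
    all_files = sorted(glob.glob(os.path.join(in_dir, "*.csv")))
    with contextlib.ExitStack() as stack:
        # ExitStack closes all three output files when the block ends.
        writers = [
            csv.writer(
                stack.enter_context(
                    open(os.path.join(out_dir, name), "w", newline="")
                ),
                delimiter=";",
                quoting=csv.QUOTE_MINIMAL,
            )
            for name in names
        ]
        cw_all, cw_subset_1, cw_subset_2 = writers
        header_written = False
        for filename in all_files:
            with open(filename, newline="") as infile:
                cr = csv.reader(infile, delimiter=";")
                header = next(cr, None)  # None if the file is empty
                if header is None:
                    continue
                if not header_written:
                    # Copy the header of the first file to every output, once.
                    for cw in writers:
                        cw.writerow(header)
                    header_written = True
                for line in cr:
                    if not line:
                        continue  # skip blank lines
                    cw_all.writerow(line)
                    if line[0].startswith("80B"):
                        cw_subset_1.writerow(line)
                    elif line[0][3:4] != "D":
                        # Assumption: values shorter than 4 characters land here.
                        cw_subset_2.writerow(line)
```

With the paths from the question, the call would be split_csv_folder(r'C:\myfolder\test', r'C:\myfolder'). Each input file is read exactly once and every record is written as it is read, so no large intermediate file is needed.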

