Python script to read three CSV files and write to one CSV file

I am trying to read three CSV files and want to put the output into a single CSV file, with the first column as an ID that is not repeated, since it is common to all of the input files. I have written some code, but it is giving errors, and I am not sure it is the best way to perform my task.

Code:

#! /usr/bin/python
import csv
from collections import defaultdict

result = defaultdict(dict)
fieldnames = ("ID")

for csvfile in ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv"):
    with open(csvfile, 'rb') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            id = row.pop("ID")
            for key in row:
                fieldnames.add(key) 
                result[id][key] = row[key]

    with open("out.csv", "w") as outfile:
    writer = csv.DictWriter(outfile, sorted(fieldnames))
    writer.writeheader()
    for item in result:
        result[item]["ID"] = item
        writer.writerow(result[item]

The input CSV files are listed below:

FR1.1.csv:

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED

FR2.0.csv:

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR2.0 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR2.0 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR2.0 , COMPILE_FAILED , EXECUTION_FAILED

FR2.5.csv:

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR2.5 , COMPILE_FAILED , EXECUTION_FAILED

out.csv (required):

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS , RELEASE , COMPILE_STATUS , EXECUTION_STATUS , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED

Thanks in advance for posting the best method to achieve the above result.

If you want to just join each CSV row based on ID, then don't use a DictReader. Dictionary keys must be unique, but you are producing rows with multiple EXECUTION_STATUS, RELEASE, etc. columns.
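A tiny illustration of the collision (with hypothetical values): each assignment to an existing key silently overwrites the previous value, so only the last file's data would survive.

row = {"TEST_Id": "FC/B_019.config"}
row["RELEASE"] = "FR1.1"  # value from the first file
row["RELEASE"] = "FR2.0"  # the second file overwrites it
print row  # {'TEST_Id': 'FC/B_019.config', 'RELEASE': 'FR2.0'}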

Moreover, how will you handle IDs where one or two of the input CSV files have no input?

Use regular readers and store each row keyed by filename. Make fieldnames a list as well:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in sorted(result):
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

This code stores rows per filename, so that you can later build a whole row from each file but still fill in blanks if the row was missing in that file.

The alternative would be to make the column names unique by appending a number or filename to each; that way your DictReader approach could work too, as in the sketch below.
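A minimal sketch of that alternative, using the filename as the suffix (the "NAME (filename)" renaming scheme is just one arbitrary choice):

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
fieldnames = ["TEST_Id"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.DictReader(infile)
        id_column = reader.fieldnames[0]
        # rename each remaining column by tacking on the filename,
        # e.g. "RELEASE (FR1.1.csv)", so the dictionary keys stay unique
        renamed = [(name, "%s (%s)" % (name, csvfile))
                   for name in reader.fieldnames[1:]]
        fieldnames.extend(new for _, new in renamed)
        for row in reader:
            id_ = row.pop(id_column)
            for old, new in renamed:
                result[id_][new] = row[old]

with open("out.csv", "wb") as outfile:
    # restval='' backfills columns for files where an ID was missing
    writer = csv.DictWriter(outfile, fieldnames, restval='')
    writer.writeheader()
    for id_ in sorted(result):
        row = result[id_]
        row["TEST_Id"] = id_
        writer.writerow(row)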

The regular-reader version above gives:

TEST_ID, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_FAILED , EXECUTION_FAILED, FR2.5 , COMPILE_FAILED , EXECUTION_FAILED

If you need to base your ordering on one of the input files, then omit that file from the first reading loop; instead, read it while writing the output loop and use its first column to look up the other files' data:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = []

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("FR1.1.csv", "rb") as infile, open("out.csv", "wb") as outfile:
    reader = csv.reader(infile)
    headers = next(reader, [])  # read first line, headers

    writer = csv.writer(outfile)
    writer.writerow(headers + fieldnames)

    for row in sorted(reader):
        data = result[row[0]]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

This does mean that any extra TEST_ID values in the other two files are ignored.
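If you'd rather report those extras than drop them silently, a small sketch (reusing result from the loop above) could flag them:

# warn about TEST_IDs present in FR2.0.csv / FR2.5.csv but absent
# from FR1.1.csv; the output loop above silently skips these
with open("FR1.1.csv", "rb") as infile:
    reader = csv.reader(infile)
    next(reader, [])  # skip the header row
    base_ids = set(row[0] for row in reader)

extras = set(result) - base_ids
if extras:
    print "Ignored TEST_IDs: %s" % ", ".join(sorted(extras))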

If you wanted to preserve all TEST_IDs then I'd use collections.OrderedDict(); new TEST_IDs found in the later files will be tacked onto the end:

import csv
from collections import OrderedDict

result = OrderedDict()
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            if row[0] not in result:
                result[row[0]] = {}
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in result:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

The OrderedDict maintains entries in insertion order; so FR1.1.csv sets the order for all keys, but any FR2.0.csv IDs not found in the first file are appended to the dictionary at the end, and so on.
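A quick standalone demonstration of that ordering behaviour (the keys here are made up):

from collections import OrderedDict

d = OrderedDict()
d["FC/B_019.config"] = "from FR1.1.csv"
d["FC/B_021.config"] = "from FR1.1.csv"
d["FC/B_019.config"] = "updated by FR2.0.csv"  # keeps its original slot
d["FC/B_099.config"] = "new in FR2.0.csv"      # appended at the end
print list(d)
# ['FC/B_019.config', 'FC/B_021.config', 'FC/B_099.config']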

For Python versions < 2.7, either install a backport (see OrderedDict for older versions of python) or track the ID order manually with:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]
ids, seen = [], set()

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            id_ = row[0]
            # track ordering
            if id_ not in seen:
                seen.add(id_)
                ids.append(id_)
            result[id_][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in ids:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
