Python script to read three CSV files and write to one CSV file
I am trying to read three CSV files and want to write the output to a single CSV file, using the first column as an ID so it is not repeated, since it is common to all input CSV files. I have written some code, but it gives errors, and I am not sure this is the best way to perform my task.

Code:
#! /usr/bin/python
import csv
from collections import defaultdict

result = defaultdict(dict)
fieldnames = ("ID")
for csvfile in ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv"):
    with open(csvfile, 'rb') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            id = row.pop("ID")
            for key in row:
                fieldnames.add(key)
                result[id][key] = row[key]

with open("out.csv", "w") as outfile:
    writer = csv.DictWriter(outfile, sorted(fieldnames))
    writer.writeheader()
    for item in result:
        result[item]["ID"] = item
        writer.writerow(result[item])
The input CSV files are listed below:
FR1.1.csv:
TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED
FR2.0.csv:
TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR2.0 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR2.0 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR2.0 , COMPILE_FAILED , EXECUTION_FAILED
FR2.5.csv:
TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR2.5 , COMPILE_FAILED , EXECUTION_FAILED
out.csv (required):
TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS , RELEASE , COMPILE_STATUS , EXECUTION_STATUS , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
Please post the best method to achieve the above result. Thanks.
If you want to just join each CSV row based on ID, then don't use a DictReader. Dictionary keys must be unique, but you are producing rows with multiple RELEASE, COMPILE_STATUS and EXECUTION_STATUS columns.

Moreover, how will you handle IDs where one or two of the input CSV files have no input?
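A minimal sketch of that key collision (hypothetical values): merging per-file row dicts under the same column names silently keeps only the last file's data.

```python
# Hypothetical rows for one TEST_Id from two of the files; the column
# names collide, so a plain dict keeps only the last update.
row_fr11 = {"RELEASE": "FR1.1", "COMPILE_STATUS": "COMPILE_PASSED"}
row_fr20 = {"RELEASE": "FR2.0", "COMPILE_STATUS": "COMPILE_FAILED"}

merged = {}
merged.update(row_fr11)
merged.update(row_fr20)  # overwrites the FR1.1 values

print(merged["RELEASE"])  # FR2.0 -- the FR1.1 data is gone
```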
Use regular readers and store each row keyed by filename. Make fieldnames a list as well:
import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])      # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in sorted(result):
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
This code stores rows per filename, so that you can later build a whole row from each file but still fill in blanks if the row was missing in that file.
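The backfill idiom in the loop above can be illustrated in isolation (hypothetical ID and lengths):

```python
# A file that recorded 3 columns (beyond the ID) but has no row for this
# ID: data.get(filename) returns None, so `or` substitutes the right
# number of empty strings to keep the columns aligned.
lengths = {"FR2.0.csv": 3}
data = {}  # no entry for "FR2.0.csv" for this ID
row = ["FC/B_099.config"]
row.extend(data.get("FR2.0.csv") or [''] * lengths["FR2.0.csv"])
print(row)  # ['FC/B_099.config', '', '', '']
```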
The alternative would be to make column names unique by appending a number or filename to each; that way your DictReader approach could work too.
The regular-reader script above gives:
TEST_ID, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_FAILED , EXECUTION_FAILED, FR2.5 , COMPILE_FAILED , EXECUTION_FAILED
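The suffixed-column DictReader alternative mentioned above could be sketched as follows (a hypothetical sketch, Python 3, with inline data standing in for the real files; with files on disk you would use open() instead of io.StringIO):

```python
import csv
import io
from collections import defaultdict

# Inline stand-ins for two of the input files (assumed simplified headers).
sources = [
    ("FR1.1.csv", "TEST_Id,RELEASE,COMPILE_STATUS\n"
                  "FC/B_019.config,FR1.1,COMPILE_PASSED\n"),
    ("FR2.0.csv", "TEST_Id,RELEASE,COMPILE_STATUS\n"
                  "FC/B_019.config,FR2.0,COMPILE_PASSED\n"),
]

result = defaultdict(dict)
fieldnames = ["TEST_Id"]
for name, text in sources:
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        id_ = row.pop("TEST_Id")
        for key, value in row.items():
            # Suffix each column name with its source file to keep it unique.
            unique = "%s (%s)" % (key, name)  # e.g. "RELEASE (FR2.0.csv)"
            if unique not in fieldnames:
                fieldnames.append(unique)
            result[id_][unique] = value

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames, restval="")  # blanks for missing IDs
writer.writeheader()
for id_ in sorted(result):
    writer.writerow(dict(result[id_], TEST_Id=id_))
print(out.getvalue())
```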
If you need to base your output order on one of the input files, then omit that file from the first reading loop; instead, read it while writing the output and use its first column to look up the other file data:
import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = []

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])      # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("FR1.1.csv", "rb") as infile, open("out.csv", "wb") as outfile:
    reader = csv.reader(infile)
    headers = next(reader, [])  # read first line, headers
    writer = csv.writer(outfile)
    writer.writerow(headers + fieldnames)
    for row in sorted(reader):
        data = result[row[0]]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
This does mean that any TEST_ID values that appear only in the other two files are ignored.
If you wanted to preserve all TEST_IDs, then I'd use collections.OrderedDict(); new TEST_IDs found in the later files will be tacked onto the end:
import csv
from collections import OrderedDict

result = OrderedDict()
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])      # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # how many items to backfill
        for row in reader:
            if row[0] not in result:
                result[row[0]] = {}
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in result:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
The OrderedDict maintains entries in insertion order; so FR1.1.csv sets the order for all keys, but any FR2.0.csv IDs not found in the first file are appended to the dictionary at the end, and so on.
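That first-insertion ordering can be seen in a small sketch (hypothetical IDs):

```python
from collections import OrderedDict

# Keys keep the order they were first inserted in; re-inserting an
# existing key does not move it, so later files only append new IDs.
result = OrderedDict()
for id_ in ("B_019", "B_020"):   # order from the first file
    result.setdefault(id_, {})
for id_ in ("B_020", "B_007"):   # second file; only B_007 is new
    result.setdefault(id_, {})

print(list(result))  # ['B_019', 'B_020', 'B_007']
```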
For Python versions < 2.7, either install a backport (see OrderedDict for older versions of Python) or track the ID order manually:
import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]
ids, seen = [], set()

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])      # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # how many items to backfill
        for row in reader:
            id_ = row[0]
            # track ordering
            if id_ not in seen:
                seen.add(id_)
                ids.append(id_)
            result[id_][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in ids:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)