Python: parsing structured text to CSV format
I want to convert plain structured text files to CSV format using Python.
The input looks like this:
[-------- 1 -------]
Version: 2
Stream: 5
Account: A
[...]
[------- 2 --------]
Version: 3
Stream: 6
Account: B
[...]
The output is supposed to look like this:
Version; Stream; Account; [...]
2; 5; A; [...]
3; 6; B; [...]
That is, the input consists of structured text records delimited by [----<sequence number>----] and containing <key>: <values> pairs, and the output should be CSV with one record per line.
I am able to get the <key>: <values> pairs into CSV format via:
import re

colonseperated = re.compile(r' *(.+) *: *(.+) *')
fixedfields = re.compile(r'(\d{3} \w{7}) +(.*)')
But I have trouble recognizing the beginning and end of the structured text records, and rewriting them as CSV line records. Furthermore, I would like to be able to separate different types of records, i.e. distinguish between, say, Version: 2 and Version: 3 records.
Reading the records is not that hard:
def read_records(iterable):
    record = {}
    for line in iterable:
        if line.startswith('[------'):
            # new record, yield previous
            if record:
                yield record
            record = {}
            continue
        key, value = line.strip().split(':', 1)
        record[key.strip()] = value.strip()

    # file done, yield last record
    if record:
        yield record
This generator yields one dictionary per record in your input file.
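To make the record format concrete, here is a minimal sketch that runs read_records() on an in-memory copy of the sample input. The sample text is an assumption modelled on the question (the elided [...] lines are left out), and io.StringIO merely stands in for an open file.

```python
import io


def read_records(iterable):
    """Yield one dict per record; a '[------' line starts a new record."""
    record = {}
    for line in iterable:
        if line.startswith('[------'):
            # new record, yield previous
            if record:
                yield record
            record = {}
            continue
        key, value = line.strip().split(':', 1)
        record[key.strip()] = value.strip()

    # input done, yield last record
    if record:
        yield record


# Hypothetical sample input, modelled on the question; not the asker's real file
sample = io.StringIO(
    "[-------- 1 -------]\n"
    "Version: 2\n"
    "Stream: 5\n"
    "Account: A\n"
    "[------- 2 --------]\n"
    "Version: 3\n"
    "Stream: 6\n"
    "Account: B\n"
)

records = list(read_records(sample))
for record in records:
    print(record)
# {'Version': '2', 'Stream': '5', 'Account': 'A'}
# {'Version': '3', 'Stream': '6', 'Account': 'B'}
```

Note that every non-delimiter line is expected to contain a colon; a blank line inside a record would make the split fail with a ValueError, so real input may need an extra check.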
From these you can produce CSV output using the csv module, specifically the csv.DictWriter() class:
import csv

# List *all* possible keys, in the order the output file should list them
headers = ('Version', 'Stream', 'Account', ...)

# Python 3: open the output in text mode with newline='' (on Python 2, use 'wb' instead)
with open(inputfile) as infile, open(outputfile, 'w', newline='') as outfile:
    records = read_records(infile)
    writer = csv.DictWriter(outfile, headers, delimiter=';')
    writer.writeheader()

    # and write
    writer.writerows(records)
Any header keys missing from a record will leave that column empty for that record. Any extra keys in a record that are not listed in headers will raise an exception (a ValueError); either add those to the headers tuple, or set the extrasaction keyword argument of the DictWriter() constructor to 'ignore'.
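The behaviour for missing and extra keys can be checked with a small sketch. The records and the 'Note' key below are made up for illustration, and io.StringIO replaces the output file so the result can be inspected directly.

```python
import csv
import io

headers = ('Version', 'Stream', 'Account')

# Hypothetical records: the first lacks 'Account', the second has an unlisted 'Note' key
records = [
    {'Version': '2', 'Stream': '5'},
    {'Version': '3', 'Stream': '6', 'Account': 'B', 'Note': 'x'},
]

# extrasaction='ignore' silently drops keys that are not in headers
out = io.StringIO()
writer = csv.DictWriter(out, headers, delimiter=';', extrasaction='ignore')
writer.writeheader()
writer.writerows(records)
result = out.getvalue()
print(result)
# Version;Stream;Account
# 2;5;
# 3;6;B

# The default, extrasaction='raise', rejects the record with the unlisted key
raised = False
strict = csv.DictWriter(io.StringIO(), headers, delimiter=';')
try:
    strict.writerow(records[1])
except ValueError as exc:
    raised = True
    print('rejected:', exc)
```

The csv module only accepts a single-character delimiter, so the output uses ';' rather than the '; ' (semicolon plus space) shown in the question.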