Python: parsing structured text into CSV format

Question

I want to convert plain structured text files to the CSV format using Python.

The input looks like this:

[-------- 1 -------]
Version: 2
 Stream: 5
 Account: A
[...]
[------- 2 --------]
 Version: 3
 Stream: 6
 Account: B
[...]

The output is supposed to look like this:

Version; Stream; Account; [...]
2; 5; A; [...]
3; 6; B; [...]

That is, the input consists of structured text records delimited by [----<sequence number>----] and containing <key>: <values> pairs, and the output should be CSV with one record per line.

I am able to retrieve the <key>: <values> pairs for the CSV output via

import re

# raw strings keep backslash escapes such as \d intact
colonseperated = re.compile(r' *(.+) *: *(.+) *')
fixedfields = re.compile(r'(\d{3} \w{7}) +(.*)')

However, I have trouble recognizing the beginning and end of the structured text records and rewriting them as CSV line records. Furthermore, I would like to be able to separate different types of records, i.e. distinguish between, say, Version: 2 and Version: 3 records.

Answer 1

Reading the records is not that hard:

def read_records(iterable):
    """Yield one dict of key/value pairs per record."""
    record = {}
    for line in iterable:
        if line.startswith('[------'):
            # new record, yield previous
            if record:
                yield record
            record = {}
            continue
        if ':' not in line:
            # skip blank lines and anything that is not a key: value pair
            continue
        key, value = line.split(':', 1)
        record[key.strip()] = value.strip()

    # file done, yield last record
    if record:
        yield record

This produces dictionaries from your input file.
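As a quick check of what the generator yields, here is a minimal sketch that feeds it the sample input from the question through an in-memory file. The generator is repeated so the snippet runs on its own, with a guard that skips lines without a colon; the `sample` variable and the use of `io.StringIO` are illustrative assumptions, not part of the original answer.

```python
import io


def read_records(iterable):
    """Yield one dict per record, skipping lines that have no colon."""
    record = {}
    for line in iterable:
        if line.startswith('[------'):
            # a delimiter line starts a new record; emit the previous one
            if record:
                yield record
            record = {}
            continue
        if ':' not in line:
            continue
        key, value = line.split(':', 1)
        record[key.strip()] = value.strip()
    if record:
        yield record


# Hypothetical stand-in for the input file, using the question's sample data
sample = io.StringIO(
    "[-------- 1 -------]\n"
    "Version: 2\n"
    " Stream: 5\n"
    " Account: A\n"
    "[------- 2 --------]\n"
    " Version: 3\n"
    " Stream: 6\n"
    " Account: B\n"
)

records = list(read_records(sample))
print(records)
# [{'Version': '2', 'Stream': '5', 'Account': 'A'},
#  {'Version': '3', 'Stream': '6', 'Account': 'B'}]
```

All values come back as strings; convert them yourself if you need numbers.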

From this you can produce CSV output using the csv module, specifically the csv.DictWriter() class:

import csv

# List *all* possible keys, in the order the output file should list them
headers = ('Version', 'Stream', 'Account', ...)

# Python 3: open the output in text mode with newline='' (Python 2 used 'wb')
with open(inputfile) as infile, open(outputfile, 'w', newline='') as outfile:
    records = read_records(infile)

    writer = csv.DictWriter(outfile, headers, delimiter=';')
    writer.writeheader()

    # and write
    writer.writerows(records)

Any header keys missing from a record will leave that column empty for that record. Any extra keys in a record that are not listed in headers will raise an exception (ValueError); either add those to the headers tuple, or set the extrasaction keyword argument of the DictWriter() constructor to 'ignore'.
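The behaviour for missing and extra keys can be tried out with a small sketch that writes to an in-memory buffer instead of a file. The three records, the 'Note' key and the variable names below are made up for illustration; only csv.DictWriter and its extrasaction argument come from the answer.

```python
import csv
import io

headers = ('Version', 'Stream', 'Account')

# Hypothetical records: one complete, one missing a key, one with an extra key
records = [
    {'Version': '2', 'Stream': '5', 'Account': 'A'},
    {'Version': '3', 'Stream': '6'},
    {'Version': '3', 'Stream': '7', 'Account': 'C', 'Note': 'x'},
]

# extrasaction='ignore' silently drops keys that are not in headers
out = io.StringIO(newline='')
writer = csv.DictWriter(out, headers, delimiter=';', extrasaction='ignore')
writer.writeheader()
writer.writerows(records)
lines = out.getvalue().splitlines()
print(lines)

# With the default extrasaction='raise', the extra key is rejected
strict = csv.DictWriter(io.StringIO(), headers, delimiter=';')
try:
    strict.writerow(records[2])
    rejected = False
except ValueError as exc:
    rejected = True
    print('rejected:', exc)
```

The second record is written as `3;6;`, with the Account column left empty, and the 'Note' value of the third record does not appear in the output.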

Answer 1: accepted, 2013-10-17 21:12:45
