Parsing structured text with indentation in Python

Question

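I've been struggling to find an efficient way to parse some plain text (from a Word document) that is structured by indentation. Example (note: the indentation below does not render on the mobile version of SO):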

Attendance records    8 F    1921-2010    Box 2
    1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010
Number of meetings attended each year    1 F    1991-1994    Box 2
Papers re: Safaris    10 F    1951-2011    Box 2
    Incomplete; Includes correspondence about beginning “Safaris” may also include announcements, invitations, reports, attendance, and charges; some photographs. See also: Correspondence and Minutes

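So the unindented text is the parent record data, and each group of indented text below a parent line holds notes for that record (the notes themselves are also split across multiple lines).

So far I have a rough script that parses the unindented parent lines so that I get a list of dictionary items: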

import re

# Read every line of the exported text file.
with open('example_text.txt', 'r') as f:
    lines = f.readlines()

records = []

for line in lines:
    if line[0].isalpha():
        # Unindented parent line: fields are separated by two or more spaces.
        processed = re.split(r'\s{2,}', line.strip())
        title = processed[0]
        rec_id = processed[1]
        years = processed[2]
        location = processed[3]

        records.append({
            "title": title,
            "id": rec_id,
            "years": years,
            "location": location
        })
    else:
        print("These are the notes, but attaching them to the above records is not clear")

print(records)

This produces:

[{'id': '8 F', 'location': 'Box 2', 'title': 'Attendance records', 'years': '1921-2010'}, {'id': '1 F', 'location': 'Box 2', 'title': 'Number of meetings attended each year', 'years': '1991-1994'}, {'id': '10 F', 'location': 'Box 2', 'title': 'Papers re: Safaris', 'years': '1951-2011'}]

But now I want to add the notes to each record, like this:

[{'id': '8 F', 'location': 'Box 2', 'title': 'Attendance records', 'years': '1921-2010', 'notes': '1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010' }, ...]

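What confuses me is that I'm taking this procedural, line-by-line approach, and I'm not sure whether there is a more Pythonic way to do it. I'm more used to scraping web pages, where at least you have selectors; here it is hard to double back over the text, and I'm hoping someone can shake me out of my rut and offer a fresh perspective on a better way to attack this.

Update: simply adding the condition suggested in the answer below for the indented lines did the trick: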

import re
from pprint import pprint

with open('example_text.txt', 'r') as f:
    lines = f.readlines()

records = []

for line in lines:
    if not line[0].isalpha():
        # Indented (or blank) line: attach it to the most recent record.
        # This assumes the file starts with a record line.
        record['notes'].append(line)
        continue

    # Unindented parent line: fields are separated by two or more spaces.
    processed = re.split(r'\s{2,}', line.strip())
    title = processed[0]
    rec_id = processed[1]
    years = processed[2]
    location = processed[3]

    record = {"title": title,
              "id": rec_id,
              "years": years,
              "location": location,
              "notes": []}

    records.append(record)

pprint(records)

Answer 1

Since you have already solved the parsing of the records themselves, I will focus only on how to read the notes for each record:

records = []

with open('data.txt', 'r') as lines:
    for line in lines:
        if line.startswith('\t'):
            # Tab-indented line: a note for the most recent record.
            record['notes'].append(line[1:])
            continue
        # Any other line starts a new record.
        record = {'title': line, 'notes': []}
        records.append(record)

for record in records:
    print('Record is', record['title'])
    print('Notes are', record['notes'])
    print()
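
For reference, here is a minimal sketch that combines the field splitting from the question with the note-attaching idea from this answer, and then joins each record's notes into a single string as in the desired output. The names `parse_records` and `sample` are hypothetical, the line breaks inside the sample notes are assumed (the original layout is not recoverable here), and skipping malformed record lines is just one possible policy.

```python
import re


def parse_records(lines):
    """Group indented note lines under the preceding unindented record line."""
    records = []
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue  # skip blank or whitespace-only lines
        if line[0].isspace():
            # Indented line (spaces or tabs): a note for the most recent record.
            if records:
                records[-1]["notes"].append(line.strip())
            continue
        # Unindented line: fields are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line)
        if len(fields) < 4:
            continue  # assumed policy: ignore record lines with missing fields
        title, rec_id, years, location = fields[:4]
        records.append({
            "title": title,
            "id": rec_id,
            "years": years,
            "location": location,
            "notes": [],
        })
    # Collapse each record's note lines into one space-separated string.
    for record in records:
        record["notes"] = " ".join(record["notes"])
    return records


# Assumed sample input; the note line breaks are illustrative only.
sample = [
    "Attendance records  8 F  1921-2010  Box 2\n",
    "    1921-1927, 1932-1944\n",
    "    1937-1939,1948-1966, 1971-1979, 1989-1994, 2010\n",
    "Number of meetings attended each year  1 F  1991-1994  Box 2\n",
]
records = parse_records(sample)
```

Checking `line[0].isspace()` rather than `line.startswith('\t')` lets the same loop handle both tab-indented and space-indented exports, and using `records[-1]` avoids relying on a loop variable that may not exist yet if the file begins with an indented line.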

Answer 1 (accepted, 2014-04-24 16:46:41)
