
Parsing Text Structured with Indents in Python

I'm stuck trying to find an efficient way to parse some plain text (from a Word document) that is structured with indents. Example (note: the indentation below does not render on the mobile version of SO):

Attendance records    8 F    1921-2010    Box 2
    1921-1927, 1932-1944
    1937-1939,1948-1966, 1971-1979, 1989-1994, 2010
Number of meetings attended each year    1 F    1991-1994    Box 2
Papers re: Safaris    10 F    1951-2011    Box 2
    Incomplete; Includes correspondence about beginning “Safaris” may also include announcements, invitations, reports, attendance, and charges; some photographs.
    See also: Correspondence and Minutes

So the unindented text is the parent record data, and each set of indented lines below a parent line holds notes for that record (the notes are themselves split across multiple lines).

So far I have a crude script that parses the unindented parent lines into a list of dictionaries:

import re

records = []

with open('example_text.txt', 'r') as f:
    for line in f:
        if line[0].isalpha():
            # Parent line: columns are separated by runs of two or more whitespace characters.
            processed = re.split(r'\s{2,}', line.strip())

            title = processed[0]
            rec_id = processed[1]
            years = processed[2]
            location = processed[3]

            records.append({
                "title": title,
                "id": rec_id,
                "years": years,
                "location": location
            })
        else:
            # Indented line: part of the notes for the record above.
            print("These are the notes, but attaching them to the above records is not clear")

print(records)

and this produces:

[{'id': '8 F', 'location': 'Box 2', 'title': 'Attendance records', 'years': '1921-2010'}, {'id': '1 F', 'location': 'Box 2', 'title': 'Number of meetings attended each year', 'years': '1991-1994'}, {'id': '10 F', 'location': 'Box 2', 'title': 'Papers re: Safaris', 'years': '1951-2011'}]

But now I want to add the notes to each record, to the effect of:

[{'id': '8 F', 'location': 'Box 2', 'title': 'Attendance records', 'years': '1921-2010', 'notes': '1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010' }, ...]

What's confusing me is that I am assuming this procedural, line-by-line approach, and I'm not sure whether there is a more Pythonic way to do this. I am more used to scraping webpages, where at least you have selectors; here it's hard to double back while going down the lines one by one. I was hoping someone might be able to shake my thinking loose and offer a fresh view on a better way to attack this.

Update: Just adding the condition suggested by the answer below for the indented lines worked fine:

import re
from pprint import pprint

records = []

with open('example_text.txt', 'r') as f:
    for line in f:
        if not line[0].isalpha():
            # Indented line: attach it to the notes of the most recent record.
            record['notes'].append(line)
            continue

        # Unindented line: split the columns on runs of two or more whitespace characters.
        processed = re.split(r'\s{2,}', line.strip())

        record = {"title": processed[0],
                  "id": processed[1],
                  "years": processed[2],
                  "location": processed[3],
                  "notes": []}

        records.append(record)

pprint(records)
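The script in this update keeps the notes as a list of raw lines, while the desired output earlier in the question shows them as a single string. A minimal sketch of one way to bridge that gap, assuming the note lines still carry their leading indentation and trailing newlines; the helper name `join_notes` and the sample record are hypothetical, made up for illustration:

```python
def join_notes(note_lines):
    """Collapse a list of raw note lines into one space-separated string."""
    # Strip the indentation and newline from each line, then drop blank lines.
    stripped = (line.strip() for line in note_lines)
    return ' '.join(s for s in stripped if s)


# Hypothetical record shaped like the ones built by the script above.
record = {
    'title': 'Attendance records',
    'notes': ['    1921-1927, 1932-1944\n',
              '    1937-1939,1948-1966, 1971-1979, 1989-1994, 2010\n'],
}

record['notes'] = join_notes(record['notes'])
print(record['notes'])
```

Applied to every item in `records` after the loop finishes, this would turn each list of note lines into one string.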

As you have already solved the parsing of the records, I will focus only on how to read the notes of each one:

records = []

with open('data.txt', 'r') as lines:
    for line in lines:
        if line.startswith('\t'):
            # Tab-indented line: a note for the most recent record.
            record['notes'].append(line[1:])
            continue
        # Any other line starts a new record.
        record = {'title': line, 'notes': []}
        records.append(record)

for record in records:
    print('Record is', record['title'])
    print('Notes are', record['notes'])
    print()
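Putting the two pieces together, the field parsing from the question and the note collection above could be combined roughly as follows. This is only a sketch under some assumptions: the function name `parse_records` is hypothetical, any line starting with whitespace (tab or spaces) is treated as a note, columns on parent lines are separated by two or more whitespace characters, and the sample text is a shortened stand-in for the question's example rather than the real file.

```python
import re


def parse_records(lines):
    """Return a list of record dicts, each with its notes joined into one string."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Indented line: a note for the most recent record.
            if records:  # ignore notes that appear before any record
                records[-1]['notes'].append(line.strip())
            continue
        # Unindented line: split columns on runs of 2+ whitespace characters.
        fields = re.split(r'\s{2,}', line.strip())
        title, rec_id, years, location = fields[:4]
        records.append({'title': title, 'id': rec_id, 'years': years,
                        'location': location, 'notes': []})
    for record in records:
        record['notes'] = ' '.join(record['notes'])
    return records


# Shortened sample modeled on the question's example (assumed layout).
sample = (
    "Attendance records    8 F    1921-2010    Box 2\n"
    "\t1921-1927, 1932-1944\n"
    "\t1937-1939,1948-1966, 1971-1979, 1989-1994, 2010\n"
    "Number of meetings attended each year    1 F    1991-1994    Box 2\n"
    "Papers re: Safaris    10 F    1951-2011    Box 2\n"
    "    Incomplete; some photographs.\n"
    "    See also: Correspondence and Minutes\n"
)

for record in parse_records(sample.splitlines()):
    print(record['title'], '->', record['notes'])
```

Appending to `records[-1]` instead of a separate `record` variable avoids a `NameError` if the file happens to start with an indented line.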
