Parsing a structured text file in Python (pyparsing)
For reasons I really do not understand, a REST API I'm using, instead of outputting JSON or XML, uses a peculiar structured text format. In its simplest form:
SECTION_NAME entry other qualifying bits of the entry
entry2 other qualifying bits
...
They are not tab-delimited, as the structure might suggest, but space-delimited, and the qualifying bits may contain words with spaces. The space between SECTION_NAME and the entries is also variable, ranging from one to several (6 or more) spaces.
Also, one part of the format contains entries in the form:
SECTION_NAME entry
SUB_SECTION more information
SUB_SECTION2 more information
For reference, an extract of real data (some sections omitted) that shows how the structure is used:
ENTRY       hsa04064            Pathway
NAME        NF-kappa B signaling pathway - Homo sapiens (human)
DRUG        D09347  Fostamatinib (USAN)
            D09348  Fostamatinib disodium (USAN)
            D09692  Veliparib (USAN/INN)
            D09730  Olaparib (JAN/INN)
            D09913  Iniparib (USAN/INN)
REFERENCE   PMID:21772278
  AUTHORS   Oeckinghaus A, Hayden MS, Ghosh S
  TITLE     Crosstalk in NF-kappaB signaling pathways.
  JOURNAL   Nat Immunol 12:695-708 (2011)
As I'm trying to parse this weird format into something saner (a dictionary that can then be converted to JSON), I'm unsure what to do: splitting blindly on spaces makes a mess (it also breaks apart information that contains spaces), and I'm not sure how to tell when a section starts or ends. Is text manipulation enough for the job, or should I use more sophisticated methods?
EDIT:
I started using pyparsing for the job, but multi-line records baffle me. Here's an example with DRUG:
from pyparsing import *
punctuation = ",.'`&-"
special_chars = "\\()[]"
drug = Keyword("DRUG")
drug_content = Word(alphanums) + originalTextFor(OneOrMore(Word(
    alphanums + special_chars))) + ZeroOrMore(LineEnd())
drug_lines = OneOrMore(drug_content)
drug_parser = drug + drug_lines
When applied to the first 3 lines of DRUG in the example, I get a wrong result (\n converted to actual line breaks for readability):
['DRUG', ['D09347', 'Fostamatinib (USAN)
D09348 Fostamatinib disodium (USAN)
D09692 Veliparib (USAN']]
As you can see, the subsequent entries get lumped together, while I'd expect:
['DRUG', [['D09347', 'Fostamatinib (USAN)'], ["D09348", "Fostamatinib disodium (USAN)"],
['D09692', ' Veliparib (USAN)']]]
I'd recommend you use a parser-based approach. For example, Python PLY can be used for the task at hand.
The best approach is to use regular expressions, like:
import re

entry_re = re.compile(r'^ENTRY\s+(.*)$')
m = entry_re.search(line)
if m:
    entry = m.groups()[0].strip()
For lines without an entry keyword, you should use the last entry you detected.
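A minimal, stdlib-only sketch of that carry-forward idea (the `parse_sections` helper and the keyword regex are my own assumptions, not part of the API's spec — it treats any line starting with an uppercase word followed by whitespace as a new section, and attaches every other line to the last section seen):

```python
import re

# Assumption: section keywords are runs of uppercase letters/underscores
# followed by at least one space (so "D09348 ..." does not match).
SECTION_RE = re.compile(r'^([A-Z_]+)\s+(.*)$')

def parse_sections(text):
    """Group lines under the most recently seen section keyword."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        m = SECTION_RE.match(line)
        if m:
            current = m.group(1)
            sections.setdefault(current, []).append(m.group(2).strip())
        elif current is not None:
            # No keyword at the start: continuation line, carry the
            # last detected section forward.
            sections[current].append(line.strip())
    return sections
```

Note this sketch treats sub-sections like AUTHORS as top-level keys; a real parser would also need a rule (e.g. leading indentation) to nest them under REFERENCE.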
A simpler approach is to split on the entry keyword, for example:
vals = line.split('DRUG')
if len(vals) > 1:
    drug_field = vals[1].strip()
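Building on that, here is a sketch (the `split_drug_entries` helper is hypothetical) that turns the lines of a DRUG block into the nested [code, description] pairs the question expects, splitting each line only on the first run of spaces so descriptions keep their internal spaces:

```python
def split_drug_entries(lines):
    """Split each DRUG line into a [code, description] pair.

    Assumes drug codes never contain spaces, so splitting on the
    first space leaves the rest of the line intact as the description.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if line.startswith("DRUG"):
            # First line of the section: drop the keyword itself
            line = line[len("DRUG"):].strip()
        code, _, desc = line.partition(" ")
        entries.append([code, desc.strip()])
    return entries
```

The `strip()` calls absorb the variable-width spacing the question describes, so one to six (or more) spaces between fields all parse the same way.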