
How to parse code (in Python)?

I need to parse some special data structures. They are in a somewhat C-like format that looks roughly like this:

Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
         );
}

I can think of several ways to go about this. I could 'tokenize' the code using regular expressions. I could read the code one character at a time and use a state machine to construct my data structure. I could get rid of the line breaks that follow commas and read the thing line by line. I could write some conversion script that converts this code to executable Python code.
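As a rough illustration of the first option, here is a minimal regex tokenizer sketch using only the standard `re` module. The token names (`COMMENT`, `STRING`, `REAL`, and so on) and the `tokenize` helper are made up for this example, and it assumes strings contain no escaped quotes. It only produces a flat token stream; building the nested structure would still need a parser on top.

```python
import re

# Token patterns, tried in order. The names are illustrative only.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/"),      # C-style comment (non-greedy)
    ("STRING",  r'"[^"]*"'),        # double-quoted string, escapes not handled
    ("REAL",    r"[+-]?\d+\.\d*"),  # must be tried before INT
    ("INT",     r"[+-]?\d+"),
    ("NAME",    r"[A-Za-z_]\w*"),   # keywords such as Group / Entry
    ("PUNCT",   r"[{}();,]"),
    ("SKIP",    r"\s+"),            # whitespace, including line breaks
    ("ERROR",   r"."),              # anything else is an error
]
TOKEN_RE = re.compile(
    "|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC), re.DOTALL
)

def tokenize(text):
    """Yield (kind, value) pairs, dropping whitespace and comments."""
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind in ("SKIP", "COMMENT"):
            continue
        if kind == "ERROR":
            raise ValueError(
                "unexpected character %r at offset %d" % (value, match.start())
            )
        if kind == "STRING":
            value = value[1:-1]      # strip the surrounding quotes
        elif kind == "REAL":
            value = float(value)
        elif kind == "INT":
            value = int(value)
        yield kind, value

sample = 'Entry("some", /* note */ "variables", 0,\n 3.141);'
tokens = list(tokenize(sample))
print(tokens)
```

Because whitespace is just another skipped token, entries that span several lines need no special handling.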

Is there a nice Pythonic way to parse files like this?
How would you go about parsing it?

This is more a general question about how to parse strings and not so much about this particular file format.

Using pyparsing (Mark Tolonen, I was just about to click "Submit Post" when your post came thru), this is pretty straightforward - see comments embedded in the code below:

data = """Group("GroupName") { 
    /* C-Style comment */ 
    Group("AnotherGroupName") { 
        Entry("some","variables",0,3.141); 
        Entry("other","variables",1,2.718); 
    } 
    Entry("linebreaks", 
          "allowed", 
          3, 
          1.414 
         ); 
} """

from pyparsing import *

# define basic punctuation and data types
LBRACE,RBRACE,LPAREN,RPAREN,SEMI = map(Suppress,"{}();")
GROUP = Keyword("Group")
ENTRY = Keyword("Entry")

# use parse actions to do parse-time conversion of values
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))

# parses a string enclosed in quotes, but strips off the quotes at parse time
string = QuotedString('"')

# define structure expressions
value = string | real | integer
entry = Group(ENTRY + LPAREN + Group(Optional(delimitedList(value)))) + RPAREN + SEMI

# since Groups can contain Groups, need to use a Forward to define recursive expression
group = Forward()
group << Group(GROUP + LPAREN + string("name") + RPAREN + 
            LBRACE + Group(ZeroOrMore(group | entry))("body") + RBRACE)

# ignore C style comments wherever they occur
group.ignore(cStyleComment)

# parse the sample text
result = group.parseString(data)

# print out the tokens as a nice indented list using pprint
from pprint import pprint
pprint(result.asList())

Prints:

[['Group',
  'GroupName',
  [['Group',
    'AnotherGroupName',
    [['Entry', ['some', 'variables', 0, 3.141]],
     ['Entry', ['other', 'variables', 1, 2.718]]]],
   ['Entry', ['linebreaks', 'allowed', 3, 1.4139999999999999]]]]]

(Unfortunately, there may be some confusion since pyparsing defines a "Group" class, for imparting structure to the parsed tokens - note how the value lists in an Entry get grouped because the list expression is enclosed within a pyparsing Group.)
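Once you have that nested list, plain Python is enough to turn it into whatever structure you need. Here is a small sketch that assumes exactly the list shape printed above (`['Group', name, body]` and `['Entry', values]`); the `to_dict` helper and the dictionary keys are made up for this example and are not part of pyparsing.

```python
# The nested list printed above, copied here so the sketch runs on its own.
parsed = [['Group', 'GroupName',
           [['Group', 'AnotherGroupName',
             [['Entry', ['some', 'variables', 0, 3.141]],
              ['Entry', ['other', 'variables', 1, 2.718]]]],
            ['Entry', ['linebreaks', 'allowed', 3, 1.414]]]]]

def to_dict(node):
    """Convert one ['Group', name, body] or ['Entry', values] node to a dict."""
    kind = node[0]
    if kind == 'Entry':
        return {'type': 'entry', 'values': list(node[1])}
    if kind == 'Group':
        name, body = node[1], node[2]
        # Recurse into the body, which may hold both groups and entries.
        return {'type': 'group', 'name': name,
                'children': [to_dict(child) for child in body]}
    raise ValueError('unknown node kind: %r' % (kind,))

tree = to_dict(parsed[0])
print(tree['name'])
print([child['type'] for child in tree['children']])
```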

Check out pyparsing. It has lots of parsing examples.

Depends on how often you need this and whether the syntax stays the same. If the answers are "quite often" and "more or less yes", then I would look for a way to express the syntax and write a specific parser for that language with a tool like PyPEG or LEPL. Defining the parser rules is the big job, though, so unless you need to parse the same kind of files often, it might not be worth the effort.

But if you look at the PyPEG page, it tells you how to output the parsed data to XML, so if that tool doesn't give you enough power, you could use it to generate the XML and then use e.g. lxml to parse the XML.
