
What's the best way (error-proof / foolproof) to parse a file with the following format using Python?

########################################
# some comment
# other comment
########################################

block1 {
    value=data
    some_value=some other kind of data
    othervalue=032423432
    }

block2 {
    value=data
    some_value=some other kind of data
    othervalue=032423432
    }

The best way would be to use an existing format such as JSON.
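
As a rough illustration of that suggestion, the sample data could be written as JSON and loaded with the standard library's json module. The layout below (one object per block) is only an assumption about how the data might be restructured; it is not taken from the question.

```python
import json

# Hypothetical JSON rendering of the sample file: one object per block.
text = """
{
  "block1": {
    "value": "data",
    "some_value": "some other kind of data",
    "othervalue": 32423432
  },
  "block2": {
    "value": "data",
    "some_value": "some other kind of data",
    "othervalue": 32423432
  }
}
"""

try:
    config = json.loads(text)
except json.JSONDecodeError as exc:
    # Malformed input is reported with a line and column number.
    raise SystemExit("invalid config: %s" % exc)

print(config["block1"]["othervalue"])
```

Keep in mind that JSON has no comment syntax and rejects numbers with leading zeros, so 032423432 would have to be written as 32423432 or stored as a string.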

Here's an example parser for your format:

from lepl import (AnyBut, Digit, Drop, Eos, Integer, Letter,
                  NON_GREEDY, Regexp, Space, Separator, Word)

# EBNF
# name = ( letter | "_" ) , { letter | "_" | digit } ;
name = Word(Letter() | '_',
            Letter() | '_' | Digit())
# words = word , space+ , word , { space+ , word } ;
# two or more space-separated words (non-greedy to allow comment at the end)
words = Word()[2::NON_GREEDY, ~Space()[1:]] > list
# value = integer | word | words  ;
value = (Integer() >> int) | Word() | words
# comment = "#" , { all characters - "\n" } , ( "\n" | EOF ) ;
comment = '#' & AnyBut('\n')[:] & ('\n' | Eos())

with Separator(~Regexp(r'\s*')):
    # statement = name , "=" , value ;
    statement = name & Drop('=') & value > tuple
    # suite     = "{" , { comment | statement } , "}" ;
    suite     = Drop('{') & (~comment | statement)[:] & Drop('}') > dict
    # block     = name , suite ;
    block     = name & suite > tuple
    # config    = { comment | block } ;
    config    = (~comment | block)[:] & Eos() > dict

from pprint import pprint

# read the whole file, closing it once parsing is done
with open('input.cfg') as f:
    pprint(config.parse(f.read()))

Output:

[{'block1': {'othervalue': 32423432,
             'some_value': ['some', 'other', 'kind', 'of', 'data'],
             'value': 'data'},
  'block2': {'othervalue': 32423432,
             'some_value': ['some', 'other', 'kind', 'of', 'data'],
             'value': 'data'}}]

Well, the data looks pretty regular. So you could do something like this (untested):

class Block(object):
    def __init__(self, name):
        self.name = name

current = None
blocks = []

with open(...) as infile:  # insert filename here
    for line in infile:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        elif line.endswith('{'):
            current = Block(line.split()[0])  # start a new block
        elif '=' in line:
            # split on the first '=' only, so values may contain '='
            attr, value = line.split('=', 1)
            try:
                value = int(value)
            except ValueError:
                value = value.strip()  # keep non-numeric values as strings
            setattr(current, attr.strip(), value)
        elif line.endswith('}'):
            blocks.append(current)  # block is complete

The result will be a list of Block instances, where block.name is the block's name ('block1', 'block2', etc.) and the other attributes correspond to the keys in your data. So blocks[0].value will be 'data', and so on. Note that this handles only strings and integers as values.

(There is an obvious bug here if your keys can ever include 'name'. If that can happen, you might want to change self.name to self._name or something similar.)
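
One way to sidestep that collision, sketched here under the same assumptions about the input, is to keep each block's keys in a plain dict rather than as attributes. The function name parse_blocks and the inline sample are made up for illustration.

```python
def parse_blocks(lines):
    """Return {block_name: {key: value}} from an iterable of lines."""
    blocks = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # blank line or comment
        if line.endswith('{'):
            # everything before the brace is taken as the block name
            current = blocks.setdefault(line[:-1].strip(), {})
        elif line == '}':
            current = None
        elif '=' in line and current is not None:
            key, _, value = line.partition('=')
            value = value.strip()
            try:
                value = int(value)
            except ValueError:
                pass  # keep non-numeric values as strings
            current[key.strip()] = value
    return blocks


sample = """\
# some comment
block1 {
    name=a key called name is fine here
    othervalue=032423432
    }
"""
blocks = parse_blocks(sample.splitlines())
print(blocks['block1']['name'])
```

Like the loop above, this quietly ignores lines it does not understand, so it is only suitable when the input really is that regular.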

HTH!

If you don't really mean parsing but rather text processing, and the input data really is that regular, then go with John's solution. If you really do need parsing (for example, the data follows somewhat more complex rules), then depending on how much data you need to parse, I'd go with either pyparsing or simpleparse. I've tried both, and pyparsing turned out to be too slow for me.

You might look into something like pyparsing.

Grako (for grammar compiler) lets you separate the input format specification (grammar) from its interpretation (semantics). Here's a grammar for your input format in Grako's variety of EBNF:

(* a file contains zero or more blocks *)
file = {block} $;
(* a named block has at least one assignment statement *)
block = name '{' {assignment}+ '}';
assignment = name '=' value NEWLINE;
name = /[a-z][a-z0-9_]*/;
value = integer | string;
NEWLINE = /\n/;
integer = /[0-9]+/;
(* string value is everything until the next newline *)
string = /[^\n]+/;

To install grako, run pip install grako. To generate the PEG parser from the grammar:

$ grako -o config_parser.py Config.ebnf

To convert stdin into JSON using the generated config_parser module:

#!/usr/bin/env python
import json
import sys
from config_parser import ConfigParser

class Semantics(object):
    def file(self, ast):
        # file = {block} $
        # all blocks should have unique names within the file
        return dict(ast)
    def block(self, ast):
        # block = name '{' {assignment}+ '}'
        # all assignment statements should use unique names
        return ast[0], dict(ast[2])
    def assignment(self, ast):
        # assignment = name '=' value NEWLINE
        # value = integer | string
        return ast[0], ast[2] # name, value
    def integer(self, ast):
        return int(ast)
    def string(self, ast):
        return ast.strip() # remove leading/trailing whitespace

parser = ConfigParser(whitespace='\t\n\v\f\r ', eol_comments_re="#.*?$")
ast = parser.parse(sys.stdin.read(), rule_name='file', semantics=Semantics())
json.dump(ast, sys.stdout, indent=2, sort_keys=True)

Output:

{
  "block1": {
    "othervalue": 32423432,
    "some_value": "some other kind of data",
    "value": "data"
  },
  "block2": {
    "othervalue": 32423432,
    "some_value": "some other kind of data",
    "value": "data"
  }
}
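
For comparison, the grammar is small enough to follow loosely by hand with only the standard library. The sketch below is not generated by Grako and is not an exact equivalent of the grammar; the names ConfigError and parse_config are hypothetical, and it assumes, as in the sample, that each block header, assignment, and closing brace sits on its own line. It raises an error with a line number rather than skipping anything it does not recognise.

```python
import re

HEADER = re.compile(r'^([a-z][a-z0-9_]*)\s*\{$')
ASSIGNMENT = re.compile(r'^([a-z][a-z0-9_]*)\s*=\s*(.+)$')
INTEGER = re.compile(r'^[0-9]+$')


class ConfigError(ValueError):
    """Raised for any line that does not fit the expected format."""


def parse_config(text):
    blocks, current, name = {}, None, None
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # blank line or comment
        if current is None:
            # outside a block, only a block header is allowed
            match = HEADER.match(line)
            if not match:
                raise ConfigError('line %d: expected "name {"' % lineno)
            name = match.group(1)
            if name in blocks:
                raise ConfigError('line %d: duplicate block %r' % (lineno, name))
            current = {}
        elif line == '}':
            # a block needs at least one assignment
            if not current:
                raise ConfigError('line %d: block %r is empty' % (lineno, name))
            blocks[name] = current
            current = None
        else:
            match = ASSIGNMENT.match(line)
            if not match:
                raise ConfigError('line %d: expected "key=value"' % lineno)
            key, value = match.groups()
            if key in current:
                raise ConfigError('line %d: duplicate key %r' % (lineno, key))
            current[key] = int(value) if INTEGER.match(value) else value
    if current is not None:
        raise ConfigError('unexpected end of file inside block %r' % name)
    return blocks


sample = """\
# some comment
block1 {
    value=data
    othervalue=032423432
    }
"""
print(parse_config(sample))
```

Failing loudly on duplicate names, empty blocks, and unrecognised lines is a design choice aimed at the "error proof" part of the question; relax those checks if your files are less strict.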
