Python: parsing a custom config file

I have a fairly large custom-made config file that I need to extract data from once a week. It is an "in-house" format that doesn't comply with any known standard such as INI.

My quick-and-dirty approach was to use re to search for the section header I want and then extract the one or two lines of information under that header. This is proving quite a challenge, and I suspect there must be an easier, more reliable way of doing this. But I keep coming back to the idea that I would have to implement a full parser for this file, only to extract the 5 lines of data I need.

The "sections" look something like this:

Registry com.name.version =
Registry "unique-name I search for using re" =
    String name = "modulename";
    String timestamp = "not specified";
    String java = "not specified";
    String user = "not specified";
    String host = "not specified";
    String system = "not specified";
    String version = "This I want";
    String "version-major" = "not specified";
    String "version-minor" = "not specified";
    String scm = "not specified";
    String scmrevision = "not specified";
    String mode = "release";
    String teamCityBuildNumber = "not specified";
;

A simple parser using pyparsing can give you something close to a deserializer, letting you access fields by key name (as in a dict) or as attributes. Here is the parser:

from pyparsing import (Suppress,quotedString,removeQuotes,Word,alphas,
        alphanums, printables,delimitedList,Group,Dict,ZeroOrMore,OneOrMore)

# define punctuation and constants - suppress from parsed output
EQ,SEMI = map(Suppress,"=;")
REGISTRY = Suppress("Registry")
STRING = Suppress("String")

# define some basic building blocks
quotedString.setParseAction(removeQuotes)
ident = quotedString | Word(printables)
value = quotedString
java_path = delimitedList(Word(alphas,alphanums+"_"), '.', combine=True)

# define the config file sections
string_defn = Group(STRING + ident + EQ + value + SEMI)
registry_section = Group(REGISTRY + ident + EQ + Dict(ZeroOrMore(string_defn)))

# special definition for leading java module
java_module = REGISTRY + java_path("path") + EQ

# define the overall config file format
config = java_module("java") + Dict(OneOrMore(registry_section))

Here is a test using your data (read from your data file into config_source):

data = config.parseString(config_source)
print(data.dump())
print(data["unique-name I search for using re"].version)
print(data["unique-name I search for using re"].mode)
print(data["unique-name I search for using re"]["version-major"])

Prints:

['com.name.version', ['unique-name I search for using re', ...
- java: ['com.name.version']
  - path: com.name.version
- path: com.name.version
- unique-name I search for using re: [['name', 'modulename'], ...
  - host: not specified
  - java: not specified
  - mode: release
  - name: modulename
  - scm: not specified
  - scmrevision: not specified
  - system: not specified
  - teamCityBuildNumber: not specified
  - timestamp: not specified
  - user: not specified
  - version: This I want
  - version-major: not specified
  - version-minor: not specified
This I want
release
not specified

If you are only looking for specific content, using a regexp is fine; if you need to read everything, you are better off building a parser.

>>> import re
>>> s = ''' ... ''' # as above
>>> t = re.search(r'Registry "unique-name" =(.*?)\n;', s, re.S).group(1)
>>> u = re.findall(r'^\s*(\w+) "?(.*?)"? = "(.*?)";\s*$', t, re.M)
>>> for x in u:
...     print(x)

('String', 'name', 'modulename')
('String', 'timestamp', 'not specified')
('String', 'java', 'not specified')
('String', 'user', 'not specified')
('String', 'host', 'not specified')
('String', 'system', 'not specified')
('String', 'version', 'This I want')
('String', 'version-major', 'not specified')
('String', 'version-minor', 'not specified')
('String', 'scm', 'not specified')
('String', 'scmrevision', 'not specified')
('String', 'mode', 'release')

Edit: although the above version should work with multiple Registry sections, here is a stricter version:

t = re.search(r'Registry "unique-name"\s*=\s*((?:\s*\w+ "?[^"=]+"?\s*=\s*"[^"]*?";\s*)+)\s*;', s).group(1)
u = re.findall(r'^\s*(\w+) "?([^"=]+)"?\s*=\s*"([^"]*?)";\s*$', t, re.M)
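
Since only a few values are needed, the tuples returned by re.findall can be collapsed into a dict keyed by field name. This is a small sketch, assuming the stricter patterns above and a shortened, made-up sample for s; the names section_re, entry_re and fields are only illustrative. The .strip() is there because, for unquoted keys, the [^"=]+ group also captures the space before the = sign.

```python
import re

# Shortened, made-up sample; the real file has more String lines.
s = '''Registry "unique-name" =
    String name = "modulename";
    String version = "This I want";
    String "version-major" = "not specified";
;
'''

section_re = r'Registry "unique-name"\s*=\s*((?:\s*\w+ "?[^"=]+"?\s*=\s*"[^"]*?";\s*)+)\s*;'
entry_re = r'^\s*(\w+) "?([^"=]+)"?\s*=\s*"([^"]*?)";\s*$'

t = re.search(section_re, s).group(1)

# Each match is (type, key, value); keep only key -> value.
fields = {key.strip(): value
          for _type, key, value in re.findall(entry_re, t, re.M)}

print(fields["version"])
```

After that, fields["version"] and fields["version-major"] can be looked up directly instead of scanning the list of tuples.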

Regular expressions keep no state, so you can't use them on their own to parse complex input. But you can load the file into a string, use a regexp to find a substring, and then cut the string at that place.

In your case, search for r'unique-name I search for using re"\s*=\s*' , then cut after the match. Then search for r'\n\s*;\s*\n' and cut before the match. This leaves you with the values, which you can chop up using another regexp.
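
A minimal sketch of that cut-and-chop approach, assuming the file has already been read into a string. The function name extract_section and the shortened sample text are hypothetical, made up for illustration; the two cutting patterns are the ones described above, and the third pattern is just one possible way to chop the remaining lines.

```python
import re

def extract_section(text, name):
    """Return {key: value} for the String entries under Registry "<name>"."""
    # Cut after the section header:  <name>" =
    start = re.search(re.escape(name) + r'"\s*=\s*', text)
    if start is None:
        raise KeyError(name)
    rest = text[start.end():]

    # Cut before the line that holds only the terminating ';'
    end = re.search(r'\n\s*;\s*\n', rest)
    body = rest[:end.start()] if end else rest

    # Chop what is left into key/value pairs (keys may be quoted)
    pairs = re.findall(r'String\s+"?([^"=\s]+)"?\s*=\s*"([^"]*)";', body)
    return dict(pairs)

# Shortened, made-up sample in the format shown in the question
sample = '''Registry com.name.version =
Registry "unique-name I search for using re" =
    String name = "modulename";
    String version = "This I want";
    String "version-major" = "not specified";
;
'''

section = extract_section(sample, "unique-name I search for using re")
print(section["version"])
```

Note that the second pattern expects a newline after the closing ';', so a file that ends right at the ';' without one would need that pattern loosened.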

I think you should create a simple parser that builds a dictionary of sections, each holding a dictionary of keys. Something like:

#!/usr/bin/python

import re

# raw strings keep the regex backslashes intact
re_section = re.compile(r'Registry (.*)=', re.IGNORECASE)
re_value = re.compile(r'\s+String\s+(\S+)\s*=\s*(.*);')

txt = '''
Registry com.name.version =
Registry "unique-name I search for using re" =
        String name = "modulename";
        String timestamp = "not specified";
        String java = "not specified";
        String user = "not specified";
        String host = "not specified";
        String system = "not specified";
        String version = "This I want";
        String "version-major" = "not specified";
        String "version-minor" = "not specified";
        String scm = "not specified";
        String scmrevision = "not specified";
        String mode = "release";
        String teamCityBuildNumber = "not specified";
'''

my_config = {}
section = ''
for line in txt.split('\n'):
    # a "Registry ... =" line starts a new section
    rx = re_section.search(line)
    if rx:
        section = rx.group(1).strip('" ')
        continue
    # a "String key = value;" line adds an entry to the current section
    rx = re_value.search(line)
    if rx:
        k, v = rx.group(1).strip('" '), rx.group(2).strip('" ')
        my_config.setdefault(section, {})[k] = v

Then if you run:

print(my_config["unique-name I search for using re"]['version'])

it will output:

This I want
