
Python+parsing custom config file

I have a fairly large custom-made config file that I need to extract data from once a week. It is an in-house format that doesn't comply with any known standard such as INI.

My quick-and-dirty approach was to use re to search for the section header I want and then extract the one or two lines of information I need under that header. This is proving quite a challenge, and I think there must be an easier, more reliable way of doing it. But I keep concluding that I would need to implement a full parser for this file, only to extract the five lines of data I need.

The sections look something like this:

Registry com.name.version =
Registry "unique-name I search for using re" =
    String name = "modulename";
    String timestamp = "not specified";
    String java = "not specified";
    String user = "not specified";
    String host = "not specified";
    String system = "not specified";
    String version = "This I want";
    String "version-major" = "not specified";
    String "version-minor" = "not specified";
    String scm = "not specified";
    String scmrevision = "not specified";
    String mode = "release";
    String teamCityBuildNumber = "not specified";
;

A simple parser built with pyparsing can give you something close to a deserializer, one that lets you access fields by key name (as in a dict) or as attributes. Here is the parser:

from pyparsing import (Suppress,quotedString,removeQuotes,Word,alphas,
        alphanums, printables,delimitedList,Group,Dict,ZeroOrMore,OneOrMore)

# define punctuation and constants - suppress from parsed output
EQ,SEMI = map(Suppress,"=;")
REGISTRY = Suppress("Registry")
STRING = Suppress("String")

# define some basic building blocks
quotedString.setParseAction(removeQuotes)
ident = quotedString | Word(printables)
value = quotedString
java_path = delimitedList(Word(alphas,alphanums+"_"), '.', combine=True)

# define the config file sections
string_defn = Group(STRING + ident + EQ + value + SEMI)
registry_section = Group(REGISTRY + ident + EQ + Dict(ZeroOrMore(string_defn)))

# special definition for leading java module
java_module = REGISTRY + java_path("path") + EQ

# define the overall config file format
config = java_module("java") + Dict(OneOrMore(registry_section))

Here is a test using your data (read from your data file into config_source):

data = config.parseString(config_source)
print(data.dump())
print(data["unique-name I search for using re"].version)
print(data["unique-name I search for using re"].mode)
print(data["unique-name I search for using re"]["version-major"])

Prints:

['com.name.version', ['unique-name I search for using re', ...
- java: ['com.name.version']
  - path: com.name.version
- path: com.name.version
- unique-name I search for using re: [['name', 'modulename'], ...
  - host: not specified
  - java: not specified
  - mode: release
  - name: modulename
  - scm: not specified
  - scmrevision: not specified
  - system: not specified
  - teamCityBuildNumber: not specified
  - timestamp: not specified
  - user: not specified
  - version: This I want
  - version-major: not specified
  - version-minor: not specified
This I want
release
not specified

If you are only looking for specific content, a regexp is fine; if you need to read everything, you are better off building a parser.

>>> import re
>>> s = ''' ... ''' # as above
>>> # "unique-name" stands in for the section name you are searching for
>>> t = re.search( r'Registry "unique-name" =(.*?)\n;', s, re.S ).group( 1 )
>>> u = re.findall( r'^\s*(\w+) "?(.*?)"? = "(.*?)";\s*$', t, re.M )
>>> for x in u:
...     print( x )

('String', 'name', 'modulename')
('String', 'timestamp', 'not specified')
('String', 'java', 'not specified')
('String', 'user', 'not specified')
('String', 'host', 'not specified')
('String', 'system', 'not specified')
('String', 'version', 'This I want')
('String', 'version-major', 'not specified')
('String', 'version-minor', 'not specified')
('String', 'scm', 'not specified')
('String', 'scmrevision', 'not specified')
('String', 'mode', 'release')
('String', 'teamCityBuildNumber', 'not specified')

Edit: although the version above should work for multiple Registry sections, here is a stricter version:

t = re.search( r'Registry "unique-name"\s*=\s*((?:\s*\w+ "?[^"=]+"?\s*=\s*"[^"]*?";\s*)+)\s*;', s ).group( 1 )
# the name group is lazy so that unquoted names do not pick up the space before '='
u = re.findall( r'^\s*(\w+) "?([^"=]+?)"?\s*=\s*"([^"]*?)";\s*$', t, re.M )
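For the weekly extraction it may be handier to turn those tuples into a dictionary keyed by field name. The following is only a sketch: it assumes a shortened copy of the sample section is held in a string, and the names `sample`, `section_name`, `rows` and `fields` are illustrative rather than part of the answer above.

```python
import re

# Shortened copy of the section from the question; `sample` is an illustrative name.
sample = '''Registry com.name.version =
Registry "unique-name I search for using re" =
    String name = "modulename";
    String version = "This I want";
    String "version-major" = "not specified";
    String mode = "release";
;
'''

# Grab everything between the section header and the closing ';' line.
section_name = "unique-name I search for using re"
header = r'Registry "' + re.escape(section_name) + r'"\s*=(.*?)\n;'
body = re.search(header, sample, re.S).group(1)

# One (type, name, value) tuple per line; the lazy name group keeps the
# space before '=' out of unquoted names.
rows = re.findall(r'^\s*(\w+) "?([^"=]+?)"?\s*=\s*"([^"]*)";\s*$', body, re.M)

# Drop the type column and index the values by field name.
fields = {name: value for _type, name, value in rows}
print(fields["version"])
print(fields["version-major"])
```

Building the header pattern with `re.escape` keeps characters such as `-` or `.` in a section name from being read as regex syntax.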

Regular expressions carry no state, so on their own they can't parse complex input. But you can load the file into a string, use a regexp to find a substring, and then cut the string at that place.

In your case, search for r'unique-name I search for using re"\s*=\s*', then cut after the match. Then search for r'\n\s*;\s*\n' and cut before the match. This leaves you with the values, which you can chop up using another regexp.
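The cut-and-chop steps described here could be sketched roughly as follows. This is one possible reading of the approach, not a definitive implementation: the helper name `cut_section`, the shortened `sample` string and the final pair-extracting pattern are all assumptions made for illustration.

```python
import re

def cut_section(text, section_name):
    """Return the raw text between a section header and its closing ';' line."""
    # Step 1: find the header and cut after the match.
    start = re.search(re.escape(section_name) + r'"\s*=\s*', text)
    if start is None:
        return None
    rest = text[start.end():]
    # Step 2: find the terminating ';' line and cut before the match.
    end = re.search(r'\n\s*;\s*\n', rest)
    return rest if end is None else rest[:end.start()]

# Shortened copy of the data from the question (illustrative).
sample = '''Registry com.name.version =
Registry "unique-name I search for using re" =
    String version = "This I want";
    String mode = "release";
;
'''

chunk = cut_section(sample, "unique-name I search for using re")
# Step 3: chop what is left into name/value pairs with another regexp.
pairs = dict(re.findall(r'String\s+"?([^"=\s]+)"?\s*=\s*"([^"]*)";', chunk))
print(pairs["version"])
```

Returning `None` for a missing header lets the caller tell "section not found" apart from "section found but empty".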

I think you should create a simple parser that builds a dictionary of sections, each holding a dictionary of keys. Something like:

#!/usr/bin/python

import re

# raw strings keep backslashes such as \s from being treated as string escapes
re_section = re.compile(r'Registry (.*)=', re.IGNORECASE)
re_value = re.compile(r'\s+String\s+(\S+)\s*=\s*(.*);')

txt = '''
Registry com.name.version =
Registry "unique-name I search for using re" =
        String name = "modulename";
        String timestamp = "not specified";
        String java = "not specified";
        String user = "not specified";
        String host = "not specified";
        String system = "not specified";
        String version = "This I want";
        String "version-major" = "not specified";
        String "version-minor" = "not specified";
        String scm = "not specified";
        String scmrevision = "not specified";
        String mode = "release";
        String teamCityBuildNumber = "not specified";
'''

my_config = {}
section = ''
lines = txt.split('\n')
for l in lines:
    rx = re_section.search(l)
    if rx:
        section = rx.group(1)
        section = section.strip('" ')
        continue
    rx = re_value.search(l)
    if rx:
        (k, v) = (rx.group(1).strip('" '), rx.group(2).strip('" '))
        try:
            my_config[section][k] = v
        except KeyError:
            my_config[section] = {k: v}

Then, if you run:

print(my_config["unique-name I search for using re"]['version'])

it will output:

This I want
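Since the real data lives in a file rather than an inline string, the same loop can be wrapped in a function that accepts any iterable of lines. This is a sketch under assumptions: the names `parse_config` and `load_config`, the UTF-8 encoding and the slightly tightened patterns are illustrative choices, not part of the answer above.

```python
import re

# Same idea as the patterns above, anchored to the end of the line.
RE_SECTION = re.compile(r'Registry\s+(.*?)\s*=\s*$', re.IGNORECASE)
RE_VALUE = re.compile(r'^\s+String\s+(\S+)\s*=\s*(.*);\s*$')

def parse_config(lines):
    """Build {section: {key: value}} from an iterable of config lines."""
    config = {}
    section = ''
    for line in lines:
        m = RE_SECTION.search(line)
        if m:
            # A header line starts a new section; remember its name.
            section = m.group(1).strip('" ')
            continue
        m = RE_VALUE.search(line)
        if m:
            key = m.group(1).strip('" ')
            value = m.group(2).strip('" ')
            # setdefault creates the inner dict the first time a section is seen.
            config.setdefault(section, {})[key] = value
    return config

def load_config(path):
    """Read the file at `path` (a hypothetical location) and parse it."""
    with open(path, encoding='utf-8') as handle:
        return parse_config(handle)
```

With this in place, `load_config(path)["unique-name I search for using re"]["version"]` would look up the wanted value, assuming the file follows the layout shown in the question.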


 