简体   繁体   中英

Reg-ex to parse a list with bracketed nested sub lists?

I'm trying to refactor some Python code that parses a string in the following format:

thing_1,thing_2,things_3[thing_31,thing_32,thing_34[thing_341]],thing_5

where the end result is a structured result like:

{
   "thing_1": True,
   "thing_2": True,
   "thing_3": {
      "thing_31": True,
      "thing_32": True,
      "thing_34: {
          "thing_341": True
      }
   },
   "thing_5": True,
}

In practice this is a field list for an API request (where only the given fields are returned) with support for defining required fields for nested objects.

I've been trying various approaches on how to write the reg ex (if that is at all possible). My thought was to parse it first on the brackets' contents, while retaining the first element before each bracket, and in the end I'm left with just the outer top level list. But that proves more difficult to describe in regex than it is to 'say' it in English.

Some notable attempts are below but the grouping is all wrong there.

(([a-z0-9_]+)(\[[a-z0-9,_*]+\]*)+)

([a-z0-9_]+)(\[[a-z0-9,_*]+\]*)

(?<=[a-z0-9_])(\[[a-z0-9,_*]+\]*)

Is this even possible to do in an elegant way?

Thank you!

Since you already have a parser and just wanted to know of an alternative way, you may consider

import json, re
s = "thing_1,thing_2,things_3[thing_31,thing_32,thing_34[thing_341]],thing_5"
s = re.sub(r'\w+(?![\[\w])', r'"\g<0>": true', s)
js = json.loads('{' + re.sub(r'\w+(?=\[)', r'"\g<0>":', s).replace('[', '{').replace(']', '}') + '}')
print (json.dumps(js, indent=4, sort_keys=True))

Output:

{
    "thing_1": true,
    "thing_2": true,
    "thing_5": true,
    "things_3": {
        "thing_31": true,
        "thing_32": true,
        "thing_34": {
            "thing_341": true
        }
    }
}

See the Python demo online .

NOTES:

  • re.sub(r'\w+(?,[\[\w])': r'"\g<0>", true', s) - wraps all chunks of 1+ word chars that are not immediately followed with [ with double quotation marks, and appends : true after them
  • re.sub(r'\w+(?=\[)', r'"\g<0>":', s) - wraps all chunks of 1+ word chars that ARE immediately followed with [ with double quotation marks, and appends : after them
  • .replace('[', '{').replace(']', '}') replaces all [ with { and ] with }
  • To parse the string as JSON, the result is wrapped with {...}
  • After parsing with json.loads(s) , json.dumps(js, indent=4, sort_keys=True) pretty-prints the json and dumps it.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM