REGEX in Python: what's wrong with (?<!\\)\".+(?<!\\)\"?

I'm trying to parse JSON key names within quotes, including escaped quotes. My thinking is: take anything between quotes that are not prefixed with \

(?<!\\)\".+(?<!\\)\"

where (?<!\\)\" should match " but not \". However, Python complains about an unbalanced parenthesis. If I use (?<!\\\)\" Python is happy, but this doesn't work:

re.findall('(?<!\\\)\".+(?<!\\\)\"','"this is \"the\". key"."and this.is.the.child"')

This returns:

['"this is "the". key"."and this.is.the.child"']

when I expect:

['"this is "the". key"', '"and this.is.the.child"']

That is, the string should be split at the dot that sits between two unescaped double quotes.

I feel like I need an 'anything but an unescaped double quote' in the middle. But while [^"] matches anything but a double quote, I don't know how to negate the (?<!\\\)\" expression inside a [ ] set, which treats its characters as literals. I would want something like [^(?<!\\\)\"] but that doesn't work.

I tried things like [[^"]|(\")]+ (anything but a double quote, or a \") but that doesn't seem to work either...

I can't seem to find the right way to do this... Any ideas?

Thanks for the help.

EDIT:

My real goal is to be able to split full 'text' JSON key names to transform them into alphanum only values. The transform is irrelevant here, but the goal is to split the keys to represent the hierarchy properly. The keys are in text form.

EDIT 2:

Even though OmnipotentEntity is most likely right, writing a parser will have to wait. The solution below doesn't support the "\\" or "\\\\" cases pointed out in his comments.

I settled on

"(?:\\"|[^"])*?"|(?<=\.)[^".]+?(?=\.)|^[^".]+?(?=\.)|(?<=\.)[^".]+?$

inspired by the answer from Avinash Raj, but adding support for keys that are not enclosed in double quotes: an unquoted key at the beginning of the line followed by a dot (key.), one between two dots (.key.), and one at the end of the line (.lastkey). As a sanity check, when substituting every match with the empty string using the same regex, one should be left with one fewer element (the separating dots) than the number of strings found; otherwise there is an error. Something like .. outside "" will fail that test.
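
The sanity check described above could be sketched roughly as follows. The function name split_keys and the sample path are made up for illustration; only the pattern itself comes from the post. It carries the limitation already mentioned: escaped backslashes in front of a closing quote are not handled.

```python
import re

# Pattern copied from the post: quoted keys, plus unquoted keys at the
# start, in the middle, or at the end of a dotted path.
KEY_RE = re.compile(
    r'"(?:\\"|[^"])*?"'        # quoted key, \" allowed inside
    r'|(?<=\.)[^".]+?(?=\.)'   # .key.
    r'|^[^".]+?(?=\.)'         # key. at the beginning of the line
    r'|(?<=\.)[^".]+?$'        # .lastkey at the end of the line
)


def split_keys(path):
    """Hypothetical helper: return the keys, or raise if the check fails."""
    keys = KEY_RE.findall(path)
    leftover = KEY_RE.sub('', path)
    # After removing every key, only the separating dots should remain:
    # exactly one fewer than the number of keys found.
    if leftover != '.' * (len(keys) - 1):
        raise ValueError('unexpected leftover %r in %r' % (leftover, path))
    return keys


for key in split_keys(r'root."this is \"the\". key"."and this.is.the.child".leaf'):
    print(key)
```

With this check, an input such as a..b is rejected because two dots are left over for only two keys.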

Fundamentally, using a regular expression to match quoted strings is impossible in the general case. JSON is not a regular language (all regular languages are LL(1), but not all LL(1) languages are regular, and JSON is one of these), so it cannot be matched by a regular expression.

Avinash Raj's regular expression (?<!\\)".*?(?<!\\)", for instance, fails on the case "\\", because the closing quote is preceded by a \ but that backslash doesn't function as an escape. You can't just special-case this situation, because then "\\\"" will fail. And if you special-case that one too, you can just use 4 \ and then 5 \, etc.
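
A small sketch of that failure, assuming a made-up input with two keys where the first key is just an escaped backslash (the names LOOKBEHIND_RE, s and matches are only for illustration):

```python
import re

# The lookbehind-based pattern discussed above.
LOOKBEHIND_RE = re.compile(r'(?<!\\)".*?(?<!\\)"')

# Two keys separated by a dot: "\\" and "b".
s = r'"\\"."b"'

matches = LOOKBEHIND_RE.findall(s)
# The closing quote of the first key is preceded by a backslash, so the
# lookbehind rejects it and the match runs on to the next quote instead.
print(matches)
```

Instead of the two keys, findall returns a single match that swallows the dot and the opening quote of the second key.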

Lookbehinds aren't part of standard regular expressions, so patterns that use them can match more than just regular grammars. So you might be able to come up with a regular expression that works in this case. However, I would recommend writing a parser instead; parsers are very easy to write for LL(1) grammars. It will be easier, more understandable, and less brittle, and it will give you more leverage to deal with non-conformant JSON and to write better diagnostic messages in that case.
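
As a rough idea of what such a parser could look like for the key-splitting problem, here is a minimal hand-written scanner. It is only a sketch: the name split_key_path is made up, it assumes keys are separated by dots and that a backslash inside quotes escapes whatever character follows, and it does not validate empty keys or the JSON escape set.

```python
def split_key_path(path):
    """Split a dotted key path; dots inside double quotes do not split."""
    keys = []
    current = []
    in_quotes = False
    i = 0
    while i < len(path):
        ch = path[i]
        if in_quotes and ch == '\\':
            # A backslash escapes the next character, whatever it is, so
            # "\\" ends after the second backslash and \" stays inside.
            if i + 1 >= len(path):
                raise ValueError('dangling backslash at end of input')
            current.append(path[i:i + 2])
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == '.' and not in_quotes:
            # An unquoted dot closes the current key.
            keys.append(''.join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError('unterminated quoted key')
    keys.append(''.join(current))
    return keys


print(split_key_path(r'"this is \"the\". key"."and this.is.the.child"'))
print(split_key_path(r'"\\"."b"'))
```

Because the scanner consumes the backslash and the character after it together, the number of backslashes in front of a quote no longer needs special-casing.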

Try defining your regex (and your test string) using raw string notation.

>>> import re
>>> s = r'"this is \"the\". key"."and this.is.the.child"'
>>> re.findall(r'"(?:\\"|[^"])*?"', s)
['"this is \\"the\\". key"', '"and this.is.the.child"']

OR

>>> re.findall(r'(?<!\\)".*?(?<!\\)"', s)
['"this is \\"the\\". key"', '"and this.is.the.child"']
  • (?<!\\) is a negative lookbehind, which asserts that the match is not preceded by a backslash.

  • " matches a double quote.

  • .*?(?<!\\)" matches any characters non-greedily, up to the next double quote that is not preceded by a backslash.
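
To see why raw string notation matters here, the following sketch compares what Python actually stores for the literals from the question with and without the r prefix. The variable names are made up for illustration.

```python
import re

# Without the r prefix, \" is an escape sequence for a plain double quote,
# so the backslashes never reach the string.
plain = '"this is \"the\". key"."and this.is.the.child"'
raw = r'"this is \"the\". key"."and this.is.the.child"'

print('\\' in plain)  # False: no backslash left in the data
print('\\' in raw)    # True: the backslashes are preserved

# The same happens to the pattern: '(?<!\\)\"' is stored as (?<!\)" and
# the regex engine then reads \) as an escaped parenthesis.
non_raw_pattern = '(?<!\\)\"'
print(non_raw_pattern)
try:
    re.compile(non_raw_pattern)
except re.error as exc:
    print('re.error:', exc)
```

This is why the original attempt both raised the parenthesis error and, once the pattern compiled, had no escaped quotes left in the input to tell apart.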
