
Python Regex: Ignore Escaped Character

I'm using Python's regular expression library (re) to split the following string into semicolon-delimited fields.

'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\\; but you should get it";'

Regex: \\s*([^;]+[^\\\\])\\s*;

The pattern above worked fine until I hit a case where one of the phrases contains an escaped semicolon, as in key3 above.

How can I modify this expression to only split on the non-escaped semicolons?

The basic version of this is where you want to ignore any ; that's preceded by a backslash, regardless of anything else. That's relatively simple with a negative lookbehind:

\s*(.+?)(?<!\\);
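To check the "ignore any ; preceded by a backslash" rule against the string from the question, here is a minimal sketch that splits on a negative lookbehind, (?<!\\);, instead of matching whole fields. The helper name split_unescaped is made up for illustration, and the sample is written as a raw string so the backslash stays literal.

```python
import re

# Sample from the question; the raw string keeps the backslash before ';' literal.
s = r'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\; but you should get it";'

def split_unescaped(text):
    """Hypothetical helper: split on ';' unless a backslash comes right before it."""
    parts = re.split(r'(?<!\\);', text)
    # Strip surrounding whitespace and drop the empty piece after the final ';'.
    return [p.strip() for p in parts if p.strip()]

fields = split_unescaped(s)
for field in fields:
    print(field)
```

This treats every backslash directly before a semicolon as an escape, so it does not distinguish \; from \\;.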

What will make this tricky is if you want escaped backslashes in the input to be treated as literals. For example:

"You may want to split here\\;"
"But not here\;"

If that's something you want to take into account, try this:

\s*((?:[^;\\]|\\.)+);

Why so complicated? Because if escaped backslashes are allowed, then you have to account for things like this:

"0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;"

Each pair of backslashes is treated as one literal backslash, so a ; is escaped only if an odd number of backslashes precedes it. After unescaping, the input above would be grouped like this:

#1: '0 slashes'
#2: '2 slashes\'
#3: '5 slashes\\; 6 slashes\\\'
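The grouping above can be reproduced with a short sketch. It assumes the input is held in a raw string and that "unescaping" simply means replacing each backslash-plus-character pair with the character itself; the unescape helper is hypothetical and not part of the original answer.

```python
import re

FIELD = re.compile(r'\s*((?:[^;\\]|\\.)+);')

# Raw string: 0, 2, 5 and 6 backslashes in front of the four semicolons.
sample = r'0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;'

def unescape(field):
    """Hypothetical helper: replace each backslash-escaped character with the character itself."""
    return re.sub(r'\\(.)', r'\1', field)

raw_fields = FIELD.findall(sample)           # fields with escapes still in place
groups = [unescape(f) for f in raw_fields]   # fields as the reader would see them

for number, group in enumerate(groups, start=1):
    print(f'#{number}: {group}')
```

Only the semicolon after five backslashes stays inside a field, because five is the only odd count in the sample.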

Hence the different parts of the pattern:

\s*            #Whitespace
((?:
    [^;\\]     #One character that's not ; or \
  |            #Or...
    \\.        #A backslash followed by any character, even ; or another backslash
)+);           #Repeated one or more times, followed by ;

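Consuming each backslash together with the character after it means backslashes are always read in pairs from the left, so the escaped character is never mistaken for a delimiter or for the start of a new escape, even if it's another backslash.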
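The commented layout above maps directly onto Python's re.VERBOSE flag, which ignores whitespace and # comments outside character classes. The following is a sketch of the same pattern compiled that way; the test string is invented for illustration.

```python
import re

FIELD = re.compile(r'''
    \s*            # optional leading whitespace
    ((?:
        [^;\\]     # one character that is neither a semicolon nor a backslash
      |            # or
        \\.        # a backslash plus the character it escapes
    )+)
    ;              # the unescaped semicolon that ends the field
''', re.VERBOSE)

# Made-up input: an escaped semicolon, an escaped backslash, and a plain field.
text = r'a\;b; c\\; d;'
fields = FIELD.findall(text)
print(fields)
```

Note that . does not match a newline by default, so a backslash at the very end of a line would need re.DOTALL to count as an escape.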

If the values may contain unescaped semicolons and escaped quotes (or escaped anything), I would suggest matching each valid key:"value"; sequence instead of splitting on semicolons. Like so:

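import re

# Raw string, so every backslash below is a literal backslash in the data.
s = r'''
    key1:"this is a test phrase";
    key2:"this is another test phrase";
    key3:"ok this is a gotcha\; but you should get it";
    key4:"String with \" escaped quote";
    key5:"String with ; unescaped semi-colon";
    key6:"String with \\; escaped-escape before semi-colon";
    '''

# Match key:"value"; where the value is a run of characters other than " and \,
# optionally interleaved with backslash-escaped characters.
result = re.findall(r'\w+:"[^"\\]*(?:\\.[^"\\]*)*";', s)
print(result)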

Note that this correctly handles any backslash escape within the double-quoted string, including \" and \\.
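If the keys and values are needed separately, the same pattern can carry two capture groups. This is a sketch only: the parse_pairs helper and the choice to strip backslashes from the values are assumptions for illustration, not part of the answer above.

```python
import re

# Same structure as the findall pattern, with capture groups for key and value.
PAIR = re.compile(r'(\w+):"([^"\\]*(?:\\.[^"\\]*)*)";')

def parse_pairs(text):
    """Hypothetical helper: map each key to its value with backslash escapes removed."""
    return {key: re.sub(r'\\(.)', r'\1', value)
            for key, value in PAIR.findall(text)}

# Raw string, so the backslashes reach the regex unchanged.
s = r'''
    key3:"ok this is a gotcha\; but you should get it";
    key4:"String with \" escaped quote";
    key6:"String with \\; escaped-escape before semi-colon";
    '''
parsed = parse_pairs(s)
for key, value in parsed.items():
    print(key, '=>', value)
```

Unescaping after matching keeps the regex responsible only for finding field boundaries, which makes each step easier to test on its own.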

