Alright, I'm currently using Python's regular expression library to split the following string into groups of semicolon-delimited fields.
'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\\; but you should get it";'
Regex: \\s*([^;]+[^\\\\])\\s*;
I'm currently using the pattern above, which worked fine until I hit a case where one of the phrases contains an escaped semicolon, as in key3 above.
How can I modify this expression to only split on the non-escaped semicolons?
The basic version of this is where you want to ignore any ; that's preceded by a backslash, regardless of anything else. That's relatively simple:
\s*((?:[^;]|\\;)*[^;\\]);
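For comparison, the same basic rule (ignore any ; directly preceded by a backslash) can be sketched with a different technique: a negative lookbehind passed to re.split. This is only a minimal sketch, not the pattern above; the helper name split_unescaped is made up for illustration, and the input is the string from the question written as a raw string.

```python
import re

# Sample input from the question; a raw string keeps the backslash literal.
s = r'key1:"this is a test phrase"; key2:"this is another test phrase"; key3:"ok this is a gotcha\; but you should get it";'

def split_unescaped(text):
    """Split on every ; that is not directly preceded by a backslash.

    Hypothetical helper: it does not treat doubled backslashes specially.
    """
    parts = re.split(r'(?<!\\);', text)
    # Trim surrounding whitespace and drop the empty piece after the last ;
    return [p.strip() for p in parts if p.strip()]

fields = split_unescaped(s)
for field in fields:
    print(field)
```

Like the basic pattern, this treats every backslash before a ; as an escape, so it has the same limitation with escaped backslashes.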
What will make this tricky is if you want escaped backslashes in the input to be treated as literals. For example:
"You may want to split here\\;"
"But not here\;"
If that's something you want to take into account, try this (edited):
\s*((?:[^;\\]|\\.)+);
Why so complicated? Because if escaped backslashes are allowed, then you have to account for things like this:
"0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;"
Each pair of backslashes (\\) would be treated as a single literal backslash. That means a ; would only be escaped if there were an odd number of backslashes before it. So the above input would be grouped like this (shown after resolving the escapes):
#1: '0 slashes'
#2: '2 slashes\'
#3: '5 slashes\\; 6 slashes\\\'
Hence the different parts of the pattern:
\s* #Whitespace
((?:
[^;\\] #One character that's not ; or \
| #Or...
\\. #A backslash followed by any character, even ; or another backslash
)+); #Repeated one or more times, followed by ;
Requiring a character after a backslash ensures that the second character is always escaped properly, even if it's another backslash.
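To check that grouping, here is a minimal sketch that compiles the commented pattern with re.VERBOSE and runs it over the slashes example. The names FIELD, raw and decoded are made up for illustration, and the unescaping step (dropping the backslash in front of each escaped character) is an assumption about how the fields would be consumed afterwards.

```python
import re

# The pattern from above, compiled with re.VERBOSE so that whitespace
# and # comments inside the pattern are ignored.
FIELD = re.compile(r'''
    \s*             # Whitespace
    ((?:
        [^;\\]      # One character that is neither ; nor a backslash
      |             # Or...
        \\.         # A backslash followed by any character
    )+);            # Repeated one or more times, followed by ;
''', re.VERBOSE)

# 0, 2, 5 and 6 backslashes in front of the semicolons.
s = r'0 slashes; 2 slashes\\; 5 slashes\\\\\; 6 slashes\\\\\\;'

# Group 1 of each match is the field exactly as written in the input.
raw = FIELD.findall(s)

# Resolve the escapes: each backslash-plus-character becomes that character.
decoded = [re.sub(r'\\(.)', r'\1', field) for field in raw]

for number, field in enumerate(decoded, start=1):
    print(f'#{number}: {field!r}')
```

The raw matches still contain every backslash; only the decoded list corresponds to the three groups listed above.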
If the string may contain semicolons and escaped quotes (or escaped anything), I would suggest parsing each valid key:"value"; sequence. Like so:
import re
s = r'''
key1:"this is a test phrase";
key2:"this is another test phrase";
key3:"ok this is a gotcha\; but you should get it";
key4:"String with \" escaped quote";
key5:"String with ; unescaped semi-colon";
key6:"String with \\; escaped-escape before semi-colon";
'''
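# Match key:"value"; where the value is a run of characters that are neither
# a quote nor a backslash, interleaved with backslash-escaped characters.
result = re.findall(r'\w+:"[^"\\]*(?:\\.[^"\\]*)*";', s)
print(result)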
Note that this correctly handles any escapes within the double quoted string.
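If the keys and values are needed separately, one possible extension is to add capture groups to the same pattern and build a dict. This is a sketch under assumptions: the names PAIR and parse_pairs are made up for illustration, and the unescaping rule (drop the backslash in front of each escaped character) is not stated in the original.

```python
import re

s = r'''
key3:"ok this is a gotcha\; but you should get it";
key4:"String with \" escaped quote";
key5:"String with ; unescaped semi-colon";
key6:"String with \\; escaped-escape before semi-colon";
'''

# Same pattern as above, with capture groups around the key and the value.
PAIR = re.compile(r'(\w+):"([^"\\]*(?:\\.[^"\\]*)*)";')

def parse_pairs(text):
    """Return {key: value}, with backslash escapes in each value resolved."""
    return {
        m.group(1): re.sub(r'\\(.)', r'\1', m.group(2))
        for m in PAIR.finditer(text)
    }

pairs = parse_pairs(s)
for key, value in pairs.items():
    print(key, '=>', value)
```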