
Negative lookbehind to not match escaped characters, fails on escaped backslash

Say I want to split a string at any separator character, but not at escaped ones. I can usually use a negative lookbehind and string.split(regex).

For example:

const regex = /(?<!\\)\,/;
'abc,def'.split(regex); // ['abc', 'def']
'abc\\,def'.split(regex); // ['abc\\,def'] (the escaped comma is not split)

This splits at the , in abc,def but not in abc\\,def. This is fine!

But if the separator character is itself a backslash, the negative lookbehind does not seem to work as expected:

const regex = /(?<!\\)\\/;
'abc\\def'.split(regex); // ['abc', 'def']
'abc\\\\def'.split(regex); // ['abc', '\\def'], although no split was wanted here

splits both at the first \\ in abc\\def AND in abc\\\\def .

Naively I would have expected that the negative lookbehind will not match a \\ preceded by a \\ .

See: https://regex101.com/r/ozkZR1/1

How can I achieve a string.split(regex) at any non-escaped character that doesn't fall apart with special characters like a backslash or a line-break (one should be able to escape them too)?

Naive solution

In the case where your separator is the same as your escape character, you can use a negative lookahead after the separator, on top of the negative lookbehind:

/(?<!\\)\\(?!\\)/
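As a quick check of the pattern above, the following sketch runs it against the two inputs from the question. The variable names are only illustrative, and the strings are JavaScript literals, so each \\ in the source is a single backslash in the actual string.

```javascript
// Separator and escape character are both a backslash.
const naive = /(?<!\\)\\(?!\\)/;

// abc\def: the lone backslash is treated as a separator.
const single = 'abc\\def'.split(naive);    // ['abc', 'def']

// abc\\def: the doubled backslash is left alone, so nothing is split.
const doubled = 'abc\\\\def'.split(naive); // ['abc\\\\def']
```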

Caveats

There are a lot of problems with this approach. I would not recommend solving it with a regular expression, and I would especially not recommend allowing the separator and escape characters to be the same.

  • With , as your separator, a literal backslash at the end of a field will fool the regex, e.g. abc\\\\,def will not get split.
  • With \\ as your separator and escape character, you can't have empty fields: abc,,def would be three fields, including an empty one, but abc\\\\def would be just one field.
  • What about abc\\\\\\def ? Does that have a literal \\ at the end of the first field or at the beginning of the second? Either way, my regex would not split on it.
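The three caveats can be reproduced with a short sketch. The variable names are made up for illustration, and the inputs are written as JavaScript literals (each \\ is one backslash in the actual string).

```javascript
const commaRegex = /(?<!\\),/;             // separator ",", escape "\"
const backslashRegex = /(?<!\\)\\(?!\\)/;  // separator and escape are both "\"

// abc\\,def: an escaped backslash followed by a real separator is not split.
const trailingEscape = 'abc\\\\,def'.split(commaRegex);    // ['abc\\\\,def']

// abc,,def has an empty middle field, but abc\\def stays a single field.
const emptyComma = 'abc,,def'.split(commaRegex);           // ['abc', '', 'def']
const emptyBackslash = 'abc\\\\def'.split(backslashRegex); // ['abc\\\\def']

// abc\\\def: three backslashes in a row are never split.
const ambiguous = 'abc\\\\\\def'.split(backslashRegex);    // ['abc\\\\\\def']
```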

If you are willing to ban the use of the escape character literally at the boundaries, and not allow empty fields, my regex would work when the escape and separator are the same, and yours in the other case.

Otherwise, I would recommend a different solution where you parse the string from left to right, interpreting the escapes as you meet them, and splitting when an unescaped separator is seen, so that abc\\\\,def would be split correctly.
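A minimal sketch of such a left-to-right parser might look like the following. The function name splitUnescaped and its parameters are hypothetical, not something from the question. It assumes that the separator and the escape are single characters, and that an escape character only counts as an escape when it is followed by the separator or by another escape character.

```javascript
// Hypothetical helper: split `input` at every unescaped `separator`.
// Assumes `separator` and `escape` are single characters.
function splitUnescaped(input, separator, escape = '\\') {
  const fields = [];
  let current = '';
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    const next = input[i + 1];
    if (ch === escape && (next === separator || next === escape)) {
      // Escape sequence: keep the escaped character literally, consume both.
      current += next;
      i++;
    } else if (ch === separator) {
      // Unescaped separator: close the current field.
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

splitUnescaped('abc\\\\,def', ','); // ['abc\\', 'def']
splitUnescaped('abc,,def', ',');    // ['abc', '', 'def']
splitUnescaped('abc\\def', '\\');   // ['abc', 'def']
splitUnescaped('abc\\\\def', '\\'); // ['abc\\def']
```

Unlike split(regex), this sketch returns the fields with the escapes already interpreted, so abc\\,def yields abc\ and def. Because it reads strictly left to right, three backslashes in a row with a backslash separator resolve as an escaped backslash followed by a separator. A line break can be used as the separator in the same way.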

The solution was to reverse the operation:

Instead of looking for the delimiters, I could look for the delimited character sequences. So in the case of a , delimiter I would look for ((\\\\,)|[^,])([^,]*?(\\\\,)?)* : either an escaped comma or a non-comma character, followed by any number (possibly zero) of groups, each made up of a reluctant run of non-commas (reluctant, so it doesn't swallow the \\ of an escape) followed by an optional escaped comma.

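let separator = ','; // get from sanitized input
// A backslash separator must itself be escaped inside the pattern.
separator = separator === '\\' ? '\\\\' : separator;
// Match the fields themselves (escaped separators included) rather than the separators.
const groups = new RegExp(`((\\\\${separator})|[^${separator}])([^${separator}]*?(\\\\${separator})?)+`, 'g');
const line = 'abc\\,def,ghi'; // sample input (an assumption): abc\,def,ghi
let columns = line.match(groups); // ['abc\\,def', 'ghi']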

This works for , as well as for \\ as separators and will not split on \\, and \\\\ respectively.

The hardest part of that expression was to get all the escapes right.
