Why do LF and CRLF behave differently with /^\s*$/gm regex?

Question

I've been seeing this issue on Windows. When I try to clear the whitespace on each blank line on Unix:

const input =
`===

HELLO

WOLRD

===`
console.log(input.replace(/^\s+$/gm, ''))

This produces what I expect:

===

HELLO

WOLRD

===

i.e. if there were spaces on blank lines, they'd get removed. On the other hand, on Windows, the regex clears the WHOLE blank line. To illustrate:

const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')
console.log(input.replace(/^\s+$/gm, ''))

(Template literals always produce only \n in JS, so I had to replace it with \r\n to emulate Windows; the ? after \r is just to be sure, for those who don't believe it.) The result:

===
HELLO
WOLRD
===

The whole line is gone! But my regex has ^ and $ with the m flag set, so it's kind of /^-to-$/m. What's the difference between \n and \r\n, then, that makes it produce different results?

When I do some logging:

console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))

With \r\n I'm seeing

matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===

and with \n only

matched
matched
matched
===

HELLO

WOLRD

===

Answer 1

TL;DR: a pattern that matches whitespace and line breaks will also match characters that are part of a \r\n sequence, if you let it.

First of all, let's actually examine what characters are there and aren't there when you do a replacement. Starting with a string that only uses line feeds:

const inputLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\n");

console.log('------------ INPUT ')
console.log(inputLF);
console.log('------------')

debugPrint(inputLF, 2);
debugPrint(inputLF, 3);
debugPrint(inputLF, 4);
debugPrint(inputLF, 5);

const replaceLF = inputLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceLF);
console.log('------------')

debugPrint(replaceLF, 2);
debugPrint(replaceLF, 3);
debugPrint(replaceLF, 4);
debugPrint(replaceLF, 5);

console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);

console.log('------------')
console.log('inputLF === replaceLF :', inputLF === replaceLF)

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex} charcode: ${str.charCodeAt(charIndex)} character: ${str.charAt(charIndex)}`);
}

Each line ends with char code 10, the Line Feed (LF) character, which is written in a string literal as \n. Before and after the replacement the two strings are the same: they not only look the same but actually equal each other, so the replacement did nothing.

Now let's examine the other case:

const inputCRLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\r\n")

console.log('------------ INPUT ')
console.log(inputCRLF);
console.log('------------')

debugPrint(inputCRLF, 2);
debugPrint(inputCRLF, 3);
debugPrint(inputCRLF, 4);
debugPrint(inputCRLF, 5);
debugPrint(inputCRLF, 6);
debugPrint(inputCRLF, 7);

const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceCRLF);
console.log('------------')

debugPrint(replaceCRLF, 2);
debugPrint(replaceCRLF, 3);
debugPrint(replaceCRLF, 4);
debugPrint(replaceCRLF, 5);

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex} charcode: ${str.charCodeAt(charIndex)} character: ${str.charAt(charIndex)}`);
}

This time each line ends with char code 13, the Carriage Return (CR) character, which is written in a string literal as \r, and then the LF follows. After the replacement, instead of the sequence =\r\n\r\nH there is now just =\r\nH. Let's look at why.

Here is what MDN says about the meta character ^ :

Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

And here is what MDN says about the meta character $ :

Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.

So they match after and before a line break character. By that, MDN means either the LF or the CR, each on its own. This can be seen if we test strings that contain different line breaks:

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStart = /^\s/m;
const regexEnd = /\s$/m;

console.log(regexStart.exec(stringLF));   // null
console.log(regexStart.exec(stringCRLF)); // matches the \n at index 6

console.log(regexEnd.exec(stringLF));     // null
console.log(regexEnd.exec(stringCRLF));   // matches the \r at index 5

If we try to match whitespace next to a line break, nothing matches when the line break is a lone LF, but with CRLF ^\s matches the LF and \s$ matches the CR. So, in that case the patterns match here:

"hello\r\nworld"
        ^^ what `^\s` matches

"hello\r\nworld"
      ^^ what `\s$` matches
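To see every position where the anchors can match, a small sketch like the following may help. The helper name positions is made up for illustration; it lists the indexes of the zero-width matches of ^ and $ on their own. With CRLF, index 2 (the gap between the \r and the \n) should appear in both lists, and that gap is what lets \s pick up half of the sequence.

```javascript
// Hypothetical helper: collect every index at which a zero-width pattern matches.
function positions(re, str) {
  return [...str.matchAll(re)].map((m) => m.index);
}

const lf = "a\nb";     // a=0, \n=1, b=2
const crlf = "a\r\nb"; // a=0, \r=1, \n=2, b=3

const lineStartsLF = positions(/^/gm, lf);
const lineEndsLF = positions(/$/gm, lf);
const lineStartsCRLF = positions(/^/gm, crlf);
const lineEndsCRLF = positions(/$/gm, crlf);

console.log("LF   ^:", lineStartsLF, "$:", lineEndsLF);
console.log("CRLF ^:", lineStartsCRLF, "$:", lineEndsCRLF);
```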

So both ^ and $ recognise either character of the CRLF sequence as a line break. This makes a difference when you do a search and replace. Since your regex specifies ^\s+$, it does match when a line consists entirely of \r\n, but for a reason that is not obvious:

const re = /^\s+$/m;

const stringLF = "hello\n\nworld";
const stringCRLF = "hello\r\n\r\nworld";

console.log(re.exec(stringLF));   // null
console.log(re.exec(stringCRLF)); // matches "\n\r" at index 6

So, the regex doesn't match a \r\n but rather the \n\r (two whitespace characters) that sits between two other line break characters. That's because + is greedy and will consume as much of the character sequence as it can get away with. Here is what the regex engine will try, somewhat simplified for brevity. The brackets mark the span examined so far, including the line break characters that satisfy ^ and $:

input = "hello\r\n\r\nworld"
regex = /^\s+$/m

Step 1
hello[\r]\n\r\nworld
    matches `^`, symbol satisfied -> continue with next symbol in regex

Step 2
hello[\r\n]\r\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 3
hello[\r\n\r]\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 4
hello[\r\n\r\n]world
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 5
hello[\r\n\r\nw]orld
    does not match `\s` -> backtrack

Step 6
hello[\r\n\r\n]world
    matches `^\s+`, quantifier satisfied -> continue to next symbol in regex

Step 7
hello[\r\n\r\nw]orld
    does not match `$` in `^\s+$` -> backtrack

Step 8
hello[\r\n\r\n]world
    matches `^\s+$`, last symbol satisfied -> finish
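The outcome of that trace can be checked directly. The following is only a sketch, and listMatches is a made-up helper name; it lists each match with its index for both the + and the * variant of the pattern. The * variant additionally finds an empty match right after the \n\r, which would explain why the logging in the question reports twice as many matches for CRLF input.

```javascript
// Hypothetical helper: list every match together with its start index.
function listMatches(re, str) {
  return [...str.matchAll(re)].map((m) => ({ index: m.index, text: m[0] }));
}

// Make invisible characters readable when printing.
const show = (matches) =>
  matches.map((m) => `${JSON.stringify(m.text)} at ${m.index}`).join(", ");

const crlf = "hello\r\n\r\nworld"; // \r=5, \n=6, \r=7, \n=8

const plusMatches = listMatches(/^\s+$/gm, crlf);
const starMatches = listMatches(/^\s*$/gm, crlf);
const replaced = crlf.replace(/^\s+$/gm, "");

console.log("+ :", show(plusMatches));
console.log("* :", show(starMatches));
// The first \r and the last \n are left behind and now sit next to each other.
console.log(JSON.stringify(replaced));
```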

Lastly, there is something slightly hidden here: it matters that you're matching whitespace. \s behaves differently from most other symbols in that it explicitly matches line break characters, whereas . will not, as MDN says:

Matches any single character except line terminators
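A quick way to check that difference, as a sketch: the loop below tests \s and . against each of the four characters that ECMAScript treats as line terminators (LF, CR, U+2028 and U+2029). The variable names are only illustrative.

```javascript
// The four ECMAScript line terminators: LF, CR, LINE SEPARATOR, PARAGRAPH SEPARATOR.
const lineTerminators = ["\n", "\r", "\u2028", "\u2029"];

const results = lineTerminators.map((ch) => ({
  // Print the code point, since the characters themselves are invisible.
  codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  matchedByWhitespace: /\s/.test(ch), // \s includes line terminators
  matchedByDot: /./.test(ch),         // . skips them (unless the s flag is set)
}));

for (const r of results) {
  console.log(r.codePoint, "\\s:", r.matchedByWhitespace, ".:", r.matchedByDot);
}
```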

So, if you specify \s$, this will match the CR in \r\n because the regex engine is forced to find a match for both \s and $, and so it finds the \r before the \n. However, this will not happen for many other patterns, since $ will usually be satisfied right before the CR (or at the end of the string).

The same goes for ^\s: it explicitly looks for a whitespace character after a line break, which is satisfied by the LF in CRLF. If you're not asking for that, it will happily match after the LF instead:

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStartAll = /^./mg;
const regexEndAll = /.$/gm;

console.log(stringLF.match(regexStartAll));   // ["h", "w"]
console.log(stringCRLF.match(regexStartAll)); // ["h", "w"]

console.log(stringLF.match(regexEndAll));     // ["o", "d"]
console.log(stringCRLF.match(regexEndAll));   // ["o", "d"]

So, all of this means that ^\s+$ has some unintuitive behaviour, yet it is perfectly coherent once you understand that the regex engine matches exactly what you tell it to.
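If the goal is only to clear spaces and tabs from otherwise blank lines, one possible way around this is to tell the engine not to treat CR or LF as whitespace for this pattern. The sketch below assumes the only line breaks in the text are CR and LF; clearBlankLines is a made-up name, not something from the question.

```javascript
// Hypothetical helper: empty out whitespace-only lines without touching line breaks.
function clearBlankLines(text) {
  // [^\S\r\n] reads as "a whitespace character that is neither CR nor LF".
  return text.replace(/^[^\S\r\n]+$/gm, "");
}

const crlfInput = "===\r\n  \r\nHELLO\r\n\t\r\n===";
const cleaned = clearBlankLines(crlfInput);

// Lines that are already empty are left exactly as they were.
const untouched = clearBlankLines("a\r\n\r\nb");

console.log(JSON.stringify(cleaned));
console.log(JSON.stringify(untouched));
```

Because the character class can no longer consume \r or \n, the \n\r in the middle of two CRLF pairs is never a candidate match, so the same pattern should behave the same way on LF and CRLF input.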

solution1
3 ACCPTED 2020-03-17 21:17:34
