
Why do LF and CRLF behave differently with the /^\s*$/gm regex?

I've been seeing this issue on Windows. When I try to clear any whitespace on each line on Unix:

const input =
`===

HELLO

WOLRD

===`
console.log(input.replace(/^\s+$/gm, ''))

This produces what I expect:

===

HELLO

WOLRD

===

i.e. if there were spaces on blank lines, they'd get removed. On the other hand, on Windows, the regex clears the WHOLE line. To illustrate:

const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')
console.log(input.replace(/^\s+$/gm, ''))

(Template literals in JS always produce only \n, so I had to replace it with \r\n to emulate Windows; the ? after \r is just to be sure, for those who don't believe it.) The result:

===
HELLO
WOLRD
===

The whole line is gone! But my regex has ^ and $ with the m flag set, so it's kind of /^-to-$/m. What's the difference between \n and \r\n then that makes it produce different results?

When I do some logging:

console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))

With \r\n I'm seeing

matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===

and with \n only

matched
matched
matched
===

HELLO

WOLRD

===

TL;DR: a pattern that matches whitespace, including line breaks, will also match characters that are part of a \r\n sequence, if you let it.

First of all, let's actually examine what characters are there and aren't there when you do a replacement. Starting with a string that only uses line feeds:

const inputLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\n");

console.log('------------ INPUT ')
console.log(inputLF);
console.log('------------')
debugPrint(inputLF, 2);
debugPrint(inputLF, 3);
debugPrint(inputLF, 4);
debugPrint(inputLF, 5);

const replaceLF = inputLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceLF);
console.log('------------')
debugPrint(replaceLF, 2);
debugPrint(replaceLF, 3);
debugPrint(replaceLF, 4);
debugPrint(replaceLF, 5);

console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);

console.log('------------')
console.log('inputLF === replaceLF :', inputLF === replaceLF)

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex} charcode: ${str.charCodeAt(charIndex)} character: ${str.charAt(charIndex)}`);
}

Each line ends with char code 10, which is the Line Feed (LF) character, represented in a string literal as \n. Before and after the replacement, the two strings are the same: they not only look the same but are actually equal, so the replacement did nothing.

Now let's examine the other case:

const inputCRLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\r\n")

console.log('------------ INPUT ')
console.log(inputCRLF);
console.log('------------')
debugPrint(inputCRLF, 2);
debugPrint(inputCRLF, 3);
debugPrint(inputCRLF, 4);
debugPrint(inputCRLF, 5);
debugPrint(inputCRLF, 6);
debugPrint(inputCRLF, 7);

const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceCRLF);
console.log('------------')
debugPrint(replaceCRLF, 2);
debugPrint(replaceCRLF, 3);
debugPrint(replaceCRLF, 4);
debugPrint(replaceCRLF, 5);

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex} charcode: ${str.charCodeAt(charIndex)} character: ${str.charAt(charIndex)}`);
}

This time each line ends with char code 13, the Carriage Return (CR) character, represented in a string literal as \r, followed by the LF. After the replacement, instead of the sequence =\r\n\r\nH there is now just =\r\nH. Let's look at why.
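Since \r and \n are invisible when printed, one way to compare the two cases side by side is to print the strings through JSON.stringify, which shows the escapes. This is only a minimal sketch using a shortened input; `toCRLF` and `stripBlank` are hypothetical helper names that do not appear in the original post.

```javascript
// Sketch: make line-break characters visible with JSON.stringify.
// `toCRLF` and `stripBlank` are hypothetical helpers, named here for illustration only.
const base = "===\n\nHELLO";

const toCRLF = (s) => s.replace(/\r?\n/g, "\r\n");
const stripBlank = (s) => s.replace(/^\s+$/gm, "");

const lfResult = stripBlank(base);            // LF input
const crlfResult = stripBlank(toCRLF(base));  // CRLF input

console.log(JSON.stringify(lfResult));   // the LF string comes back unchanged
console.log(JSON.stringify(crlfResult)); // the CRLF string loses one \n\r pair
```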

Here is what MDN says about the meta character ^ :

Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

And here is what MDN says about the meta character $ :

Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.

So they match after and before a line break character. By that, MDN means either the LF or the CR. This can be seen if we test a string that contains different line breaks:

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStart = /^\s/m;
const regexEnd = /\s$/m;

console.log(regexStart.exec(stringLF));   // null
console.log(regexStart.exec(stringCRLF)); // matches "\n" at index 6

console.log(regexEnd.exec(stringLF));     // null
console.log(regexEnd.exec(stringCRLF));   // matches "\r" at index 5

If we try to match whitespace next to a line break, nothing matches when the line break is a lone LF, but with CRLF ^\s matches the LF and \s$ matches the CR. So, in that case ^ and $ would match here:

"hello\r\nworld"
        ^^ what `^\s` matches

"hello\r\nworld"
      ^^ what `\s$` matches
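To check which separators the multiline anchors treat as a line break, a small probe like the following can help. It is a sketch, not part of the original answer: `startsLineAfter` and `endsLineBefore` are hypothetical helper names, and the U+2028 and U+2029 entries are added here because the ECMAScript specification also lists them as line terminators.

```javascript
// Sketch: probe which separators `^` and `$` accept as a line break with the m flag.
// `startsLineAfter` and `endsLineBefore` are hypothetical helpers for illustration.
const startsLineAfter = (sep) => /^x/m.test("a" + sep + "x"); // can `^` match right after sep?
const endsLineBefore = (sep) => /a$/m.test("a" + sep + "x");  // can `$` match right before sep?

const candidates = {
  LF: "\n",
  CR: "\r",
  "U+2028": "\u2028",
  "U+2029": "\u2029",
  TAB: "\t",
  SPACE: " ",
};

for (const [name, sep] of Object.entries(candidates)) {
  console.log(name, startsLineAfter(sep), endsLineBefore(sep));
}
```

The first four entries report true for both anchors, while tab and space report false, since they are whitespace but not line terminators.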

So both ^ and $ recognise either character of the CRLF sequence as a line break. This makes a difference when you do a search and replace. Since your regex specifies ^\s+$, a line that consists entirely of \r\n does produce a match, but for a reason that is not obvious:

const re = /^\s+$/m;

const stringLF = "hello\n\nworld";
const stringCRLF = "hello\r\n\r\nworld";

console.log(re.exec(stringLF));   // null
console.log(re.exec(stringCRLF)); // matches "\n\r" at index 6

So the regex doesn't match a \r\n pair but rather \n\r (two whitespace characters) sitting between two other line break characters. That's because + is greedy and will consume as much of the character sequence as it can get away with. Here is what the regex engine will try, somewhat simplified for brevity:

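input = "hello\r\n\r\nworld"
regex = /^\s+$/m
(brackets mark the text examined so far; the first \r is only the line break that lets `^` match and is not part of the match)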

Step 1
hello[\r]\n\r\nworld
    matches `^`, symbol satisfied -> continue with next symbol in regex

Step 2
hello[\r\n]\r\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 3
hello[\r\n\r]\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 4
hello[\r\n\r\n]world
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 5
hello[\r\n\r\nw]orld
    does not match `\s` -> backtrack

Step 6
hello[\r\n\r\n]world
    matches `^\s+`, quantifier satisfied -> continue to next symbol in regex

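Step 7
hello[\r\n\r\nw]orld
    does not match `$` in `^\s+$` (`w` is not a line break) -> backtrack, `\s+` gives up the last `\n`

Step 8
hello[\r\n\r\n]world
    `\s+` now holds `\n\r` and `$` matches just before the final `\n` -> `^\s+$` matched, last symbol satisfied -> finish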
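The outcome of the trace above can be confirmed by inspecting the match object directly. This is a small sketch; the variable names are chosen here for illustration.

```javascript
// Sketch: inspect what /^\s+$/m actually captures in a CRLF string.
const re = /^\s+$/m;
const crlf = "hello\r\n\r\nworld";

const m = re.exec(crlf);
console.log(m.index);              // 6, i.e. right after the first \r
console.log(JSON.stringify(m[0])); // "\n\r", the middle two characters

// Removing the match leaves the outer \r and \n, which now form one CRLF pair.
const after = crlf.replace(re, "");
console.log(JSON.stringify(after)); // "hello\r\nworld"
```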

Lastly, there is something slightly hidden here: it matters that you're matching whitespace. \s behaves differently from most other symbols in that it explicitly matches line break characters, whereas . does not:

Matches any single character except line terminators

So, if you specify \s$, this will match the CR in \r\n because the regex engine is forced to find a match for both \s and $, and therefore it finds the \r before the \n. However, this will not happen for many other patterns, since $ will usually be satisfied when it's before the CR (or at the end of the string).

The same goes for ^\s: it will explicitly look for a whitespace character after a line break, which is satisfied by the LF in CRLF. However, if you're not looking for whitespace, then ^ will happily match after the LF:

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStartAll = /^./mg;
const regexEndAll = /.$/gm;

console.log(stringLF.match(regexStartAll));   // ["h", "w"]
console.log(stringCRLF.match(regexStartAll)); // ["h", "w"]

console.log(stringLF.match(regexEndAll));     // ["o", "d"]
console.log(stringCRLF.match(regexEndAll));   // ["o", "d"]

So, all of this means that ^\s+$ has some unintuitive behaviour, yet it is perfectly coherent once you understand that the regex engine matches exactly what you tell it to.
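If the goal is only to clear spaces and tabs from otherwise blank lines, one possible approach (an assumption on my part, not something the answer above prescribes) is to exclude CR and LF from the whitespace class so the pattern can never consume a line break. In this sketch, `clearBlankLines` is a hypothetical helper name.

```javascript
// Sketch: strip whitespace from blank lines without touching line breaks.
// `clearBlankLines` is a hypothetical helper, not from the original post.
// [^\S\r\n] reads as "whitespace, but neither CR nor LF".
const clearBlankLines = (s) => s.replace(/^[^\S\r\n]+$/gm, "");

const lf = "===\n  \nHELLO\n\t\n===";
const crlf = lf.replace(/\n/g, "\r\n");

console.log(JSON.stringify(clearBlankLines(lf)));
console.log(JSON.stringify(clearBlankLines(crlf)));
```

With this class the blank lines themselves survive in both the LF and the CRLF variant, and only the spaces or tabs on them are removed. Note that the class still includes the rarer U+2028 and U+2029 line terminators, so input containing those would need them excluded as well.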
