Why do LF and CRLF behave differently with the /^\s*$/gm regex?

Question

I've been seeing this issue on Windows. When I try to clear any whitespace on each line on Unix:

const input =
`===

HELLO

WOLRD

===`
console.log(input.replace(/^\s+$/gm, ''))

This produces what I expect:

===

HELLO

WOLRD

===

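That is, if there were whitespace on the empty lines, it would be removed. On Windows, on the other hand, the regex removes the empty lines altogether. To demonstrate: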

const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')
console.log(input.replace(/^\s+$/gm, ''))

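(Template literals in JS always produce just \n, so I had to replace it with \r\n to emulate Windows; the ? after \r is only there to reassure those who don't believe that.) The result: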

===
HELLO
WOLRD
===

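The whole line is gone! But my regex has ^ and $ with the m flag set, so it is kind of like /^-to-$/m. What is the difference between \n and \r\n that makes it produce different results?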

When I add some logging

console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))

with \r\n I see

matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===

and with only \n

matched
matched
matched
===

HELLO

WOLRD

===

Answer 1

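TL;DR: a pattern that includes whitespace and line breaks will also match the characters that make up a \r\n sequence, if you let it.

First, let's actually check which characters are present and which are not when the replacement runs. Start with a string that uses only line feeds: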

const inputLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\n");

console.log('------------ INPUT ')
console.log(inputLF);
console.log('------------')

// Inspect the characters around the first blank line.
debugPrint(inputLF, 2);
debugPrint(inputLF, 3);
debugPrint(inputLF, 4);
debugPrint(inputLF, 5);

const replaceLF = inputLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceLF);
console.log('------------')

debugPrint(replaceLF, 2);
debugPrint(replaceLF, 3);
debugPrint(replaceLF, 4);
debugPrint(replaceLF, 5);

console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);

console.log('------------')
console.log('inputLF === replaceLF :', inputLF === replaceLF)

// Print the index, char code and character at a given position.
function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex} charcode: ${str.charCodeAt(charIndex)} character: ${str.charAt(charIndex)}`);
}

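Each line ends with char code 10, which is the line feed (LF) character, written as \n in a string literal. The strings before and after the replacement are the same: they do not just look alike, they are actually equal to each other, so the replacement did nothing.

Now let's examine the other case: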

const inputCRLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\r\n")

console.log('------------ INPUT ')
console.log(inputCRLF);
console.log('------------')

// Inspect the characters around the first blank line.
debugPrint(inputCRLF, 2);
debugPrint(inputCRLF, 3);
debugPrint(inputCRLF, 4);
debugPrint(inputCRLF, 5);
debugPrint(inputCRLF, 6);
debugPrint(inputCRLF, 7);

const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceCRLF);
console.log('------------')

debugPrint(replaceCRLF, 2);
debugPrint(replaceCRLF, 3);
debugPrint(replaceCRLF, 4);
debugPrint(replaceCRLF, 5);

// Print the index, char code and character at a given position.
function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex} charcode: ${str.charCodeAt(charIndex)} character: ${str.charAt(charIndex)}`);
}

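This time each line ends with char code 13, which is the carriage return (CR) character, written as \r in a string literal, followed by the LF. After the replacement, instead of the sequence =\r\n\r\nH there is only =\r\nH. Let's see why.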
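One way to see exactly which characters the replacement removed is to list every match together with its index. This is only a sketch: `crlfInput` and `removed` are names made up here, and the input is rebuilt with `join` so that the line endings are explicit.

```javascript
// Build the CRLF input from the question with explicit line endings.
const crlfInput = ["===", "", "HELLO", "", "WOLRD", "", "==="].join("\r\n");

// Collect every match of the question's regex with its start index.
// JSON.stringify makes the otherwise invisible CR and LF visible.
const removed = [...crlfInput.matchAll(/^\s+$/gm)].map(
  (m) => `${m.index}: ${JSON.stringify(m[0])}`
);

console.log(removed);
```

Each match starts at the LF of one CRLF pair and ends at the CR of the next, which is why exactly one line break disappears per blank line.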

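Here is what MDN says about the metacharacter ^:

Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

And here is what MDN says about the metacharacter $:

Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.

So they match right after and right before a line break character, respectively. By that, MDN means LF or CR (U+2028 and U+2029 also count, but they are not relevant here). We can see this if we test strings that contain different line breaks: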

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStart = /^\s/m;
const regexEnd = /\s$/m;

console.log(regexStart.exec(stringLF));
console.log(regexStart.exec(stringCRLF));

console.log(regexEnd.exec(stringLF));
console.log(regexEnd.exec(stringCRLF));

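If we try to match whitespace next to a line break, nothing matches when the break is a lone LF, but with CRLF each pattern matches one half of the pair: ^\s matches the LF, because ^ is satisfied right after the CR, and \s$ matches the CR, because $ is satisfied right before the LF. So in this case the matches are here: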

"hello\r\nworld"
        ^^ what `^\s` matches

"hello\r\nworld"
      ^^ what `\s$` matches
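The same thing can be checked with zero-width matches. The sketch below (the helper `positions` is hypothetical, not part of any library) lists every index at which `^` and `$` succeed under the `m` flag.

```javascript
// Indexes at which a global regex matches, including zero-width matches.
const positions = (text, regex) => [...text.matchAll(regex)].map((m) => m.index);

const lf = "a\nb";
const crlf = "a\r\nb";

// Line starts and line ends as the regex engine sees them.
console.log(positions(lf, /^/gm), positions(lf, /$/gm));
console.log(positions(crlf, /^/gm), positions(crlf, /$/gm));
```

In the CRLF string both `^` and `$` succeed at index 2, between the CR and the LF, so the engine effectively sees an empty line inside every CRLF pair.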

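So both ^ and $ recognise either character of a CRLF sequence as a line boundary. This makes a difference when you do a search and replace. Since your regex is ^\s+$, a line that consists of nothing but \r\n does produce a match, but for a non-obvious reason: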

const re = /^\s+$/m;

const stringLF = "hello\n\nworld";
const stringCRLF = "hello\r\n\r\nworld";

console.log(re.exec(stringLF));
console.log(re.exec(stringCRLF));

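So the regex did not match a \r\n; it matched the \n\r (two whitespace characters) sitting between the first CR and the last LF. The first place where ^ can succeed is right after the first CR, + is greedy and consumes as many characters as it can, and backtracking then stops at the first point where $ is satisfied, which is right before the final LF. Here is what the regex engine tries, somewhat simplified for brevity: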

input = "hello\r\n\r\nworld"
regex = /^\s+$/m

Step 1
hello\r[]\n\r\nworld
    `^` matches here, right after the CR -> continue with next symbol in regex

Step 2
hello\r[\n]\r\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 3
hello\r[\n\r]\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 4
hello\r[\n\r\n]world
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 5
hello\r[\n\r\nw]orld
    `w` does not match `\s` -> backtrack

Step 6
hello\r[\n\r\n]world
    matches `^\s+`, quantifier satisfied -> continue to next symbol in regex

Step 7
hello\r[\n\r\n]world
    `$` fails: the next character is `w`, not a line break or end of input -> backtrack

Step 8
hello\r[\n\r]\nworld
    matches `^\s+$`: `$` is satisfied right before the LF, last symbol satisfied -> finish
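The same mechanics appear to account for the six `matched` lines that the question logs with `^\s*$`. As a small sketch (`text` and `found` are illustrative names), a single blank line between CRLF breaks yields two matches: first the `\n\r` pair, then an empty match between the remaining CR and LF.

```javascript
// One blank line, delimited by CRLF line breaks.
const text = "A\r\n\r\nB";

// Record the index and content of every match of the question's second regex.
const found = [...text.matchAll(/^\s*$/gm)].map((m) => [m.index, m[0]]);

console.log(JSON.stringify(found));
```

With LF-only input the `\n\r` match is impossible, so each blank line produces only the empty match, which lines up with the three `matched` lines in that case.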

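Finally, there is something slightly hidden here: it matters that you are matching whitespace. \s behaves differently from most other symbols because it explicitly matches line break characters, whereas . does not:

Matches any single character except line terminators

So if you specify \s$, it will match the CR in \r\n, because the regex engine is forced to find a match for both \s and $, and it finds the \r before the \n. This does not happen with many other patterns, because $ is usually satisfied right before the CR (or at the end of the string).

The same goes for ^\s: it explicitly looks for a whitespace character after a line break, and the LF in CRLF satisfies that. If you are not looking for whitespace, the match happily starts after the LF: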

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStartAll = /^./mg;
const regexEndAll = /.$/gm;

console.log(stringLF.match(regexStartAll));
console.log(stringCRLF.match(regexStartAll));

console.log(stringLF.match(regexEndAll));
console.log(stringCRLF.match(regexEndAll));

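All of this means that ^\s+$ has some unintuitive behaviour, but it is completely consistent once you understand that the regex engine matches exactly what you tell it to.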
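If the goal is what the question describes (stripping spaces and tabs from otherwise blank lines while leaving the line breaks alone), one possible approach is to keep CR and LF out of the character class. This is a sketch under that assumption, not the only fix; `blankLinePadding`, `lfText` and `crlfText` are names invented for the example.

```javascript
// "Not (non-whitespace, CR or LF)": any whitespace except CR and LF.
const blankLinePadding = /^[^\S\r\n]+$/gm;

// Two blank lines carrying padding: one with spaces, one with a tab.
const lfText = "===\n  \nHELLO\n\t\n===";
const crlfText = lfText.replace(/\n/g, "\r\n");

// The padding is removed; every line break survives in both variants.
console.log(JSON.stringify(lfText.replace(blankLinePadding, "")));
console.log(JSON.stringify(crlfText.replace(blankLinePadding, "")));
```

Because the class can never consume a CR or LF, the match cannot straddle two line breaks the way `\s+` does, so LF and CRLF input behave the same.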
