
Why do LF and CRLF behave differently with the /^\s*$/gm regex?

I've been seeing this issue on Windows. When I try to clear any whitespace on each line on Unix:

const input =
`===

HELLO

WOLRD

===`
console.log(input.replace(/^\s+$/gm, ''))

This produces what I expect:

===

HELLO

WOLRD

===

i.e. if there were spaces on blank lines, they'd get removed. On the other hand, on Windows, the regex clears the WHOLE string. To illustrate:

const input =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, '\r\n')
console.log(input.replace(/^\s+$/gm, ''))

(Template literals always produce only \n in JS, so I had to replace it with \r\n to emulate Windows; the ? after \r is there just to be sure, for those who don't believe it.) The result:

===
HELLO
WOLRD
===

The whole line is gone! But my regex has ^ and $ with the m flag set, so it's kind of /^-to-$/m. What's the difference between \n and \r\n, then, that makes it produce different results?

When I do some logging:

console.log(input.replace(/^\s*$/gm, (m) => {
  console.log('matched')
  return ''
}))

With \r\n I'm seeing:

matched
matched
matched
matched
matched
matched
===
HELLO
WOLRD
===

and with \n only:

matched
matched
matched
===

HELLO

WOLRD

===

TL;DR: a pattern that includes whitespace and line break anchors will also match characters that are part of a \r\n sequence, if you let it.

First of all, let's actually examine which characters are and aren't there when you do a replacement. Starting with a string that only uses line feeds:

const inputLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\n");

console.log('------------ INPUT ')
console.log(inputLF);
console.log('------------')

debugPrint(inputLF, 2);
debugPrint(inputLF, 3);
debugPrint(inputLF, 4);
debugPrint(inputLF, 5);

const replaceLF = inputLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceLF);
console.log('------------')

debugPrint(replaceLF, 2);
debugPrint(replaceLF, 3);
debugPrint(replaceLF, 4);
debugPrint(replaceLF, 5);

console.log(`charcode ${replaceLF.charCodeAt(2)} : ${replaceLF.charAt(2)}`);
console.log(`charcode ${replaceLF.charCodeAt(3)} : ${replaceLF.charAt(3)}`);
console.log(`charcode ${replaceLF.charCodeAt(4)} : ${replaceLF.charAt(4)}`);
console.log(`charcode ${replaceLF.charCodeAt(5)} : ${replaceLF.charAt(5)}`);

console.log('------------')
console.log('inputLF === replaceLF :', inputLF === replaceLF)

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex} charcode: ${str.charCodeAt(charIndex)} character: ${str.charAt(charIndex)}`);
}

Each line ends with char code 10, which is the Line Feed (LF) character, written as \n in a string literal. Before and after the replacement, the two strings are the same - they not only look the same but are actually equal to each other, so the replacement did nothing.

Now let's examine the other case:
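As a side note, printing one char code per index gets verbose. A more compact way to see the invisible characters is to run the string through JSON.stringify, which prints line breaks as escape sequences. The following is only a sketch; the helper name showEscapes is made up for illustration and is not part of the snippets above:

```javascript
// Hypothetical helper: JSON.stringify renders control characters such as
// LF and CR as visible escape sequences (\n, \r).
function showEscapes(str) {
  return JSON.stringify(str);
}

const lfInput = "===\n\nHELLO\n\nWOLRD\n\n===";
const lfReplaced = lfInput.replace(/^\s+$/gm, "");

console.log(showEscapes(lfInput));
console.log(showEscapes(lfReplaced));
console.log("unchanged:", lfInput === lfReplaced);
```

The same helper can be applied to any of the other strings in this answer to see exactly which line break characters they contain.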

const inputCRLF =
`===

HELLO

WOLRD

===`.replace(/\r?\n/g, "\r\n")

console.log('------------ INPUT ')
console.log(inputCRLF);
console.log('------------')

debugPrint(inputCRLF, 2);
debugPrint(inputCRLF, 3);
debugPrint(inputCRLF, 4);
debugPrint(inputCRLF, 5);
debugPrint(inputCRLF, 6);
debugPrint(inputCRLF, 7);

const replaceCRLF = inputCRLF.replace(/^\s+$/gm, '');

console.log('------------ REPLACEMENT')
console.log(replaceCRLF);
console.log('------------')

debugPrint(replaceCRLF, 2);
debugPrint(replaceCRLF, 3);
debugPrint(replaceCRLF, 4);
debugPrint(replaceCRLF, 5);

function debugPrint(str, charIndex) {
  console.log(`index: ${charIndex} charcode: ${str.charCodeAt(charIndex)} character: ${str.charAt(charIndex)}`);
}

This time each line ends with char code 13, which is the Carriage Return (CR) character, written as \r in a string literal, and then the LF follows. After the replacement, instead of the sequence =\r\n\r\nH, there is now just =\r\nH. Let's look at why.

Here is what MDN says about the meta character ^:

Matches the beginning of input. If the multiline flag is set to true, also matches immediately after a line break character.

And here is what MDN says about the meta character $:

Matches the end of input. If the multiline flag is set to true, also matches immediately before a line break character.

So they match after and before a line break character. By that, MDN means either the LF or the CR. This can be seen if we test strings that contain different line breaks:

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStart = /^\s/m;
const regexEnd = /\s$/m;

console.log(regexStart.exec(stringLF));
console.log(regexStart.exec(stringCRLF));

console.log(regexEnd.exec(stringLF));
console.log(regexEnd.exec(stringCRLF));

If we try to match whitespace next to a line break, nothing matches when there is only an LF, but with CRLF the pattern ^\s matches the LF and \s$ matches the CR. So, in that case ^ and $ would match here:

"hello\r\nworld"
        ^^ what `^\s` matches

"hello\r\nworld"
      ^^ what `\s$` matches
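LF and CR are the two line break characters that matter here, but the ECMAScript specification lists four line terminators in total: LF, CR, U+2028 (line separator) and U+2029 (paragraph separator). A minimal sketch that checks the anchors against each of them; the variable names are illustrative only:

```javascript
// The four code points ECMAScript treats as line terminators.
const lineTerminators = {
  LF: "\n",
  CR: "\r",
  LS: "\u2028",
  PS: "\u2029",
};

// For each terminator, check whether `^` matches right after it and
// whether `$` matches right before it (multiline mode).
const results = Object.entries(lineTerminators).map(([name, ch]) => ({
  name,
  caretAfter: /^b/m.test("a" + ch + "b"),
  dollarBefore: /a$/m.test("a" + ch + "b"),
}));

console.table(results);

// A plain space is not a line terminator, so neither anchor matches around it.
console.log(/^b/m.test("a b"), /a$/m.test("a b")); // false false
```

Every row should report true in both columns, which is why each half of a CRLF pair counts as a line boundary on its own.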

So both ^ and $ recognise either character of the CRLF sequence as a line break. This makes a difference when you do a search and replace. Since your regex specifies ^\s+$, a line that consists entirely of \r\n will match. But for a reason that is not obvious:

const re = /^\s+$/m;

const stringLF = "hello\n\nworld";
const stringCRLF = "hello\r\n\r\nworld";

console.log(re.exec(stringLF));
console.log(re.exec(stringCRLF));

So, the regex doesn't match an \r\n but rather the \n\r (two whitespace characters) sitting between two other line break characters. That's because + is greedy and will consume as much of the character sequence as it can get away with. Here is what the regex engine will try, somewhat simplified for brevity:

input = "hello\r\n\r\nworld"
regex = /^\s+$/m

(earlier start positions are skipped: none of them can satisfy both `^` and `\s`)

Step 1
hello\r[]\n\r\nworld
    the position right after `\r` matches `^`, symbol satisfied -> continue with next symbol in regex

Step 2
hello\r[\n]\r\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 3
hello\r[\n\r]\nworld
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 4
hello\r[\n\r\n]world
    matches `^\s+` -> continue matching to satisfy `+` quantifier

Step 5
hello\r[\n\r\nw]orld
    `w` does not match `\s` -> backtrack

Step 6
hello\r[\n\r\n]world
    matches `^\s+`, quantifier satisfied -> continue to next symbol in regex

Step 7
hello\r[\n\r\n]world
    `$` does not match, the next character is `w` -> backtrack, give up one character

Step 8
hello\r[\n\r]\nworld
    matches `^\s+$`, `$` is satisfied by the `\n` that follows -> finish
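The outcome of this search can be checked directly by inspecting the match object. A small sketch (variable names are illustrative), using JSON.stringify so that the line break characters are visible:

```javascript
const text = "hello\r\n\r\nworld";
const match = /^\s+$/m.exec(text);

// Where the match starts and what it contains.
console.log(match.index);              // 6
console.log(JSON.stringify(match[0])); // "\n\r"

// The characters on either side of the match are what satisfy `^` and `$`.
const before = text[match.index - 1];
const after = text[match.index + match[0].length];
console.log(JSON.stringify(before), JSON.stringify(after)); // "\r" "\n"
```

The match is the LF of the first pair plus the CR of the second pair, bounded by the remaining CR on the left and the remaining LF on the right. Removing it leaves a single \r\n behind, which is why the blank line disappears.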

Lastly, there is something slightly hidden here - it matters that you're matching whitespace. This is because \s behaves differently to most other symbols in that it explicitly matches a line break character, whereas . will not:

Matches any single character except line terminators

So, if you specify \s$, this will match the CR in \r\n because the regex engine is forced to look for a match for both \s and $, therefore it finds the \r before the \n. However, this will not happen for many other patterns, since $ will usually be satisfied when it's before the CR (or at the end of the string).

The same goes for ^\s: it will explicitly look for a whitespace character after a line break, which is satisfied by the LF in CRLF. However, if you're not seeking that, then it will happily match after the LF:

const stringLF = "hello\nworld";
const stringCRLF = "hello\r\nworld";

const regexStartAll = /^./mg;
const regexEndAll = /.$/gm;

console.log(stringLF.match(regexStartAll));
console.log(stringCRLF.match(regexStartAll));

console.log(stringLF.match(regexEndAll));
console.log(stringCRLF.match(regexEndAll));

So, all of this means that ^\s+$ has some unintuitive behaviour, yet it is perfectly coherent once you understand that the regex engine matches exactly what you tell it to.
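If the goal is only to clear spaces and tabs from otherwise blank lines, one possible way to avoid the problem is to stop the pattern from consuming CR or LF at all. The following is a sketch under that assumption, not the only option; the names BLANK_LINE_WHITESPACE and clearWhitespaceOnlyLines are made up for illustration:

```javascript
// [^\S\r\n] reads as "whitespace, but not CR or LF", so the pattern can
// consume spaces, tabs and similar characters but never a CR or an LF.
// (The rarer U+2028/U+2029 separators are not excluded here.)
const BLANK_LINE_WHITESPACE = /^[^\S\r\n]+$/gm;

// Hypothetical helper name, for illustration only.
function clearWhitespaceOnlyLines(text) {
  return text.replace(BLANK_LINE_WHITESPACE, "");
}

const lf = "===\n  \nHELLO\n\t\nWOLRD\n\n===";
const crlf = lf.replace(/\n/g, "\r\n");

console.log(JSON.stringify(clearWhitespaceOnlyLines(lf)));
console.log(JSON.stringify(clearWhitespaceOnlyLines(crlf)));
```

With this character class the blank lines keep their line breaks in both the LF and the CRLF variant, and only the stray spaces and tabs are removed.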

