Beginning and end of words in sed and grep

I don't understand the difference between \b and \< in GNU sed and GNU grep. It seems to me that \b can always replace \< and \> without changing the set of matching strings.

More specifically, I am trying to find examples in which \bsomething and \<something do not match exactly the same strings.

Same question for something\b and something\>.

Thank you

I suspect that it very rarely makes a difference whether you use (the more common) \b or (the more specific) \< and \>, but I can think of an example where it would. This is quite contrived, and I suspect that in most real-world regex use it wouldn't make a difference, but this should demonstrate that it at least could make a difference in some cases.

If I have the following text:

this is his pig

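and I want to know whether /\bis\b/ matches, it wouldn't matter if I used /\<is\>/ instead: both match the standalone word 'is'. The reversed form /\>is\</ is not equivalent, though: \> needs a non-word character after it, so it can never match directly in front of the 'i'.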

But what if my text was instead

is this his pig

Here the 'is' is the first word of the line, so no word ends before it: there is no word-final boundary in front of the 'is', only a word-initial boundary. /\bis\b/ matches, and of course /\<is\>/ does too, but /\>is\</ does not, because the directional anchors have to face the right way while \b accepts either kind of boundary.
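The two sample lines can be checked directly. This is a minimal sketch, assuming GNU grep is installed as `grep` (\b, \< and \> are GNU extensions); the helper name `count` is made up for illustration and simply reports how many input lines match.

```shell
# Assumes GNU grep. count TEXT PATTERN prints the number of matching lines.
# (count is a hypothetical helper, not part of grep.)
count() { printf '%s\n' "$1" | grep -c -e "$2" || true; }

t1='this is his pig'
t2='is this his pig'

# Try the generic boundary, the directional anchors, and the reversed anchors.
for p in '\bis\b' '\<is\>' '\>is\<'; do
  printf '%-8s t1=%s t2=%s\n' "$p" "$(count "$t1" "$p")" "$(count "$t2" "$p")"
done
```

With GNU grep the first two patterns should report 1 for both lines, and the reversed pattern should report 0 for both, since \> cannot be followed by a word character.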

In real life, though, I think it is not common that you really need to be able to make this distinction, which is why (at least outside of sed) \b is the normal word boundary marker for regular expressions.

\< matches the transition from non-word to word.

\> matches the transition from word to non-word.

\b is equivalent to (\<|\>) in extended regex.

So I won't say \b and \< are the same. I'd say \b is a superset of \<: it matches everywhere \< does, and at the end of a word as well. The same holds for \b and \>.
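One way to see the superset relation is to follow the anchor with a non-word character, so that only an end-of-word boundary can precede it. A small sketch, assuming GNU grep; the `count` helper and the sample string `a-b` are made up for illustration.

```shell
# Assumes GNU grep. count TEXT PATTERN prints the number of matching lines.
# (count is a hypothetical helper, not part of grep.)
count() { printf '%s\n' "$1" | grep -c -e "$2" || true; }

s='a-b'

count "$s" '\b-'   # 1: the boundary after "a" is a word end, which \b accepts
count "$s" '\>-'   # 1: \> is exactly that word-end boundary
count "$s" '\<-'   # 0: \< needs a word character right after it, and "-" is not one
```

So \bsomething and \<something can only differ when something may start with a non-word character; when it starts with a word character, the boundary in front of it is necessarily a word start.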

I stumbled upon such an example earlier.
\<.\> matches a one-letter word.
Using \b you would need to write something like \b[^ ]\b, because \b.\b also matches a space between two words.
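The one-letter-word case is easy to try with grep -o, which prints each match on its own line. A sketch assuming GNU grep and GNU sed; the sample sentence is made up, and the final sed call only wraps each match in brackets so that matched spaces are visible.

```shell
# Assumes GNU grep (for -o, \b, \<, \>).
s='I am a cat'

# Directional anchors: only the one-letter words are printed.
printf '%s\n' "$s" | grep -o '\<.\>'

# Generic boundaries: the single spaces between words are printed too.
printf '%s\n' "$s" | grep -o '\b.\b' | sed 's/.*/[&]/'

# Excluding the space restores the one-letter-word behaviour.
printf '%s\n' "$s" | grep -o '\b[^ ]\b'
```

The first and third commands should print only I and a, while the second also prints bracketed single spaces.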

According to LinuxTopia, the only difference between the two types of word boundaries is that \< and \> work in most sed versions, whereas \b works only if your system is using gsed (GNU sed).

And a quotation from the wiki:

These symbols include '\<' and '\>' (gsed, ssed, sed15, sed16, sedmod) and '\b' and '\B' (gsed only).

Other than that the two are identical. Also, here is a table that explains all possible scenarios that use word boundaries:

  Match position    Possible word boundaries        HHsed   GNU sed
  ------------------------------------------------------------------
  start of word     [nonword char]^[word char]      \<      \< or \b
  end of word       [word char]^[nonword char]      \>      \> or \b
  middle of word    [word char]^[word char]         none    \B
  outside of word   [nonword char]^[nonword char]   none    \B
  ------------------------------------------------------------------
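The two \B rows of the table can be checked the same way. A minimal sketch, assuming GNU grep; the `count` helper and the sample string (note the two spaces in the middle) are made up for illustration.

```shell
# Assumes GNU grep. count TEXT PATTERN prints the number of matching lines.
# (count is a hypothetical helper, not part of grep.)
count() { printf '%s\n' "$1" | grep -c -e "$2" || true; }

s='ab  cd'   # two spaces between the words

count "$s" 'a\Bb'   # 1: middle of word, between two word characters
count "$s" ' \B '   # 1: outside of word, between two non-word characters
count "$s" 'b\B '   # 0: the end of "ab" is a \b position, so \B cannot match there
```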
