
Why does \B work but not \b

I want to match a word that ends with #, for example:

hi hello# world#

I tried to use a word boundary:

\b\w+#\b

and it doesn't match. I thought \b was a non-word boundary, but that doesn't seem to be the case here.


Surprisingly,

\b\w+#\B

matches!

So why does \B work here and not \b? And why doesn't \b work in this case?


NOTE: Yes, we can use \b\w+#(?=\s|$), but I want to know why \B works in this case!

Definition of word boundary \b

Defining a word boundary in words is imprecise. Let me define it with look-ahead, look-behind, and the shorthand word character class \w.

A word boundary \b is equivalent to:

(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

Which means:

  • Right ahead, there is a word character, and right behind, there is no word character (either the preceding character is not a word character, or this is the start of the string).

    OR

  • Right behind, there is a word character, and right ahead, there is no word character (either the following character is not a word character, or this is the end of the string).

(Note how similar this is to the expansion of XOR into conjunctions and a disjunction: exactly one side is a word character.)
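As a quick sanity check, the equivalence can be tested by listing every position at which each pattern matches. This is only a sketch: it assumes Python's re module (the question does not name a regex flavor), and positions is a hypothetical helper introduced for illustration.

```python
import re

def positions(pattern, text):
    """Return every index at which a zero-width pattern matches in text."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "hi hello# world#"

# \b and its look-around expansion should match at the same positions.
boundary = positions(r"\b", text)
expanded = positions(r"(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))", text)

print(boundary)              # [0, 2, 3, 8, 10, 15]
print(boundary == expanded)  # True
```

Note that position 9 (between # and the space) and position 16 (after the final #) are absent from the list, which is why a trailing \b fails in the question's pattern.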

A non-word boundary \B is equivalent to:

(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))

Which means:

  • Right ahead and right behind, there is no word character. Note that under this definition, an empty string counts as a non-word boundary.

    OR

  • Right ahead and right behind, both sides are word characters. Note that this branch requires 2 characters, i.e. it cannot occur at the beginning or the end of a non-empty string.

(Note how similar this is to the expansion of XNOR into conjunctions and a disjunction: both sides agree.)
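The same kind of check works for \B. Again this is a sketch under assumptions: Python's re module stands in for whatever flavor the asker uses, positions is a hypothetical helper, and a non-empty string is used because engines may differ on how \B treats an empty string.

```python
import re

def positions(pattern, text):
    """Return every index at which a zero-width pattern matches in text."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "hi hello# world#"

# \B and its look-around expansion should match at the same positions.
non_boundary = positions(r"\B", text)
expanded = positions(r"(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))", text)

print(non_boundary)              # [1, 4, 5, 6, 7, 9, 11, 12, 13, 14, 16]
print(non_boundary == expanded)  # True
```

Positions 9 and 16, the two spots right after a #, both appear here, each through the "no word character on either side" branch.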

Definition of word character \w

Since the definitions of \b and \B depend on the definition of \w [1], you need to consult the documentation of your regex flavor to know exactly what \w matches.

[1] Most regex flavors define \b based on \w. Well, except for Java [Point 9], where in default mode, \w is ASCII-only and \b is partially Unicode-aware.

Answer to the question

With the definition above, answering the question becomes easy:

"hi hello# world#"

In hello#, the character after # is a space (U+0020, in the Zs category), which is not a word character, and # is not a word character itself (in Unicode, it is in the Po category). Therefore, \B can match here. The branch (?<!\w)(?!\w) is used in this case.

In world#, the end of the string comes after #. Since # is not a word character, and there is no word character ahead (there is nothing there), \B can match the empty string just after #. The branch (?<!\w)(?!\w) is also used in this case.
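The explanation above can be reproduced directly. The sketch below assumes Python's re module (the original question does not state a flavor) and runs the two patterns from the question, plus the look-ahead alternative from its NOTE, against the sample string.

```python
import re

text = "hi hello# world#"

# The trailing \b needs a word character on exactly one side of the
# position after '#'. Neither side has one, so nothing matches.
with_b = re.findall(r"\b\w+#\b", text)

# The trailing \B succeeds there: no word character on either side.
with_B = re.findall(r"\b\w+#\B", text)

# The look-ahead alternative from the question's NOTE.
with_lookahead = re.findall(r"\b\w+#(?=\s|$)", text)

print(with_b)          # []
print(with_B)          # ['hello#', 'world#']
print(with_lookahead)  # ['hello#', 'world#']
```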

Addendum

Alan Moore gives quite a good summary in a comment:

I think the key point to remember is that regexes can't read. That is, they don't deal in words, only in characters. When we say \b matches the beginning or end of a word, we don't mean it identifies a word and then seeks out its endpoints, like a human would. All it can see is the character before the current position and the character after the current position. Thus, \b only indicates that the current position could be a word boundary. It's up to you to make sure the characters on either side are what they should be.

The pound symbol # is not considered a word character.

\b\w+#\b doesn't work because # is not a word character: the final \b needs a word character on exactly one side of the position after #, and here a space or the end of the string follows, so it will not match world#.
\b\w+6\b, on the other hand, does match world6, because 6 is a word character and the final \b sits between 6 and the end of the string.

"Word characters" are commonly defined as [A-Za-z0-9_] (the exact set depends on the regex flavor and its Unicode settings).

Simply put: \b allows you to perform a "whole words only" search using a regular expression of the form \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".
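A "whole words only" search of the form \bword\b might look like the sketch below. It assumes Python's re module, and the sample sentence is made up for illustration.

```python
import re

text = "cat concat catalog cat."

# Without boundaries, 'cat' is also found inside 'concat' and 'catalog'.
loose = [m.start() for m in re.finditer(r"cat", text)]

# With \b on both sides, only stand-alone occurrences are found.
whole = [m.start() for m in re.finditer(r"\bcat\b", text)]

print(loose)  # [0, 7, 11, 19]
print(whole)  # [0, 19]
```

The last occurrence still matches because the following "." is a non-word character, so there is a word boundary between "t" and ".".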

http://www.regular-expressions.info/wordboundaries.html

The # and the space are both non-word characters, so the invisible boundary between them is not a word boundary. Therefore \b will not match there and \B will.
