Why does \B work but not \b?
I wanted to match a word that ends with #, like:
hi hello# world#
I tried to use a word boundary:
\b\w+#\b
and it doesn't match. I thought \b was a non-word boundary, but that doesn't seem to be so in this case.
Surprisingly,
\b\w+#\B
matches!
So why does \B work here and not \b? Also, why doesn't \b work in this case?
NOTE: Yes, we can use
\b\w+#(?=\s|$)
but I want to know why \B works in this case!
\b definition
Defining a word boundary in words is imprecise. Let me define the word boundary with look-ahead, look-behind, and the shorthand word character class \w.
A word boundary \b is equivalent to:
(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
Which means:
Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).
OR
Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).
(Note how similar this is to the expansion of XOR into conjunction and disjunction.)
A non-word boundary \B is equivalent to:
(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))
Which means:
Right ahead and right behind, we cannot find any word character. Note that the empty string is considered a non-word boundary under this definition.
OR
Right ahead and right behind, there are word characters on both sides.
(Note how similar this is to the expansion of XNOR into conjunction and disjunction.)
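The two expansions can be checked empirically. The sketch below uses Python's re module, which is an assumption here (the question does not name a regex flavor), and positions is a hypothetical helper introduced only for this illustration. It collects every index at which a zero-width pattern matches and compares \b and \B against their look-around expansions.

```python
import re

# Look-around expansions of \b and \B, copied from the definitions above.
WORD_BOUNDARY = r"(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))"
NON_WORD_BOUNDARY = r"(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))"


def positions(pattern, text):
    """Return every index at which the zero-width pattern matches in text."""
    return [m.start() for m in re.finditer(pattern, text)]


text = "hi hello# world#"

boundary_at = positions(r"\b", text)
non_boundary_at = positions(r"\B", text)

# Each shorthand matches at exactly the same indices as its expansion.
assert boundary_at == positions(WORD_BOUNDARY, text)
assert non_boundary_at == positions(NON_WORD_BOUNDARY, text)

# Every index from 0 to len(text) is one or the other, never both.
assert sorted(boundary_at + non_boundary_at) == list(range(len(text) + 1))

print(boundary_at)      # [0, 2, 3, 8, 10, 15]
print(non_boundary_at)  # [1, 4, 5, 6, 7, 9, 11, 12, 13, 14, 16]
```

Indices 9 and 16 (right after each #) land in the \B list, which is the situation the question runs into.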
\w definition
Since the definition of \b and \B depends on the definition of \w [1], you need to consult the specific documentation to know exactly what \w matches.
[1] Most regex flavors define \b based on \w. The exception is Java [Point 9], where in default mode \w is ASCII-only and \b is partially Unicode-aware.
In JavaScript, \w would be [A-Za-z0-9_] in default mode.
In .NET, \w by default would match [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Nd}\p{Pc}], and it will have the same behaviour as JavaScript if the ECMAScript option is specified. Of the characters in the Pc category, you only have to know that space (ASCII 32) is not included.
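As a small illustration of how much the definition of \w matters, here is a sketch in Python. Python is not one of the flavors discussed above, so treat it purely as an example: for str patterns its \w is Unicode-aware by default, and the re.ASCII flag restricts it to [a-zA-Z0-9_]. The boundary positions reported by \b shift accordingly.

```python
import re

text = "h\u00e9llo"  # "héllo", written with a precomposed e-acute (U+00E9)

# Default mode for str patterns: the accented letter is a word character.
unicode_words = re.findall(r"\w+", text)              # ['héllo']

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so the word is split.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)  # ['h', 'llo']

# \b follows whatever \w means in the current mode.
unicode_boundaries = [m.start() for m in re.finditer(r"\b", text)]
ascii_boundaries = [m.start() for m in re.finditer(r"\b", text, re.ASCII)]

print(unicode_boundaries)  # [0, 5]
print(ascii_boundaries)    # [0, 1, 2, 5]
```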
With the definition above, answering the question becomes easy:
"hi hello# world#"
In hello#, after # is a space (U+0020, in the Zs category), which is not a word character, and # is not a word character itself (in Unicode, it is in the Po category). Therefore, \B can match here. The branch (?<!\w)(?!\w) is used in this case.
In world#, after # is the end of the string. Since # is not a word character, and we cannot find any word character ahead (there is nothing there), \B can match the empty string just after #. The branch (?<!\w)(?!\w) is also used in this case.
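The behaviour from the question can be reproduced with a short sketch. Python's re module is used here as an assumption, since the question does not say which flavor it targets; the variable names are illustrative only.

```python
import re

text = "hi hello# world#"

# The pattern from the question: \b after # needs a word character on
# exactly one side, but both sides are non-word here, so nothing matches.
with_b = re.findall(r"\b\w+#\b", text)

# Replacing the trailing \b with \B accepts non-word on both sides.
with_B = re.findall(r"\b\w+#\B", text)

# The look-ahead alternative mentioned in the question's note.
with_lookahead = re.findall(r"\b\w+#(?=\s|$)", text)

print(with_b)          # []
print(with_B)          # ['hello#', 'world#']
print(with_lookahead)  # ['hello#', 'world#']

# The branch (?<!\w)(?!\w) on its own succeeds right after each #:
# index 9 sits between "#" and the space, index 16 is the end of the string.
branch = re.compile(r"(?<!\w)(?!\w)")
after_first_hash = branch.match(text, 9) is not None
after_last_hash = branch.match(text, 16) is not None
print(after_first_hash, after_last_hash)  # True True
```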
Alan Moore gives quite a good summary in the comment:
"I think the key point to remember is that regexes can't read. That is, they don't deal in words, only in characters. When we say \b matches the beginning or end of a word, we don't mean it identifies a word and then seeks out its endpoints, like a human would. All it can see is the character before the current position and the character after the current position. Thus, \b only indicates that the current position could be a word boundary. It's up to you to make sure the characters on either side are what they should be."
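A quick way to see that \b only looks at the two neighbouring characters is to put a word character directly after the #. This is a sketch in Python (an assumed flavor, not one named in the question), and the sample string is made up for the illustration.

```python
import re

# \b after # succeeds as soon as a word character follows the #,
# whether or not a human would call "hello#" a word.
glued = re.findall(r"\b\w+#\b", "hello#world foo# bar")

print(glued)  # ['hello#']
```

Here foo# is skipped because a space follows its #, while hello# matches only because the w of world happens to sit next to it.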
The pound symbol # is not considered a "word character".
\b\w+#\b doesn't work because \w+# is not considered a word, therefore it will not match world#.
\b\w+6\b, on the other hand, is (6 is a word character), therefore it will match world6.
"Word characters" are defined by: [A-Za-z0-9_].
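The contrast between the two patterns can be sketched as follows, again assuming Python's re module and a made-up sample string:

```python
import re

text = "world6 world#"

# 6 is a word character, so the position after it (before the space)
# is a word boundary and \b succeeds.
ends_with_digit = re.findall(r"\b\w+6\b", text)

# # is not a word character and nothing follows it, so \b fails.
ends_with_hash = re.findall(r"\b\w+#\b", text)

print(ends_with_digit)  # ['world6']
print(ends_with_hash)   # []
```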
"Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters"."
Source: http://www.regular-expressions.info/wordboundaries.html
The # and the space are both non-word characters, so the invisible boundary between them is not a word boundary. Therefore \b will not match it and \B will match it.