Javascript regular expression problem with \b and international characters

I'm having a lot of problems with a simple regular expression match.

I have this string with accented characters (this is just an example), "Botó Entrepà Nadó Facebook! ", and I want to match words in it using words from another list.

This is a simplified version of my code. For example, to match "Botó":

var matchExpr = new RegExp ('\\b' + 'Botó' + '\\b','i'); 
"Botó Entrepà Nadó Facebook! ".match(matchExpr);

If I run it, it doesn't match "Botó" as expected (Firefox, IE and Chrome).

I thought it was an error on my side. But here comes the fun...

If I modify the string like this, "Botón Entrepà Nadó Facebook! " (notice the "n" after "Botó"), and run the same code:

var matchExpr = new RegExp ('\\b' + 'Botó' + '\\b','i'); 
"Botón Entrepà Nadó Facebook! ".match(matchExpr);

It matches "Botó"!!!!????? (at least in Firefox). This doesn't make sense to me, as "n" is NOT a word boundary (which is what \b matches).

If you try to match the whole word:

var matchExpr = new RegExp ('\\b' + 'Botón' + '\\b','i'); 
"Botón Entrepà Nadó Facebook! ".match(matchExpr);

It works.

To make it a little weirder, we add another accented letter at the end.

var matchExpr = new RegExp ('\\b' + 'Botóñ' + '\\b','i'); 
"Botóñ Entrepà Nadó Facebook! ".match(matchExpr);

If we try to match this, it matches nothing. BUT, if we try this:

var matchExpr = new RegExp ('\\b' + 'Botóñ' + '\\b','i'); 
"Botóña Entrepà Nadó Facebook! ".match(matchExpr);

it matches "Botóñ", which is wrong.

If we try to match "Facebook", it works as expected. If you try to match words with accents in the middle, it works as expected. But if you try to match words with an accent at the end, it fails.

What am I doing wrong? Is this the expected behaviour?

Unfortunately, the shorthand character classes in Javascript don't support Unicode (or even high ASCII).

Take a look at the answers to this question: Javascript + Unicode. The article linked in that question, JavaScript, Regex, and Unicode, says that \b matches a word boundary, with the terms defined as follows:

→ Word character: the characters A-Z, a-z, 0-9, and _ only.
→ Word boundary: the position between a word character and a non-word character.

So it will work for words with A-Z, a-z, 0-9, or _ at the end, but not for words with accented characters at the end.
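A minimal sketch of what that definition means for the strings in the question, assuming the source is saved with precomposed accented characters (the variable names are only for illustration):

```javascript
// \w (and therefore \b) only knows about A-Z, a-z, 0-9 and _.
const isWordO = /\w/.test('o');        // plain ASCII letter
const isWordAccented = /\w/.test('ó'); // accented letter: treated as a non-word character

const pattern = new RegExp('\\b' + 'Botó' + '\\b', 'i');

// "ó" followed by a space: non-word next to non-word, so there is no boundary and no match.
const noMatch = 'Botó Entrepà Nadó Facebook! '.match(pattern);

// "ó" followed by "n": non-word next to word, so \b succeeds and "Botó" matches.
const oddMatch = 'Botón Entrepà Nadó Facebook! '.match(pattern);

console.log(isWordO, isWordAccented); // true false
console.log(noMatch);                 // null
console.log(oddMatch && oddMatch[0]); // Botó
```

This is why an accent in the middle of a word is harmless (the \b checks only look at the first and last characters), while an accent at the end flips the result.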

From the ES3 spec:

The internal helper function IsWordChar takes an integer parameter e and performs the following:

  1. If e == -1 or e == InputLength, return false.
  2. Let c be the character Input[e].
  3. If c is one of the sixty-three characters in the table below, return true.

     a b c d e f g h i j k l m n o p q r s t u v w x y z
     A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
     0 1 2 3 4 5 6 7 8 9 _

  4. Return false.

The "IsWordChar()" internal (possibly hypothetical) function is the basis of behavior for the "\b" assertion.
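The steps above can be sketched in plain Javascript. This is only an illustration of the spec text, not how any engine actually implements it; `isWordChar` and `isWordBoundary` are hypothetical names, and the boundary rule (exactly one neighbour is a word character) is the usual reading of the \b assertion.

```javascript
// The 63 characters listed in the spec table.
const WORD_CHARS =
  'abcdefghijklmnopqrstuvwxyz' +
  'ABCDEFGHIJKLMNOPQRSTUVWXYZ' +
  '0123456789_';

// Mirrors the spec steps: out-of-range index -> false, listed character -> true, else false.
function isWordChar(input, e) {
  if (e === -1 || e === input.length) return false;
  return WORD_CHARS.includes(input[e]);
}

// \b holds at position e when exactly one of the two neighbours is a word character.
function isWordBoundary(input, e) {
  return isWordChar(input, e - 1) !== isWordChar(input, e);
}

console.log(isWordBoundary('Botó Entrepà', 4));  // false: 'ó' and ' ' are both non-word
console.log(isWordBoundary('Botón Entrepà', 4)); // true: 'ó' is non-word, 'n' is word
console.log(isWordBoundary('Facebook!', 8));     // true: 'k' is word, '!' is non-word
```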

edit: it's no better in ES5.
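One possible workaround, which goes beyond what the answers above cover: in engines that support ES2018 lookbehind and Unicode property escapes (the `u` flag), the boundary can be spelled out by hand instead of relying on \b. The helpers `escapeRegExp` and `wholeWord` below are hypothetical names, and the choice of what counts as a word character (letters, combining marks, digits, underscore) is an assumption you may want to adjust.

```javascript
// Hypothetical helper: escape regex metacharacters so the word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: match `word` only when it is not surrounded by
// letters, combining marks, digits or underscores from any script.
function wholeWord(word) {
  const wordChar = '[\\p{L}\\p{M}\\p{N}_]';
  return new RegExp(
    '(?<!' + wordChar + ')' + escapeRegExp(word) + '(?!' + wordChar + ')',
    'iu'
  );
}

const found = 'Botó Entrepà Nadó Facebook! '.match(wholeWord('Botó'));
const partial = 'Botón Entrepà Nadó Facebook! '.match(wholeWord('Botó'));

console.log(found && found[0]); // Botó
console.log(partial);           // null
```

In older engines without lookbehind or `\p{...}`, the same idea needs an explicit character class listing the accented letters you care about.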
