Matching accented characters with Javascript regexes

Here's a fun snippet I ran into today:

/\ba/.test("a") --> true
/\bà/.test("à") --> false

However,

/à/.test("à") --> true

Firstly, wtf?

Secondly, if I want to match an accented character at the start of a word, how can I do that? (I'd really like to avoid using over-the-top selectors like /(?:^|\s|'|\(\) .... )

This worked for me:

/^[a-z\u00E0-\u00FC]+$/i

With help from here

The reason /\bà/.test("à") doesn't match is that "à" is not a word character. The escape sequence \b matches only at a boundary between a word character and a non-word character. /\ba/.test("a") matches because "a" is a word character, so there is a boundary between the beginning of the string (which is not a word character) and the letter "a".

Word characters in JavaScript regexes are defined as [a-zA-Z0-9_].
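A quick way to see this for yourself is to check \w directly and to look at where \b actually matches. This is only a minimal sketch; the helper names `isWordChar` and `boundaryIndexes` and the sample strings are made up for illustration.

```javascript
// \w is ASCII-only, so an accented letter is not a "word character".
const isWordChar = (ch) => /^\w$/.test(ch);

console.log(isWordChar("a"));      // true
console.log(isWordChar("\u00E0")); // false ("à", precomposed)

// Collect the index of every position where \b matches.
const boundaryIndexes = (s) => [...s.matchAll(/\b/g)].map((m) => m.index);

// In "ab" the boundaries are at the start and the end of the word.
console.log(boundaryIndexes("ab"));      // [ 0, 2 ]
// In "àb" there is no boundary before "à"; the first one sits between "à" and "b".
console.log(boundaryIndexes("\u00E0b")); // [ 1, 2 ]
```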

To match an accented character at the start of a string, just use the ^ character at the beginning of the regex (e.g. /^à/). That character means the beginning of the string (unlike \b, which matches at any word boundary within the string). It's the most basic, standard regular expression there is, so it's definitely not over the top.

Stack Overflow also had an issue with non-ASCII characters in regexes; you can find it here. It doesn't deal with word boundaries, but it may give you useful hints anyway.

There is another page, but the asker there wants to match strings, not words.

I don't know of an anchor that solves your problem, and I couldn't find one just now. But having seen the monster regexes used in my first link, I'd say the group you want to avoid is not over the top and is, in my opinion, your solution.

const regex = /^[\-/A-Za-z\u00C0-\u017F ]+$/;
const test1 = regex.test("à");
const test2 = regex.test("Martinez-Cortez");
const test3 = regex.test("Leonardo da vinci");
const test4 = regex.test("ï");
console.log('test1', test1);
console.log('test2', test2);
console.log('test3', test3);
console.log('test4', test4);

Building off of Wak's and Cœur's answer:

/^[\-/A-Za-z\u00C0-\u017F ]+$/

Works for spaces and dashes too.

Example: Leonardo da vinci, Martinez-Cortez
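For reference, here is that character-class approach wrapped in a small helper. The names `NAME_RE` and `isPlainName` are made up for this sketch. One caveat worth knowing: the range \u00C0-\u017F also contains × (U+00D7) and ÷ (U+00F7), which are not letters, so they pass too.

```javascript
// Hypothetical helper: ASCII letters, U+00C0..U+017F, spaces, dashes and slashes only.
const NAME_RE = /^[\-/A-Za-z\u00C0-\u017F ]+$/;
const isPlainName = (s) => NAME_RE.test(s);

console.log(isPlainName("Leonardo da vinci")); // true
console.log(isPlainName("Martinez-Cortez"));   // true
console.log(isPlainName("C\u0153ur"));         // true  ("Cœur")
console.log(isPlainName("R2-D2"));             // false (digits are not allowed)
console.log(isPlainName("a \u00D7 b"));        // true  ("×" falls inside the range)
```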

If you want to match letters, whether or not they're accented, Unicode property escapes can be helpful.

/\p{Letter}*/u.test("à"); // true
/\p{Letter}/u.test('œ'); // true
/\p{Letter}/u.test('a'); // true
/\p{Letter}/u.test('3'); // false
/\p{Letter}/u.test('a'); // true

Matching the start of a word is tricky, but (?<=(?:^|\s)) seems to do the trick. The (?<= ) is a positive lookbehind, ensuring that something exists before the main expression. The (?: ) is a non-capturing group, so you don't end up with a reference to this part in whatever match you use later. The ^ matches the start of the string if the multiline flag isn't set, or the start of a line if it is set, and the \s matches a whitespace character (space/tab/line break).

So using them together, it would look something like:

/(?<=(?:^|\s))\p{Letter}*/u
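To see what the lookbehind does on a whole sentence, here is a small sketch that collects every run of letters that starts at the beginning of the string or right after whitespace. The helper name `wordsAfterWhitespace` and the sample text are assumptions for illustration, and `+` is used instead of `*` so that empty matches are skipped.

```javascript
// Collect letter runs that begin at the start of the string or after whitespace.
const wordsAfterWhitespace = (text) =>
  [...text.matchAll(/(?<=(?:^|\s))\p{Letter}+/gu)].map((m) => m[0]);

console.log(wordsAfterWhitespace("\u00E0 la carte, \u00E9t\u00E9 2024"));
// [ 'à', 'la', 'carte', 'été' ]

// Limitation: a word that follows punctuation instead of whitespace is missed.
console.log(wordsAfterWhitespace("l'\u00E9t\u00E9"));
// [ 'l' ]
```

If words after quotes or brackets matter, the lookbehind needs extra alternatives, which is the kind of pattern the question hoped to avoid.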

If you only want to match accented characters at the start of a word, you'd want a negated character set for a-zA-Z.

/(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u.test("bœ") // false
/(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u.test("œb") // true
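Note that [^a-zA-Z] on its own also accepts digits and punctuation as the first character. One possible way to tighten it, sketched here as an assumption rather than as part of the answer above, is to add a lookahead so that the first character must still be a letter. The names `loose` and `strict` are made up for the comparison.

```javascript
// The pattern as given: first character is anything outside a-zA-Z.
const loose = /(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u;
// Variant: first character must be a letter AND be outside a-zA-Z.
const strict = /(?<=(?:^|\s))(?=\p{Letter})[^a-zA-Z]\p{Letter}*/u;

console.log(loose.test("3b"));       // true:  "3" is not in a-zA-Z
console.log(strict.test("3b"));      // false: "3" is not a letter
console.log(strict.test("\u0153b")); // true  ("œb")
console.log(strict.test("b\u0153")); // false ("bœ")
```

Even the stricter variant accepts any non-ASCII letter (Greek, Cyrillic and so on), not only accented Latin ones.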

// Match characters, accented or not
let regex = /\p{Letter}+$/u;
console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // true
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false
console.log(regex.test("16 tons")); // true
console.log(regex.test("3 œ")); // true
console.log('-----');

// Match characters to start of line, only match characters
regex = /(?<=(?:^|\s))\p{Letter}+$/u;
console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // true
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false
console.log('----');

// Match accented character to start of word, only match characters
regex = /(?<=(?:^|\s))[^a-zA-Z]\p{Letter}+$/u;
console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // false
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false

Unicode allows two alternative but equivalent representations of some accented characters. For example, é has two Unicode representations: '\u00E9' and '\u0065\u0301'. The former is called the composed form and the latter the decomposed form. JavaScript allows conversion between the two:

'\u00E9'.normalize('NFD') // decompose: '\u00E9' -> '\u0065\u0301'
'\u0065\u0301'.normalize('NFC') // compose: '\u0065\u0301' -> '\u00E9'
'\u00E9'.length // composed form: -> 1
'\u0065\u0301'.length // decomposed form: -> 2 (looks identical but has a different representation)
'\u00E9' == '\u0065\u0301' // -> false (composed and decomposed strings are not equal)

The code point '\u0301' belongs to the Unicode Combining Diacritical Marks block, 0300-036F. So one way to match these accented characters is to compare them in decomposed form:

// matching accented characters
/[a-zA-Z][\u0300-\u036f]+/.test('é'.normalize('NFD')) // -> true
/\b\u00E9/.test('\u00E9') // -> false (composed é is not a word character)
/\be\u0301/.test('\u00E9'.normalize('NFD')) // -> true (NOTE: the regex uses the decomposed form, e + U+0301)

// matching accented words
/^\w+$/.test('résumé') // -> false
/^(?:[a-zA-Z][\u0300-\u036f]*)+$/.test('résumé'.normalize('NFD')) // -> true
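Putting this together for the original question, here is a rough sketch of a "starts with an accented letter" check based on the decomposed form. `startsWithAccentedLetter` is a hypothetical name. It only recognises letters that decompose into an ASCII base letter plus combining marks, so a ligature such as œ, which has no decomposition, is not detected.

```javascript
// Decompose first, then look for an ASCII letter followed by at least one combining mark.
const startsWithAccentedLetter = (word) =>
  /^[a-zA-Z][\u0300-\u036f]+/.test(word.normalize("NFD"));

console.log(startsWithAccentedLetter("\u00E9t\u00E9"));    // true  ("été")
console.log(startsWithAccentedLetter("resume"));           // false
console.log(startsWithAccentedLetter("r\u00E9sum\u00E9")); // false ("résumé" starts with a plain "r")
console.log(startsWithAccentedLetter("\u0153uvre"));       // false ("œuvre": œ does not decompose)
```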
