
Matching accented characters with JavaScript regexes

Here's a fun snippet I ran into today:

/\ba/.test("a") --> true
/\bà/.test("à") --> false

However,

/à/.test("à") --> true

Firstly, wtf?

Secondly, if I want to match an accented character at the start of a word, how can I do that? (I'd really like to avoid using over-the-top selectors like /(?:^|\s|'|\(\) .... )

This worked for me:

/^[a-z\u00E0-\u00FC]+$/i

With help from here

/\bà/.test("à") doesn't match because "à" is not a word character. The escape sequence \b matches only at a boundary between a word character and a non-word character. /\ba/.test("a") matches because "a" is a word character, so there is a boundary between the beginning of the string (which is not a word character) and the letter "a". In the string "à", both sides of every position are non-word, so there is no boundary anywhere.

Word characters in JavaScript regexes are defined as [a-zA-Z0-9_].
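To see the ASCII-only definition in action, here is a minimal sketch. The `wordStart` pattern and the `startsWord` helper are hypothetical names, a hand-rolled stand-in for `\b` at the start of a word built from lookarounds and Unicode property escapes. It assumes an engine that supports the `u` flag and lookbehind.

```javascript
// \w only covers [A-Za-z0-9_], so "à" is not a word character.
const asciiWord = /\w/;
console.log(asciiWord.test("a")); // true
console.log(asciiWord.test("à")); // false

// Hypothetical Unicode-aware "start of word" check (not a built-in):
// no letter, digit, or underscore before, and a letter right after.
const wordStart = /(?<![\p{L}\p{N}_])(?=\p{L})/u;

function startsWord(text, prefix) {
  // Escape regex metacharacters in the prefix before embedding it.
  const escaped = prefix.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(wordStart.source + escaped, "u").test(text);
}

console.log(startsWord("voilà à table", "à")); // true
console.log(startsWord("voilà", "à"));         // false
```

The second call fails because the only "à" in "voilà" is preceded by a letter, so it does not start a word.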

To match an accented character at the start of a string, just use the ^ anchor at the beginning of the regex (e.g. /^à/). It matches the beginning of the string (unlike \b, which matches at any word boundary within the string). It's one of the most basic, standard regular expression constructs, so it's definitely not over the top.

Stack Overflow has also had an issue with non-ASCII characters in regexes; you can find it here. It doesn't deal with word boundaries, but it may still give you useful hints.

There is another page, but its author wants to match strings, not words.

I don't know of an anchor that solves your problem, and I couldn't find one just now. But given the monster regexes used in my first link, the group you want to avoid is not over the top and is, in my opinion, your solution.

const regex = /^[\-/A-Za-z\u00C0-\u017F ]+$/;
const test1 = regex.test("à");
const test2 = regex.test("Martinez-Cortez");
const test3 = regex.test("Leonardo da vinci");
const test4 = regex.test("ï");
console.log('test1', test1);
console.log('test2', test2);
console.log('test3', test3);
console.log('test4', test4);

Building off of Wak's and Cœur's answer:

/^[\-/A-Za-z\u00C0-\u017F ]+$/

Works for spaces and dashes too.

Example: Leonardo da vinci, Martinez-Cortez
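As a rough sketch of how a pattern like this might be wrapped for validation (the `isLatinName` helper name is made up for illustration): the range U+00C0 to U+017F also contains the two signs × (U+00D7) and ÷ (U+00F7), so this variant assumes you want those excluded and splits the range around them.

```javascript
// Same idea as the pattern above, minus × (U+00D7) and ÷ (U+00F7).
const namePattern = /^[\-\/A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u017F ]+$/;

// Hypothetical helper name, used only for this example.
function isLatinName(input) {
  return namePattern.test(input);
}

console.log(isLatinName("Martinez-Cortez"));   // true
console.log(isLatinName("Leonardo da vinci")); // true
console.log(isLatinName("ï"));                 // true
console.log(isLatinName("a×b"));               // false
```

Whether the slash belongs in the allowed set depends on your data; it is kept here only because the original pattern includes it.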

If you want to match letters, whether or not they're accented, Unicode property escapes can be helpful.

/\p{Letter}/u.test("à"); // true
/\p{Letter}/u.test('œ'); // true
/\p{Letter}/u.test('a'); // true
/\p{Letter}/u.test('3'); // false

Matching at the start of a word is tricky, but (?<=(?:^|\s)) seems to do the trick. The (?<= ) is a positive lookbehind, which requires that something precede the main expression. The (?: ) is a non-capturing group, so you don't end up with a reference to this part in whatever match you use later. The ^ matches the start of the string if the multiline flag isn't set, or the start of a line if it is, and \s matches a whitespace character (space/tab/line break).

So using them together, it would look something like:

/(?<=(?:^|\s))\p{Letter}*/u
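Here is a small sketch of that lookbehind pulling words out of a longer string. It assumes lookbehind and `String.prototype.matchAll` are available, and the `wordsIn` helper name is invented for the example. It uses `+` rather than `*` so that empty matches are skipped.

```javascript
// Collect every word that begins at the start of the string or after
// whitespace. The g flag lets matchAll walk through the whole input.
const wordAtStart = /(?<=^|\s)\p{Letter}+/gu;

function wordsIn(text) {
  return Array.from(text.matchAll(wordAtStart), (m) => m[0]);
}

console.log(wordsIn("à la carte")); // [ 'à', 'la', 'carte' ]
console.log(wordsIn("l'été"));      // [ 'l' ]
```

The second call shows the limit of this approach: "été" follows an apostrophe, not whitespace, so it is not treated as the start of a word.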

If you want to match only words that start with an accented character, you'd want a negated character set for a-zA-Z as the first character.

/(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u.test("bœ") // false
/(?<=(?:^|\s))[^a-zA-Z]\p{Letter}*/u.test("œb") // true

// Match characters, accented or not
let regex = /\p{Letter}+$/u;
console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // true
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false
console.log(regex.test("16 tons")); // true
console.log(regex.test("3 œ")); // true
console.log('-----');

// Match characters to start of line, only match characters
regex = /(?<=(?:^|\s))\p{Letter}+$/u;
console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // true
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false
console.log('----');

// Match accented character to start of word, only match characters
regex = /(?<=(?:^|\s))[^a-zA-Z]\p{Letter}+$/u;
console.log(regex.test("œb")); // true
console.log(regex.test("bœb")); // false
console.log(regex.test("àbby")); // true
console.log(regex.test("à3")); // false
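One caveat with `[^a-zA-Z]`: it also accepts a digit or punctuation mark as the first character. A possible tighter variant, sketched below, keeps `\p{Letter}` for the first character and uses a negative lookahead to rule out plain ASCII letters. The names `accentedStart` and `negatedSet` are only for illustration, and note that this treats any non-ASCII letter (Greek or Cyrillic included) as "accented".

```javascript
// First character: a letter that is NOT in a-zA-Z; rest: any letters.
const accentedStart = /(?<=^|\s)(?![a-zA-Z])\p{Letter}\p{Letter}*/u;

console.log(accentedStart.test("œb"));  // true
console.log(accentedStart.test("bœb")); // false
console.log(accentedStart.test("3b"));  // false

// For comparison, the negated-set version accepts a leading digit.
const negatedSet = /(?<=^|\s)[^a-zA-Z]\p{Letter}*/u;
console.log(negatedSet.test("3b"));     // true
```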

Unicode allows two alternative but equivalent representations of some accented characters. For example, é has two Unicode representations: '\u00E9' and '\u0065\u0301'. The former is called the composed form and the latter the decomposed form. JavaScript can convert between the two:

'\u00E9'.normalize('NFD') // decompose: '\u00E9' -> '\u0065\u0301'
'\u0065\u0301'.normalize('NFC') // compose: '\u0065\u0301' -> '\u00E9'
'\u00E9'.length // composed form: -> 1
'\u0065\u0301'.length // decomposed form: -> 2 (looks identical but has different representation)
'\u00E9' == '\u0065\u0301' // -> false (composed and decomposed strings are not equal)
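Because the two forms compare as unequal, a comparison usually needs both sides normalized first. A minimal sketch, with `sameText` as a made-up helper name and NFC chosen arbitrarily (NFD on both sides would work as well):

```javascript
// Compare two strings after bringing both to the same normalization form.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

const composed = "\u00E9";         // é as one code point
const decomposed = "\u0065\u0301"; // e + combining acute accent

console.log(composed === decomposed);        // false
console.log(sameText(composed, decomposed)); // true
```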

The code point '\u0301' belongs to the Unicode Combining Diacritical Marks block, U+0300 to U+036F. So one way to match these accented characters is to compare them in decomposed form:

// matching accented characters
/[a-zA-Z][\u0300-\u036f]+/.test('é'.normalize('NFD')) // -> true
/\b\u00E9/.test('\u00E9') // -> false (composed é is not a word character)
/\be\u0301/.test('\u00E9'.normalize('NFD')) // -> true (NOTE: the regex uses the decomposed form)

// matching accented words
/^\w+$/.test('résumé') // -> false
/^(?:[a-zA-Z][\u0300-\u036f]*)+$/.test('résumé'.normalize('NFD')) // -> true
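The same decomposition can also be used to drop the accents before matching, which lets plain `\w` and `\b` work again. This is a sketch of that alternative rather than something from the answers above, and `stripAccents` is a hypothetical helper name. It only helps for letters that actually decompose: characters such as œ or ø have no canonical decomposition and pass through unchanged.

```javascript
// Decompose, then drop every combining mark in U+0300 to U+036F.
function stripAccents(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

const word = "r\u00E9sum\u00E9"; // "résumé" in composed form

console.log(stripAccents(word));               // "resume"
console.log(/^\w+$/.test(stripAccents(word))); // true
console.log(/\be/.test(stripAccents("\u00E9"))); // true
```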


 