Regular expression to catch letters beyond a-z

A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.

Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

You can use \pL to match any 'letter', which will support all letters in all languages. You can narrow it down to specific languages using 'named blocks'. More information can be found in the Character Classes documentation on MSDN.
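To get a feel for what \pL covers, here is a minimal Python sketch. As far as I know Python's standard re module does not accept \p{...} escapes, so the sketch emulates \pL by checking that each character's Unicode general category starts with "L". The helper name is_letters_only is made up for this example, and normalising to NFC first is my own assumption so that a base letter plus a combining accent is treated as one letter where a precomposed form exists.

```python
import unicodedata

def is_letters_only(text: str) -> bool:
    """Return True if text is non-empty and every character is a Unicode letter (category L*)."""
    # Assumption: compose "o" + combining diaeresis into "ö" before checking.
    normalized = unicodedata.normalize("NFC", text)
    return bool(normalized) and all(
        unicodedata.category(ch).startswith("L") for ch in normalized
    )

print(is_letters_only("smörgåsbord"))  # Swedish letters count as letters
print(is_letters_only("Ελληνικά"))     # so do Greek letters
print(is_letters_only("abc123"))       # digits do not
```

Note that this accepts letters from every script at once, which may be more permissive than you want for a single locale.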

My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
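A rough sketch of that idea in Python, with everything in it hypothetical: the LETTER_CLASS_BY_LOCALE table stands in for real localised resource files, and build_name_pattern and the "words separated by a space or hyphen" shape are just an example of a larger pattern.

```python
import re

# Hypothetical per-locale "letter" fragments; in a real application these
# would be loaded from localised resources rather than hard-coded.
LETTER_CLASS_BY_LOCALE = {
    "en": "a-zA-Z",
    "sv": "a-zåäöA-ZÅÄÖ",
}

def build_name_pattern(locale: str) -> re.Pattern:
    """Wrap the locale's letter class in a larger pattern: words joined by a space or hyphen."""
    # Fall back to the English fragment when the locale is unknown.
    letters = LETTER_CLASS_BY_LOCALE.get(locale, LETTER_CLASS_BY_LOCALE["en"])
    return re.compile(rf"[{letters}]+(?:[ -][{letters}]+)*")

swedish = build_name_pattern("sv")
english = build_name_pattern("en")
print(bool(swedish.fullmatch("Åsa Öberg")))  # True
print(bool(english.fullmatch("Åsa Öberg")))  # False
```

Keeping only the letter fragment in the resource means the surrounding pattern is written once and translators only touch the alphabet.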

What about \p{name}?

Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.
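If you want to see which group a given character falls into, a small Python sketch like the following can help. It uses the standard unicodedata module; the describe helper is a name I made up. As far as I know the standard library has no direct block lookup, so the character name is printed as a hint instead (an assumption that the name reflects the block, which holds for these samples).

```python
import unicodedata

def describe(ch: str) -> tuple[str, str]:
    """Return (general category, official Unicode name) for a single character."""
    return unicodedata.category(ch), unicodedata.name(ch)

# Sample characters: Latin letters, a digit, a space, a Greek letter, a box-drawing symbol.
for ch in ["a", "Å", "5", " ", "α", "─"]:
    category, name = describe(ch)
    print(repr(ch), category, name)
```

The two-letter codes printed here (Ll, Lu, Nd, Zs, So) are the same category names that \p{name} accepts.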

I don't know enough about Unicode, but maybe your characters fit a Unicode class?

See character categories selection with \p and \w unicode semantics.

This regex allows only ASCII letters, spaces, and the Latin-1 characters from À to ÿ through:

[a-zA-ZÀ-ÿ ]
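A quick check of that range in Python (a sketch, with sample strings of my own choosing) shows two things worth knowing: À-ÿ spans U+00C0 to U+00FF, which includes the non-letters × (U+00D7) and ÷ (U+00F7), and it stops before letters such as Ł or ź that sit outside Latin-1.

```python
import re

latin1_pattern = re.compile(r"[a-zA-ZÀ-ÿ ]+")

print(bool(latin1_pattern.fullmatch("smörgåsbord")))  # True: ö and å are inside À-ÿ
print(bool(latin1_pattern.fullmatch("a × b")))        # True: × is inside the range too
print(bool(latin1_pattern.fullmatch("Łódź")))         # False: Ł and ź are outside it
```

So the range is fine for most Western European text, but it is neither strictly letters-only nor complete for all Latin-script languages.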

All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.

The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."

Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

This is not, in general, possible.

After all, English text does include some accented characters (e.g. in "fête" and "naïve", which in UK English still take accents to be strictly correct). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).

Then consider that foreign words are often included (this will often be the case where technical terms are used). Quotations would be another source.

If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic experience in that language.
