Regular expression to catch letters beyond a-z

A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.

Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

You can use \p{L} (written \pL in some other regex flavors) to match any 'letter', which will support all letters in all languages. You can narrow it down to specific languages using 'named blocks'. More information can be found in the Character Classes documentation on MSDN.
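As a rough illustration of the "any letter" idea outside .NET: Python's built-in re module does not support \p{...} escapes, so this sketch checks the Unicode general category directly through unicodedata instead. The helper names is_letter and only_letters are made up for the example.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True if ch is in one of the Unicode letter categories (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def only_letters(text: str) -> bool:
    """True if text is non-empty and every character in it is a letter."""
    return bool(text) and all(is_letter(ch) for ch in text)


if __name__ == "__main__":
    for word in ["abc", "\u00e5\u00e4\u00f6", "abc1", "a b"]:
        print(repr(word), only_letters(word))
```

Note that combining marks are not in a letter category, so text stored in decomposed form (a base letter followed by a separate accent) would need normalizing first, for instance with unicodedata.normalize("NFC", text).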

My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
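A minimal sketch of that idea, assuming the "resource" is just an in-memory table keyed by language code. The table LETTER_CLASSES, the function letters_pattern, and the fallback to "en" are all invented for the illustration; a real application would load the fragments from its own resource files.

```python
import re

# Hypothetical localised resources: the "letter" fragment for each locale.
LETTER_CLASSES = {
    "en": "a-zA-Z",
    "sv": "a-zåäöA-ZÅÄÖ",
}


def letters_pattern(locale: str) -> re.Pattern:
    """Build a 'letters only' pattern from the fragment stored for the locale."""
    # Assumption: unknown locales fall back to the plain English fragment.
    fragment = LETTER_CLASSES.get(locale, LETTER_CLASSES["en"])
    return re.compile(f"[{fragment}]+")


if __name__ == "__main__":
    print(bool(letters_pattern("sv").fullmatch("Åsa")))  # Swedish letters allowed
    print(bool(letters_pattern("en").fullmatch("Åsa")))  # rejected by plain a-z
```

Only the letter fragment is stored per locale, so the surrounding pattern stays in one place.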

What about \p{name}?

Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.

I don't know enough about Unicode, but maybe your characters fit a Unicode class?
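To get a feel for what those names mean, here is a small sketch that imitates the lookup in Python, where the built-in re module has no \p{name} support. Category names such as Ll, Nd, and Z are compared against unicodedata.category, and a block name is treated as a code point range. The function matches_name and the one-entry BLOCKS table are made up for the example; the range U+0370 to U+03FF is the Greek and Coptic block.

```python
import unicodedata

# Hypothetical block table; only the single block used in this example is listed.
BLOCKS = {"IsGreek": (0x0370, 0x03FF)}


def matches_name(ch: str, name: str) -> bool:
    """Rough imitation of \\p{name} for one character."""
    if name in BLOCKS:
        low, high = BLOCKS[name]
        return low <= ord(ch) <= high
    # A one-letter name such as "Z" covers every subcategory (Zs, Zl, Zp).
    return unicodedata.category(ch).startswith(name)


if __name__ == "__main__":
    for ch, name in [("a", "Ll"), ("7", "Nd"), (" ", "Z"), ("\u03a9", "IsGreek")]:
        print(repr(ch), name, matches_name(ch, name))
```

The Swedish å, ä, and ö are all in category Ll and their capitals in Lu, so a letter category covers them without listing them by hand.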

See character category selection with \p, and the Unicode semantics of \w.

This regex allows only valid symbols through:

[a-zA-ZÀ-ÿ ]
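One caveat with a range like this is worth checking. À-ÿ spans U+00C0 to U+00FF, which also contains the multiplication sign × (U+00D7) and the division sign ÷ (U+00F7), and it stops short of letters outside Latin-1 such as the Polish ł (U+0142). A quick sketch with Python's re module, where the name allowed is invented for the example:

```python
import re

# Same class as above, with the range endpoints spelled out as code points:
# U+00C0 is À and U+00FF is ÿ.
PATTERN = re.compile("[a-zA-Z\u00c0-\u00ff ]+")


def allowed(text: str) -> bool:
    """True if every character of text falls inside the character class."""
    return PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["Åsa Öberg", "3 \u00d7 4", "a\u00d7b", "Micha\u0142"]:
        print(repr(sample), allowed(sample))
```

So the class works for Swedish names, but it is neither limited to letters nor complete for other European alphabets.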

All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.

The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."

Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

This is not, in general, possible.

After all, English text does include some accented characters (e.g. "fête" and "naïve", which in UK English, to be strictly correct, still take their accents). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).

Then consider that foreign words may be included (this will often be the case where technical terms are used). Quotations would be another source.

If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic expertise in that language.
