Java Regular Expression with International Letters

Here's my current code:

return str.matches("^[A-Za-z\\-'. ]+");

I want it to include international letters. How do I do that in Java?

Thanks.

It seems that you want to match all alphabetic characters. Typically you would do that with the POSIX character class \\p{Alpha}, extended by the punctuation you also want to permit. As the Java regular expressions documentation says, by default it matches ASCII only.

However, what the documentation does not say clearly is that you can make this class work with Unicode characters. To do that, you need to turn Unicode character class matching on.
You can do this in one of two ways:

  1. By passing the Pattern.UNICODE_CHARACTER_CLASS flag when compiling the pattern:
    Pattern p = Pattern.compile("^[\\p{Alpha}\\-'. ]+", Pattern.UNICODE_CHARACTER_CLASS);
  2. By using the (?U) embedded flag expression:
    str.matches("^(?U)[\\p{Alpha}\\-'. ]+");
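
To see what the flag changes, here is a minimal sketch that compiles the same expression with and without it. The class and method names are made up for illustration, and the test word is written with Unicode escapes only to avoid source-encoding problems:

```java
import java.util.regex.Pattern;

class UnicodeAlphaDemo {
    // Default behaviour: \p{Alpha} covers only the ASCII letters.
    static final Pattern ASCII_NAME =
            Pattern.compile("^[\\p{Alpha}\\-'. ]+");

    // Same expression with Unicode character class matching turned on.
    static final Pattern UNICODE_NAME =
            Pattern.compile("^[\\p{Alpha}\\-'. ]+", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isAsciiName(String s) {
        return ASCII_NAME.matcher(s).matches();
    }

    static boolean isUnicodeName(String s) {
        return UNICODE_NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The Polish word used in the proof of concept, spelled with escapes.
        String polish = "\u017B\u00F3\u0142\u0107";
        System.out.println(isAsciiName(polish));   // false
        System.out.println(isUnicodeName(polish)); // true
    }
}
```

Compiling the Pattern once and reusing it also avoids recompiling the expression on every call, which is what String.matches does.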

Proof of concept:

String[] test = {"Jean-Marie Le'Blanc", "Żółć", "Ὀδυσσεύς", "原田雅彦"};
for (String str : test) {
    System.out.print(str.matches("^(?U)[\\p{Alpha}\\-'. ]+") + " ");
}

The obvious result is:

true true true true

Even if that looks correct, there are two additional points to consider:

  • 原田雅彦 (Masahiko Harada) is composed of ideographic characters. Strictly speaking, these are not alphabetic characters, yet the pattern accepts them.
  • You want to match the dot (.) symbol. That is fine, but please consider matching the ideographic full stop (。) as well.
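
As a sketch of the second point, the ideographic full stop (U+3002) can be added to the character class. The class and method names here are hypothetical, and whether you want to accept this character at all depends on your data:

```java
class FullStopDemo {
    // The pattern from the proof of concept, unchanged.
    static boolean isNameOriginal(String s) {
        return s.matches("^(?U)[\\p{Alpha}\\-'. ]+");
    }

    // Same class plus the ideographic full stop, written as a regex-level
    // escape for code point U+3002.
    static boolean isNameWithIdeographicStop(String s) {
        return s.matches("^(?U)[\\p{Alpha}\\-'. \\u3002]+");
    }

    public static void main(String[] args) {
        // Two ideographs followed by an ideographic full stop.
        String text = "\u539F\u7530\u3002";
        System.out.println(isNameOriginal(text));            // false
        System.out.println(isNameWithIdeographicStop(text)); // true
    }
}
```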

I assume you want to match alphabetic characters other than the ASCII letters A-Z. You can do this with the \\p{IsAlphabetic} Unicode character class:

return str.matches("^[\\p{IsAlphabetic}\\-'. ]+");

You'll find more Unicode character classes in the full documentation.

Replace the pattern with:

"^[\\p{L}\\-'. ]+"

\\p{L} includes all Unicode letters.
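
The two classes suggested so far, \\p{L} and \\p{IsAlphabetic}, are close but not identical: the Alphabetic property also covers letter numbers such as Roman numeral characters, which \\p{L} excludes. A small sketch (class and method names are illustrative only):

```java
class LetterVsAlphabetic {
    // General category "Letter" only.
    static boolean isAllLetters(String s) {
        return s.matches("\\p{L}+");
    }

    // Unicode Alphabetic binary property, which is broader than \p{L}.
    static boolean isAllAlphabetic(String s) {
        return s.matches("\\p{IsAlphabetic}+");
    }

    public static void main(String[] args) {
        String polish = "\u017B\u00F3\u0142\u0107"; // ordinary letters
        String romanEight = "\u2167";               // ROMAN NUMERAL EIGHT, a letter number

        System.out.println(isAllLetters(polish));        // true
        System.out.println(isAllAlphabetic(polish));     // true
        System.out.println(isAllLetters(romanEight));    // false
        System.out.println(isAllAlphabetic(romanEight)); // true
    }
}
```

For validating names, \\p{L} is the stricter of the two; which one fits depends on what you count as a letter.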

Use the regex \\p{L} to match any letter (national or international). Note the lowercase p: \\P{L} is the negation and matches any character that is not a letter.

With [\\p{L}&&[^\\p{IsLatin}]], you can match all letters that are not Latin.
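
A minimal sketch of that intersection, assuming you want whole strings made only of non-Latin letters (the class and method names are hypothetical):

```java
class NonLatinDemo {
    // Letters (\p{L}) intersected with "not in the Latin script".
    static boolean isNonLatinLetters(String s) {
        return s.matches("[\\p{L}&&[^\\p{IsLatin}]]+");
    }

    public static void main(String[] args) {
        System.out.println(isNonLatinLetters("\u0416\u03B4")); // Cyrillic + Greek letter: true
        System.out.println(isNonLatinLetters("abc"));          // ASCII Latin: false
        System.out.println(isNonLatinLetters("\u00E9"));       // accented Latin letter: false
    }
}
```

Accented letters such as é still belong to the Latin script, so this class rejects them along with plain A-Z.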

For Greek specifically, the regex syntax has \\p{InGreek} to match characters in the Greek block and \\P{InGreek} (the difference is the capital P) to match characters outside it.
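
One caveat worth checking against your data: the In prefix selects a Unicode block, while the Is prefix selects a script. Polytonic letters such as the first character of Ὀδυσσεύς live in the Greek Extended block, so \\p{InGreek} does not cover them, whereas \\p{IsGreek} does. A short sketch with made-up class and method names:

```java
class GreekDemo {
    // Block test: only characters in the basic Greek block.
    static boolean inGreekBlock(String s) {
        return s.matches("\\p{InGreek}+");
    }

    // Script test: any character whose script is Greek, in any block.
    static boolean isGreekScript(String s) {
        return s.matches("\\p{IsGreek}+");
    }

    public static void main(String[] args) {
        String delta = "\u03B4";            // GREEK SMALL LETTER DELTA
        String omicronPsili = "\u1F48";     // GREEK CAPITAL LETTER OMICRON WITH PSILI

        System.out.println(inGreekBlock(delta));         // true
        System.out.println(inGreekBlock(omicronPsili));  // false
        System.out.println(isGreekScript(omicronPsili)); // true
        System.out.println("a".matches("\\P{InGreek}")); // true
    }
}
```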

The question cannot be answered completely unless you say what you mean by "international letters", but the general solution is to use named character classes, via the \\p{name} syntax. There are many named character classes. Some are defined by the regex language, and others by the Unicode standard. Refer to the Pattern javadocs for a partial list, and to the relevant Unicode standard.
