How do I match Unicode characters in Java?

Question

I'm trying to match Unicode characters in Java.

Input string: informa

String to match: informátion

So far I've tried this:

    Pattern p = Pattern.compile("informa[\u0000-\uffff].*",
            Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
    String s = "informátion";
    Matcher m = p.matcher(s);
    if (m.matches()) {
        System.out.println("Match!");
    } else {
        System.out.println("No match");
    }

It comes out as "No match". Any ideas?

Answer 1

The term "Unicode characters" is not specific enough. A pattern for it would match every character in the Unicode range, including "normal" ASCII characters. The term is, however, very often used when one actually means "characters which are not in the printable ASCII range".

In regex terms, written as a Java string literal, that would be [^\\x20-\\x7E].

boolean containsNonPrintableASCIIChars = string.matches(".*[^\\x20-\\x7E].*");
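As a rough sketch of the same check in a runnable form, the class below precompiles the character class and uses find() instead of padding the pattern with ".*". The class and method names (AsciiCheck, containsNonPrintableAscii) are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this example.
final class AsciiCheck {
    // Any character outside the printable ASCII range U+0020..U+007E.
    private static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");

    static boolean containsNonPrintableAscii(String s) {
        // find() looks for a single offending character anywhere in the input,
        // so the pattern does not need leading or trailing ".*".
        return NON_PRINTABLE_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonPrintableAscii("information"));      // false
        System.out.println(containsNonPrintableAscii("inform\u00e1tion")); // true
    }
}
```

One reason to prefer find() here: "." does not match line terminators by default, so the matches(".*[^\\x20-\\x7E].*") form can return false for input that contains several line breaks, while find() still reports the character.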

Depending on what you'd like to do with this information, here are some useful follow-up answers:

Answer 2

Is it because informa isn't a substring of informátion at all?

How would your code work if you removed the last a from informa in your regex?
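To make that point concrete, here is a small sketch comparing the two prefixes. It assumes the á in the input is stored as the single precomposed code point U+00E1, and it leaves out the flags from the question to keep the comparison simple. PrefixDemo and fullMatch are illustrative names only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are assumptions for this example.
final class PrefixDemo {
    // True only if the whole input matches the regex, like Matcher.matches().
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String s = "inform\u00e1tion"; // "informátion" with a precomposed á

        // The seventh character is á, not a, so the literal prefix "informa" fails.
        System.out.println(fullMatch("informa.*", s)); // false

        // Dropping the last a lets ".*" consume "átion".
        System.out.println(fullMatch("inform.*", s));  // true
    }
}
```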

Answer 3

It sounds like you want to match letters while ignoring diacritical marks. If that's right, then normalize your strings to NFD form, strip out the diacritical marks, and then do your search.

String normalized = java.text.Normalizer.normalize(textToSearch, java.text.Normalizer.Form.NFD);
String withoutDiacritical = normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
// Search code goes here...
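The snippet above stops before the search itself. One possible way to finish it, assuming a plain prefix search is what is wanted, is sketched below. DiacriticFolding, stripDiacritics, and startsWithIgnoringDiacritics are hypothetical names, and case folding is deliberately left out.

```java
import java.text.Normalizer;

// Hypothetical helper class; names are assumptions for this example.
final class DiacriticFolding {
    // Decompose to NFD, then drop the combining marks that NFD separated out.
    static String stripDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Prefix search that treats "á" and "a" as the same letter.
    static boolean startsWithIgnoringDiacritics(String text, String prefix) {
        return stripDiacritics(text).startsWith(stripDiacritics(prefix));
    }

    public static void main(String[] args) {
        String textToSearch = "inform\u00e1tion"; // "informátion"
        System.out.println(stripDiacritics(textToSearch));                         // information
        System.out.println(startsWithIgnoringDiacritics(textToSearch, "informa")); // true
    }
}
```

Both the text and the search term go through the same stripping step, so it does not matter which of the two carries the accent.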

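NFD (canonical decomposition) splits each accented character into its base letter followed by one or more combining marks, so á becomes a followed by U+0301.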

solution1
12 ACCEPTED 2010-06-23 16:11:51

solution2
6 2010-06-23 16:04:57

solution3
1 2015-10-01 16:03:51
