How can I use Java regex with Turkish characters in UTF-8?

I'm trying to run a regex search in Java, but I'm having trouble when the text being searched is Turkish. For example:

Search Text = "Ahmet Yıldırım" or "Esin AYDEMİR" 

// The regex string comes from part of an e-mail address (e.g. yildirim@example.com), and I look for it in the name.
Regex Strings = "yildirim" or "aydemir"

The searched text changes dynamically, so how can I solve this with a Java regex pattern? Alternatively, how do I convert Turkish characters to their ASCII equivalents (e.g. AYDEMİR -> AYDEMIR, or Yıldırım -> Yildirim)?

Sorry about my grammar mistakes!

Use the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags:

Pattern p = Pattern.compile("yildirim", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

By default, Pattern.CASE_INSENSITIVE only matches case-insensitively for characters in the US-ASCII character set. Pattern.UNICODE_CASE modifies this behavior so that matching is case-insensitive for all Unicode characters.

Do note that Unicode case-insensitive matching in Java regex is done in a culture-insensitive (locale-independent) manner. Therefore, ı, i, I and İ are all considered the same character, which is why "yildirim" matches "Yıldırım".
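
A minimal sketch of these flags applied to the names from the question. The class and method names (TurkishCaseInsensitiveDemo, containsIgnoringCase) are made up for illustration, and the search term is assumed to contain no regex metacharacters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original answer.
class TurkishCaseInsensitiveDemo {

    // Returns true if 'needle' occurs anywhere in 'text',
    // ignoring case for all Unicode characters (not just US-ASCII).
    static boolean containsIgnoringCase(String needle, String text) {
        Pattern p = Pattern.compile(needle, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsIgnoringCase("yildirim", "Ahmet Yıldırım")); // true
        System.out.println(containsIgnoringCase("aydemir", "Esin AYDEMİR"));    // true
        System.out.println(containsIgnoringCase("yildirim", "Esin AYDEMİR"));   // false
    }
}
```

find() is used rather than matches() because the search term is only part of the full name.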

Depending on your use case, you might want to use Pattern.LITERAL to disable all metacharacters in the pattern, or escape only the literal parts of the pattern with Pattern.quote().
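
Since the search text changes dynamically, it may contain characters such as "." that regex treats specially. The sketch below shows both options side by side; the class and method names are hypothetical, and the sample inputs are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample inputs are assumptions for illustration.
class QuotedSearchDemo {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    // Option 1: quote the dynamic text so every character is matched literally.
    static boolean containsQuoted(String needle, String text) {
        return Pattern.compile(Pattern.quote(needle), FLAGS).matcher(text).find();
    }

    // Option 2: treat the whole pattern as a literal via the LITERAL flag.
    // The case-insensitivity flags still apply in combination with LITERAL.
    static boolean containsLiteral(String needle, String text) {
        return Pattern.compile(needle, Pattern.LITERAL | FLAGS).matcher(text).find();
    }

    public static void main(String[] args) {
        // The dot must match a real dot, not "any character".
        System.out.println(containsQuoted("esin.aydemir", "ESİN.AYDEMİR"));  // true
        System.out.println(containsQuoted("esin.aydemir", "ESİN AYDEMİR"));  // false
        System.out.println(containsLiteral("esin.aydemir", "ESİN.AYDEMİR")); // true
    }
}
```

Pattern.quote() is the more flexible of the two when only part of the pattern comes from user input, because the quoted piece can be concatenated with ordinary regex syntax.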

The question in your comment is more complicated than the original one.

You can use

string = Normalizer.normalize(string, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");

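to convert "İÖÜŞÇĞıöüşçğ" to "IOUSCGıouscg", which is already sufficient for a case-insensitive match, as pointed out by nhahtdh. NFD splits each accented letter into a base letter plus a combining mark, and \\p{Mn} removes the marks; ı has no decomposition, so it is left unchanged. If you want to perform a case-sensitive match, you have to add a .replace('ı', 'i') to match ı with i.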
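
Put together, the conversion asked for in the question could look like the sketch below. The class and method names (TurkishAsciiFoldDemo, foldToAscii) are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper class; the names are illustrative only.
class TurkishAsciiFoldDemo {

    // Strips diacritics and maps the dotless ı (U+0131) to a plain i.
    static String foldToAscii(String input) {
        // NFD decomposes e.g. İ into I followed by a combining dot above.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{Mn} matches the combining (non-spacing) marks left behind by NFD.
        String stripped = decomposed.replaceAll("\\p{Mn}", "");
        // The dotless ı has no decomposition, so it is replaced explicitly.
        return stripped.replace('\u0131', 'i');
    }

    public static void main(String[] args) {
        System.out.println(foldToAscii("Ahmet Yıldırım")); // Ahmet Yildirim
        System.out.println(foldToAscii("Esin AYDEMİR"));   // Esin AYDEMIR
    }
}
```

The folded string can then be searched with an ordinary ASCII pattern.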

I am using this pattern.

public static boolean isAlphaNumericWithWhiteSpace(String text) {
    return text != null && text.matches("^[\\p{L}\\p{N}ın\\s]*$");
}

\\p{L} matches a single code point in the category "letter", in any script, so Turkish letters such as ı are already covered.

\\p{N} matches any kind of numeric character in any script.
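
A short sketch of how these classes behave on Turkish input. The class name is hypothetical, and the pattern is a simplified variant (an assumption, not the exact regex above) that relies on \\p{L} alone for the letters.

```java
// Hypothetical demo class; the simplified pattern is an assumption for illustration.
class LetterDigitSpaceDemo {

    // True if 'text' consists only of letters, digits and whitespace (or is empty).
    static boolean isAlphaNumericWithWhiteSpace(String text) {
        // String.matches requires the whole input to match, so ^ and $ are not needed.
        return text != null && text.matches("[\\p{L}\\p{N}\\s]*");
    }

    public static void main(String[] args) {
        System.out.println(isAlphaNumericWithWhiteSpace("Ahmet Yıldırım 42"));    // true
        System.out.println(isAlphaNumericWithWhiteSpace("yildirim@example.com")); // false
    }
}
```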

A GitHub gist for replacing Turkish characters: https://gist.github.com/onuryilmaz/6034569

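In Java, string.matches(".*[İÖÜŞÇĞıöüşçğ].*") checks whether the string contains Turkish characters. The character class must not be followed by *, otherwise the pattern also matches strings that contain no Turkish character at all.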
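
A sketch of such a check using Matcher.find(), which looks for at least one occurrence anywhere in the input and so needs no surrounding wildcards. The class and method names are hypothetical; the letters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class TurkishLetterCheckDemo {

    // Character class for the letters İ Ö Ü Ş Ç Ğ ı ö ü ş ç ğ, given as Unicode escapes.
    private static final Pattern TURKISH_LETTER = Pattern.compile(
            "[\u0130\u00D6\u00DC\u015E\u00C7\u011E\u0131\u00F6\u00FC\u015F\u00E7\u011F]");

    // True if 'text' contains at least one of the letters listed above.
    static boolean containsTurkishLetter(String text) {
        return text != null && TURKISH_LETTER.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTurkishLetter("Ahmet Yıldırım")); // true
        System.out.println(containsTurkishLetter("Ahmet Yildirim")); // false
    }
}
```

Note that several of these letters (ö, ü, ç) also occur in other languages, so a positive result only means the text contains one of these characters, not that it is Turkish.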
