How can I use Java regex with Turkish characters (UTF-8)?

I'm trying to do regex operations in Java, but I run into trouble when I search Turkish text. For example:

Search Text = "Ahmet Yıldırım" or "Esin AYDEMİR" 

// The regex string is the local part of an e-mail address (e.g. yildirim@example.com); I am trying to find it in the name.
Regex Strings = "yildirim" or "aydemir"

The searched text changes dynamically, so how can I solve this with a Java regex pattern? Alternatively, how do I convert the Turkish characters (e.g. AYDEMİR to AYDEMIR, or Yıldırım to Yildirim)?

Sorry about my grammar mistakes!

Use the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags:

Pattern p = Pattern.compile("yildirim", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

Demo on ideone

By default, Pattern.CASE_INSENSITIVE matches case-insensitively only for characters in the US-ASCII character set. Pattern.UNICODE_CASE modifies this behavior so that case-insensitive matching applies to all Unicode characters.

Do note that Unicode case-insensitive matching in Java regex is done in a culture-insensitive manner, i.e. it does not apply Turkish-specific casing rules. Therefore ı, i, I and İ are all treated as the same character.
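As a rough sketch of the flag combination described above, the following uses the sample names from the question. The class and the helper names `containsName` and `containsNameAsciiOnly` are made up for illustration, and the search string is assumed to contain no regex metacharacters.

```java
import java.util.regex.Pattern;

final class TurkishCaseInsensitiveDemo {

    // Hypothetical helper: true if `needle` occurs in `text`, ignoring case
    // for all Unicode characters (not only US-ASCII).
    static boolean containsName(String text, String needle) {
        Pattern p = Pattern.compile(needle,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return p.matcher(text).find();
    }

    // Same search with CASE_INSENSITIVE alone, for comparison:
    // only US-ASCII letters are matched case-insensitively.
    static boolean containsNameAsciiOnly(String text, String needle) {
        return Pattern.compile(needle, Pattern.CASE_INSENSITIVE)
                .matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsName("Ahmet Yıldırım", "yildirim"));
        System.out.println(containsName("Esin AYDEMİR", "aydemir"));
        System.out.println(containsNameAsciiOnly("Ahmet Yıldırım", "yildirim"));
    }
}
```

With both flags the dotless ı and the dotted İ are matched by a plain i in the pattern; with CASE_INSENSITIVE alone they are not.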

Depending on your use case, you might want to use Pattern.LITERAL to disable all metacharacters in the pattern, or escape only the literal parts of the pattern with Pattern.quote().
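Since the search text is dynamic, the two options above might look roughly like this. It is only a sketch: the class and method names are hypothetical, and the "start of a whitespace-delimited word" rule in the second method is an assumption chosen to show regex syntax wrapped around a quoted part.

```java
import java.util.regex.Pattern;

final class QuotedSearchDemo {

    // Option 1: the whole search string is literal text (Pattern.LITERAL).
    // CASE_INSENSITIVE and UNICODE_CASE still apply with this flag.
    static Pattern literalPattern(String userText) {
        return Pattern.compile(userText,
                Pattern.LITERAL | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Option 2: quote only the dynamic part and keep regex syntax around it.
    // Here the text must start at the beginning of the input or after whitespace.
    static Pattern startOfWordPattern(String userText) {
        return Pattern.compile("(?:^|\\s)" + Pattern.quote(userText),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    public static void main(String[] args) {
        // "." is literal here, so "axb" is not a match.
        System.out.println(literalPattern("a.b").matcher("axb").find());
        System.out.println(startOfWordPattern("yildirim").matcher("Ahmet Yıldırım").find());
    }
}
```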

The question in your comment is more complicated than the original one. 您评论中的问题比原始评论更复杂。

You can use java.text.Normalizer:

string=Normalizer.normalize(string, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");

to convert "İÖÜŞÇĞıöüşçğ" to "IOUSCGıouscg", which is already sufficient for a case-insensitive match as pointed out by nhahtdh. If you want to perform a case-sensitive match, you have to add a .replace('ı', 'i') so that ı matches i.
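Put together, the normalization step and the extra replacement for the dotless ı could be wrapped as below. This is a minimal sketch; the class and method names are invented for illustration.

```java
import java.text.Normalizer;

final class TurkishAsciiFolding {

    // Decompose accented letters (NFD), then drop the combining marks (\p{Mn}).
    static String stripDiacritics(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{Mn}", "");
    }

    // The dotless i (U+0131) has no decomposition, so map it to 'i' explicitly.
    static String foldTurkish(String s) {
        return stripDiacritics(s).replace('\u0131', 'i');
    }

    public static void main(String[] args) {
        System.out.println(foldTurkish("AYDEMİR"));
        System.out.println(foldTurkish("Yıldırım"));
    }
}
```

The folded strings can then be compared or searched with an ordinary ASCII pattern.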

I am using this pattern.

public static boolean isAlphaNumericWithWhiteSpace(String text) {
    return text != null && text.matches("^[\\p{L}\\p{N}ın\\s]*$");
}

\\p{L} matches a single code point in the category "letter".

\\p{N} matches any kind of numeric character in any script.
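A possible simplification of the check above, offered as a sketch rather than a drop-in replacement: String.matches() already requires the whole string to match, so the ^ and $ anchors can be dropped, and ı and n are letters, so \\p{L} already covers them. The class name is made up for illustration.

```java
final class AlphaNumericCheck {

    // True if `text` consists only of letters (any script), digits and whitespace.
    // An empty string passes because of the * quantifier.
    static boolean isAlphaNumericWithWhiteSpace(String text) {
        return text != null && text.matches("[\\p{L}\\p{N}\\s]*");
    }

    public static void main(String[] args) {
        System.out.println(isAlphaNumericWithWhiteSpace("Ahmet Yıldırım 42"));
        System.out.println(isAlphaNumericWithWhiteSpace("yildirim@example.com"));
    }
}
```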

A GitHub gist for replacing Turkish characters: https://gist.github.com/onuryilmaz/6034569

In Java, string.matches(".*[İÖÜŞÇĞıöüşçğ].*") will check whether the String contains Turkish characters. The character class must not be followed by *, otherwise the pattern also matches strings without any Turkish character.
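A containment check of this kind can also be written with Matcher.find(), which avoids the surrounding .* and is not affected by line breaks in the input. The sketch below uses invented class and method names. Note that Ö, Ü and Ç also occur in other languages, so a hit does not prove the text is Turkish.

```java
import java.util.regex.Pattern;

final class TurkishCharCheck {

    // Letters of the Turkish alphabet that are not plain ASCII.
    private static final Pattern TURKISH_SPECIFIC =
            Pattern.compile("[İÖÜŞÇĞıöüşçğ]");

    // True if at least one of those letters occurs anywhere in `s`.
    static boolean containsTurkishChars(String s) {
        return s != null && TURKISH_SPECIFIC.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTurkishChars("Yıldırım"));
        System.out.println(containsTurkishChars("Yildirim"));
    }
}
```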
