How can I use Java Regex for Turkish characters to UTF-8?

Question

I'm trying to do regex operations in Java, but I'm having trouble when I search in Turkish text. For example:

Search Text = "Ahmet Yıldırım" or "Esin AYDEMİR" 

// The regex string comes from part of an e-mail address (e.g. yildirim@example.com), and I am trying to find it in the name.
Regex Strings = "yildirim" or "aydemir"

The searched text changes dynamically, so how can I solve this with a Java regex pattern? Or how do I convert Turkish characters (e.g. AYDEMİR to AYDEMIR, or Yıldırım to Yildirim)?

Sorry about my grammar mistakes!

Answer 1

Use the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags:

Pattern p = Pattern.compile("yildirim", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

Demo on ideone

By default, Pattern.CASE_INSENSITIVE only matches case-insensitively for characters in the US-ASCII character set. Pattern.UNICODE_CASE modifies this behavior so that matching is case-insensitive for all Unicode characters.

Note that Unicode case-insensitive matching in Java regex is done in a culture-insensitive manner. Therefore ı, i, I and İ are all treated as the same character.
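A minimal sketch of the effect of the two flags on the names from the question. The class name UnicodeCaseDemo and the helper found are made up for illustration, and the Turkish letters are written as Unicode escapes only so that the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Illustrative class; the name and the helper are not from the original answer.
class UnicodeCaseDemo {
    // Dotless i is U+0131, capital dotted I is U+0130.
    static final String YILDIRIM = "Ahmet Y\u0131ld\u0131r\u0131m";
    static final String AYDEMIR = "Esin AYDEM\u0130R";

    // Returns true if the regex matches anywhere in the text under the given flags.
    static boolean found(String regex, String text, int flags) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        int asciiOnly = Pattern.CASE_INSENSITIVE;
        int unicode = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

        System.out.println(found("yildirim", YILDIRIM, unicode));   // true
        System.out.println(found("aydemir", AYDEMIR, unicode));     // true
        System.out.println(found("yildirim", YILDIRIM, asciiOnly)); // false: dotless i is outside US-ASCII
    }
}
```

Because the matching is culture-insensitive, the same pattern would also match a plain "YILDIRIM" or "Yildirim"; that is usually what is wanted when comparing a name against an ASCII e-mail address.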

Depending on your use case, you might want to use Pattern.LITERAL to disable all metacharacters in the pattern, or escape only the literal parts of the pattern with Pattern.quote().
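Since the search text in the question is dynamic, it may contain regex metacharacters such as the dot in an e-mail address. The following sketch shows both options side by side; the class and method names are hypothetical, and the case-insensitivity flags are carried over from the answer above.

```java
import java.util.regex.Pattern;

// Illustrative class; names are assumptions, not part of the original answer.
class LiteralSearchDemo {
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    // Option 1: escape the dynamic term with Pattern.quote() before compiling.
    static boolean containsQuoted(String text, String term) {
        return Pattern.compile(Pattern.quote(term), FLAGS).matcher(text).find();
    }

    // Option 2: compile the whole pattern as literal text with Pattern.LITERAL.
    static boolean containsLiteral(String text, String term) {
        return Pattern.compile(term, FLAGS | Pattern.LITERAL).matcher(text).find();
    }

    public static void main(String[] args) {
        // The dot is matched literally, so "axb" is not a hit for "a.b".
        System.out.println(containsQuoted("axb", "a.b"));
        System.out.println(containsLiteral("A.B", "a.b"));
    }
}
```

Pattern.quote() is the more flexible of the two, because the quoted term can still be combined with ordinary regex syntax around it.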

Answer 2

The question in your comment is more complicated than the original one.

You can use

string=Normalizer.normalize(string, Normalizer.Form.NFD).replaceAll("\\p{Mn}", "");

to convert "İÖÜŞÇĞıöüşçğ" to "IOUSCGıouscg", which is already sufficient for a case-insensitive match as pointed out by nhahtdh. If you want to perform a case-sensitive match, you have to add a .replace('ı', 'i') so that ı matches i.
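Put together, the conversion could look like the sketch below. The class and method names are invented for illustration; the dotless ı (U+0131) is written as a Unicode escape so the file does not depend on the source encoding.

```java
import java.text.Normalizer;

// Illustrative class; the names are assumptions, not part of the original answer.
class TurkishFoldDemo {
    // Decomposes to NFD, drops combining marks (category Mn),
    // then maps dotless i (U+0131), which has no decomposition, to a plain 'i'.
    static String stripTurkishDiacritics(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{Mn}", "")
                .replace('\u0131', 'i');
    }

    public static void main(String[] args) {
        System.out.println(stripTurkishDiacritics("Y\u0131ld\u0131r\u0131m")); // Yildirim
        System.out.println(stripTurkishDiacritics("AYDEM\u0130R"));            // AYDEMIR
    }
}
```

Note that this removes combining marks from every script, not only Turkish, so it is best applied to a copy of the text that is used for matching only.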

Answer 3

I am using this pattern.

public static boolean isAlphaNumericWithWhiteSpace(String text) {
    // Accepts only letters, digits and whitespace; null is rejected.
    return text != null && text.matches("^[\\p{L}\\p{N}ın\\s]*$");
}

\\p{L} matches a single code point in the category "letter".

\\p{N} matches any kind of numeric character in any script.
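A small usage sketch of this check, wrapped in a hypothetical class so it can be run on its own. Because \\p{L} already covers every letter, including the Turkish ones, the explicit ı and n inside the character class should not change the result; they are kept here (ı as a Unicode escape) only to mirror the method above.

```java
// Illustrative wrapper class; only the method body comes from the answer.
class AlphaNumericCheckDemo {
    public static boolean isAlphaNumericWithWhiteSpace(String text) {
        // Letters (any script), numeric characters, whitespace; U+0131 is dotless i.
        return text != null && text.matches("^[\\p{L}\\p{N}\u0131n\\s]*$");
    }

    public static void main(String[] args) {
        // A Turkish name followed by a number: letters, digits and spaces only.
        System.out.println(isAlphaNumericWithWhiteSpace("Ahmet Y\u0131ld\u0131r\u0131m 42"));
        // '@' and '.' are neither letters nor numbers, so this is rejected.
        System.out.println(isAlphaNumericWithWhiteSpace("yildirim@example.com"));
    }
}
```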

Answer 4

A GitHub gist for replacing Turkish characters: https://gist.github.com/onuryilmaz/6034569

In Java, string.matches(".*[İÖÜŞÇĞıöüşçğ].*") checks whether the string contains Turkish characters.
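A sketch of such a check using Matcher.find() on a precompiled character class, which avoids the leading and trailing .* and also works on multi-line input. The class and method names are hypothetical, and the twelve letters are spelled as Unicode escapes so the source file stays pure ASCII.

```java
import java.util.regex.Pattern;

// Illustrative class; the names are assumptions, not taken from the answer or the gist.
class TurkishLetterCheck {
    // Uppercase: dotted I, O/U with diaeresis, S/C with cedilla, G with breve;
    // then dotless i and the lowercase forms of the other five letters.
    private static final Pattern TURKISH = Pattern.compile(
            "[\u0130\u00D6\u00DC\u015E\u00C7\u011E\u0131\u00F6\u00FC\u015F\u00E7\u011F]");

    // Returns true if at least one Turkish-specific letter occurs in the string.
    static boolean containsTurkishLetter(String s) {
        return TURKISH.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTurkishLetter("Y\u0131ld\u0131r\u0131m")); // true
        System.out.println(containsTurkishLetter("Yildirim"));                // false
    }
}
```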

Answer 1: score 7, accepted, 2015-08-20 12:31:52
Answer 2: score 5, 2015-08-20 12:55:44
Answer 3: score 0, 2019-05-30 14:43:39
Answer 4: score -1, 2018-08-27 16:00:10
