I'm having the following problem with regex: I've written a program that reads words from some text (.txt) files and writes them into another file, one word per line.
Everything works fine, except when a word contains a special character such as ľščťžýáíé. The regex drops that character and splits the word where it was.
For example:
Input:
I am Jožo.
Output:
I
am
Jo
o
Here's a snippet of the code:
while ((line = br.readLine()) != null) {
    Pattern p = Pattern.compile("[\\w']+");
    Matcher m = p.matcher(line);
}
Instead of this regex:
Pattern.compile("[\\w']+")
use a Unicode-aware one:
Pattern.compile("[\\p{L}']+")
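This is because, by default, \w (written "\\w" in a Java string literal) matches only ASCII letters, the digits 0-9, and the underscore. A letter such as ž is outside that set, so the match ends just before it and a new match starts right after it.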
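As a minimal sketch of the \p{L} fix, the class below collects the matches from one line. The class name UnicodeWordExtractor and the method extractWords are made up for illustration; they are not from the original program. Note that, unlike \w, \p{L} covers letters only, so digits and underscores would no longer count as part of a word unless you add them to the character class yourself (for example with \p{N}).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class UnicodeWordExtractor {
    // Compiled once and reused; \p{L} matches any Unicode letter.
    private static final Pattern WORD = Pattern.compile("[\\p{L}']+");

    // Returns every run of letters/apostrophes found in the line.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // \u017E is the letter ž, escaped so the example does not
        // depend on the encoding of the source file.
        for (String word : extractWords("I am Jo\u017Eo.")) {
            System.out.println(word); // one word per line: I, am, Jožo
        }
    }
}
```

Compiling the pattern once, outside the read loop, also avoids recompiling it for every line.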
Another option is to keep the original regex and pass the flag Pattern.UNICODE_CHARACTER_CLASS, like this:
Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS)
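To see the difference the flag makes, here is a small comparison sketch that runs the same regex with and without it. The class name UnicodeFlagDemo, its two constants, and the findAll helper are assumptions made for this example. The flag can also be switched on inside the pattern itself with the embedded form (?U).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class UnicodeFlagDemo {
    // Default behaviour: \w is limited to [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("[\\w']+");

    // With the flag, \w follows the Unicode definition of a word character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS);

    // Collects all matches of the pattern in the given text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "I am Jo\u017Eo."; // \u017E is the letter ž

        System.out.println(findAll(ASCII_WORD, line));   // [I, am, Jo, o]
        System.out.println(findAll(UNICODE_WORD, line)); // [I, am, Jožo]
    }
}
```

Unlike the \p{L} variant, this keeps digits and the underscore as word characters, which may or may not be what you want.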
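\w matches only a-z, A-Z, 0-9, and the underscore (ASCII letters, digits, and _). If you want to accept any character other than whitespace as part of a word, use \S instead.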
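A quick sketch of the \S approach follows, with the hypothetical names NonWhitespaceTokens and tokens chosen for this example. It keeps accented letters intact, but it also keeps punctuation attached to the word, so the trailing period in the sample sentence stays part of the last token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class NonWhitespaceTokens {
    // \S matches any character that is not whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    // Returns every whitespace-delimited token in the line.
    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // \u017E is the letter ž
        System.out.println(tokens("I am Jo\u017Eo.")); // [I, am, Jožo.]
    }
}
```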