
Regex deleted special character

I'm having the following problem with regex: I've written a program that reads words from some text (txt) files and writes them to another file, one word per line.

Everything works fine, except when a word contains a special character such as ľ, š, č, ť, ž, ý, á, í or é. The regex drops the character and splits the word where the special character was.

For example:
Input:

I am Jožo.

Output:

I
am
Jo
o

Here's a snippet of the code:

while( (line = br.readLine())!= null ){ 
  Pattern p = Pattern.compile("[\\w']+");
  Matcher m = p.matcher(line);
}

Instead of this regex:

Pattern.compile("[\\w']+")

Use a Unicode-aware one:

Pattern.compile("[\\p{L}']+")

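This is because, by default, \w in Java matches only ASCII letters, the digits 0-9 and the underscore, i.e. [a-zA-Z_0-9]. A character such as ž is outside that set, so it is treated as a separator. \p{L} matches any Unicode letter.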
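To make the difference concrete, here is a minimal sketch that runs both patterns over the sentence from the question. The class name WordSplitDemo and the helper words are made up for this illustration and are not part of the original code; the input uses the escape \u017e for ž so the result does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original post.
public class WordSplitDemo {
    // Default \w: ASCII letters, digits and underscore only.
    static final Pattern ASCII_WORD = Pattern.compile("[\\w']+");
    // \p{L}: any Unicode letter, so accented letters stay inside the word.
    static final Pattern UNICODE_WORD = Pattern.compile("[\\p{L}']+");

    // Collects every match of the pattern in the given line.
    static List<String> words(Pattern pattern, String line) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "I am Jo\u017eo."; // "I am Jožo."
        System.out.println(words(ASCII_WORD, line));   // [I, am, Jo, o]
        System.out.println(words(UNICODE_WORD, line)); // Jožo stays in one piece
    }
}
```

Compiling the patterns once, outside the read loop, also avoids recompiling them for every line. Note that \p{L} covers letters only: if the input contains decomposed characters (a base letter followed by a combining mark), the class would probably need \p{M} as well.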

Another option is to use the modifier

Pattern.UNICODE_CHARACTER_CLASS

Like this:

Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS)
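As a rough sketch of how the flag behaves, the example below compares it with the letters-only class. The class name UnicodeFlagDemo, the helper words and the sample sentence are assumptions made for this illustration. The flag can also be switched on inside the pattern with the embedded form (?U); both need Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original post.
public class UnicodeFlagDemo {
    // \w widened to Unicode letters, digits and connector punctuation.
    static final Pattern FLAGGED =
            Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS);
    // Same flag, enabled inline with the embedded expression (?U).
    static final Pattern EMBEDDED = Pattern.compile("(?U)[\\w']+");
    // Letters and apostrophes only, for comparison.
    static final Pattern LETTERS_ONLY = Pattern.compile("[\\p{L}']+");

    // Collects every match of the pattern in the given line.
    static List<String> words(Pattern pattern, String line) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Jo\u017eo has 2 cats"; // "Jožo has 2 cats"
        System.out.println(words(FLAGGED, line));      // four tokens, including "2"
        System.out.println(words(EMBEDDED, line));     // same result as FLAGGED
        System.out.println(words(LETTERS_ONLY, line)); // three tokens, "2" is skipped
    }
}
```

The practical difference is that \w with the flag still accepts digits and underscores, while [\p{L}']+ does not, so the choice depends on whether tokens like "2" should count as words.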

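\w only matches a-z, A-Z, 0-9 and the underscore (English letters plus digits). If you want to accept any character other than whitespace as part of a word, use \S.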
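A small sketch of the \S approach, using a made-up class name NonWhitespaceDemo and helper words, shows the trade-off: the accented letter is kept, but so is any punctuation attached to the word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original post.
public class NonWhitespaceDemo {
    // \S+ matches any run of non-whitespace characters.
    static final Pattern NON_SPACE = Pattern.compile("\\S+");

    // Collects every match of the pattern in the given line.
    static List<String> words(Pattern pattern, String line) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "I am Jo\u017eo."; // "I am Jožo."
        // Three tokens; the last one keeps the trailing full stop.
        System.out.println(words(NON_SPACE, line));
    }
}
```

If the punctuation is unwanted, it has to be stripped in a separate step, which is why the letter-based patterns are usually a better fit for extracting words.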

