Remove all special characters from file line except white space

I have extracted text from some PDF files using Tika and stored it in text files. Now I want to parse these files with the OpenNLP chunk parser, but I am unable to parse the lines because they contain special characters (square-shaped symbols) with no space between words. Sample lines from my text file (I am unable to show the square-shaped and diacritic symbols here):

51.2.3  Troubleshooting DHCP Configuration  ?
62  Module 3: Point-to-Point Protocol (PPP) ?
62.1    Configuring HDLC Encapsulation  ?

So I want to get the lines as:

Troubleshooting DHCP Configuration
Module 3: Point-to-Point Protocol (PPP)
Configuring HDLC Encapsulation

Please suggest how I can do this.

  1. Read the file line by line.
  2. Replace the unwanted characters in each of these lines with "": line = line.replaceAll("^\\d{2}(\\.\\d)* +", "").replaceAll(" +\\?$", "");
  3. Write the file using FileWriter.

This assumes that the number at the beginning of each line has the format dd(.d)*, where d is one digit: two digits first, then zero or more sections that each consist of a dot and a single digit. Otherwise the regex has to be changed to fit your format.
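The three steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class and method names (LineCleaner, clean, cleanFile) are made up for illustration, the input is assumed to be UTF-8, and Files.newBufferedWriter is used in place of FileWriter only so that the output encoding can be stated explicitly.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; the names are illustrative only.
class LineCleaner {

    // Removes a leading "dd(.d)*" number plus the spaces after it,
    // and a trailing "?" together with the spaces before it.
    static String clean(String line) {
        return line.replaceAll("^\\d{2}(\\.\\d)* +", "")
                   .replaceAll(" +\\?$", "");
    }

    // Reads 'in' line by line (assumed UTF-8) and writes the cleaned lines to 'out'.
    static void cleanFile(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(clean(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Input and output paths are taken from the command line.
        cleanFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

With the sample lines from the question, clean("62.1    Configuring HDLC Encapsulation  ?") yields "Configuring HDLC Encapsulation".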

Remove the cryptic symbols by appending .replaceAll("[æ╚]", "") and adding all of these characters inside the square brackets. Make sure you use the right encoding: if you read the file as "UTF-8", you have to copy these characters in an editor where you can specify that the file is "UTF-8".
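As a rough illustration of the symbol removal, the sketch below shows two variants. The first lists the unwanted characters explicitly, written as Unicode escapes (\u00E6 for æ, \u255A for ╚) so that the result does not depend on the encoding of the Java source file. The second is an alternative that is not part of the answer above: it drops everything outside printable ASCII, which only makes sense under the assumption that the text is plain English, since it would also delete legitimate accented letters. The class and method names are hypothetical.

```java
// Hypothetical helper class; the names are illustrative only.
class SymbolStripper {

    // Variant 1: remove only the characters listed in the brackets (here: æ and ╚).
    static String stripListed(String line) {
        return line.replaceAll("[\u00E6\u255A]", "");
    }

    // Variant 2 (assumption: English-only text): remove every character
    // that is not a space or a printable ASCII character.
    static String stripNonAscii(String line) {
        return line.replaceAll("[^\\x20-\\x7E]", "");
    }
}
```

Either call can be chained after the two replaceAll calls from step 2.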

Would replacing all non-word characters with a space be enough, or at least a step in the right direction?

str = str.replaceAll("\\W+", " ");
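To see what this replacement does to the sample lines, here is a small sketch (the class and method names are hypothetical). By default, Java's \W matches every character outside [a-zA-Z_0-9], so non-ASCII symbols are replaced, but so are the dots, hyphens, colons and parentheses, and the leading numbers are kept.

```java
// Hypothetical demo class; the names are illustrative only.
class NonWordDemo {

    // Replaces each run of non-word characters with a single space.
    static String collapse(String str) {
        return str.replaceAll("\\W+", " ");
    }

    public static void main(String[] args) {
        // Prints "62 1 Configuring HDLC Encapsulation " (note the trailing space).
        System.out.println(collapse("62.1    Configuring HDLC Encapsulation  ?"));
        // Prints "Point to Point Protocol PPP ": hyphens and parentheses are lost.
        System.out.println(collapse("Point-to-Point Protocol (PPP)"));
    }
}
```

So this is a step toward removing the symbols, but on its own it does not produce the lines asked for in the question.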
