
Java: remove punctuation from a String (including ’ “ ” and similar characters) while keeping accented characters

I need to remove punctuation from text read from a file while keeping accented characters. I tried the code below, but it does not work the way I want.

Expected result: input=> ’'qwe..,rty ‘èeéò’“ ”o" "à   output=> qwertyèeéòoà

Actual result: input=> ’'qwe..,rty ‘èeéò’“ ”o" "à   output=> ’qwerty ‘èeéò’“ ”o" "à

I can't remove the ’ “ ” symbols and others like them.

Note: Eclipse and filetext.txt are both set to UTF-8.

Thank you

import java.io.*;
import java.util.Scanner;

public class DataCounterMain {
    public static void main(String[] args) {
        File file = new File("filetext.txt");

        // try-with-resources closes the Scanner once reading is done
        try (Scanner filescanner = new Scanner(file)) {
            while (filescanner.hasNextLine()) {
                String line = filescanner.nextLine();
                line = line.replaceAll("\\p{Punct}", "");

                System.out.println(line);
            }
        } catch (FileNotFoundException e) {
            System.err.println(file + " FileNotFound");
        }
    }
}

The regex \\p{Punct} only matches US-ASCII punctuation by default, unless you enable Unicode character classes. That means that your code, as written, would only remove these characters:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
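
As a rough illustration of the "unless you enable Unicode character classes" part, here is a minimal sketch that compiles the same \\p{Punct} regex twice, once with default settings and once with the Pattern.UNICODE_CHARACTER_CLASS flag. The class name PunctFlagDemo and the two helper methods are made up for this example, and the non-ASCII input is written with \u escapes so that the source file encoding does not matter.

```java
import java.util.regex.Pattern;

public class PunctFlagDemo {
    // Default behaviour: the POSIX class only covers US-ASCII punctuation.
    static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}");

    // Same regex, but with Unicode character classes switched on.
    static final Pattern UNICODE_PUNCT =
            Pattern.compile("\\p{Punct}", Pattern.UNICODE_CHARACTER_CLASS);

    public static String stripAscii(String s) {
        return ASCII_PUNCT.matcher(s).replaceAll("");
    }

    public static String stripUnicode(String s) {
        return UNICODE_PUNCT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // \u2019 is the right single quotation mark (’)
        String input = "\u2019'qwe..,rty";
        System.out.println(stripAscii(input));   // the ’ survives
        System.out.println(stripUnicode(input)); // the ’ is removed as well
    }
}
```

The flag can also be written inline as (?U) at the start of the regex, which is handy when the pattern is passed straight to String.replaceAll.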

If you want to match everything the Unicode Consortium classifies as punctuation, try \\p{IsPunctuation} instead, which always checks Unicode character properties and matches all the punctuation in your example (and more!).
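
To make that concrete, here is a small sketch that runs \\p{IsPunctuation} over part of the sample input from the question. The class and method names (UnicodePunctuationDemo, removePunctuation) are only placeholders for this example. One caveat worth knowing: characters such as $ and + fall under the Unicode Symbol categories rather than Punctuation, so this property leaves them alone even though the ASCII-only \\p{Punct} removes them. If they should go too, adding \\p{S} to the pattern is one option.

```java
public class UnicodePunctuationDemo {
    // Removes every character with the Unicode Punctuation property.
    public static String removePunctuation(String s) {
        return s.replaceAll("\\p{IsPunctuation}", "");
    }

    public static void main(String[] args) {
        // ‘èeéò’“ ”o" "à written with \u escapes for the non-ASCII characters
        String input = "\u2018\u00e8e\u00e9\u00f2\u2019\u201c \u201do\" \"\u00e0";

        // Curly and straight quotes are removed; accented letters and spaces stay.
        System.out.println(removePunctuation(input));

        // $ and + are Symbols, not Punctuation, so they are kept.
        System.out.println(removePunctuation("a$b+c"));
    }
}
```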

To remove whitespace as well as punctuation, as in your example, you would use:

    line = line.replaceAll("\\p{IsPunctuation}|\\p{IsWhite_Space}", "");
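
Put together with the file-reading loop from the question, the whole thing might look roughly like the sketch below. It is not the only way to do it, and it makes a few assumptions: the class name PunctuationStripper and the clean helper are invented here, the file is read with Files.lines instead of Scanner, and the charset is passed explicitly as UTF-8 rather than relying on the platform default.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class PunctuationStripper {
    // Compiled once: any Unicode punctuation or whitespace character.
    private static final Pattern PUNCT_OR_SPACE =
            Pattern.compile("\\p{IsPunctuation}|\\p{IsWhite_Space}");

    // Returns the line with all punctuation and whitespace removed.
    public static String clean(String line) {
        return PUNCT_OR_SPACE.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        // File name from the question, unless another path is given as an argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "filetext.txt");

        // Decode the file as UTF-8 explicitly; the stream is closed automatically.
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            lines.map(PunctuationStripper::clean).forEach(System.out::println);
        } catch (IOException e) {
            System.err.println(path + " could not be read: " + e.getMessage());
        }
    }
}
```

Calling clean on the sample input from the question, ’'qwe..,rty ‘èeéò’“ ”o" "à, gives qwertyèeéòoà, which is the expected output. If the console still shows garbled accents, the output encoding of the terminal or IDE is the next thing to check.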

