Remove punctuation, preserve letters and white space - Java Regex

Tonight I'm attempting to parse words from a file, and I'd like to remove all punctuation while preserving lowercase and uppercase letters as well as white space.

String alpha = word.replaceAll("[^a-zA-Z]", "");

This removes every character that is not a letter, including white space.

For a text file containing "Testing, testing, 1, one, 2, two, 3, three.", the output is TESTINGTESTINGONETWOTHREE. However, when I change it to

String alpha = word.replaceAll("[^a-zA-Z\\s]", "");

The output does not change.

Here's this code snippet in its entirety:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class UpperCaseScanner {

    public static void main(String[] args) throws FileNotFoundException {

        //First, define the filepath the program will look for. 
        String filename = "file.txt";   //Filename
        String targetFile = "";         
        String workingDir = System.getProperty("user.dir");

        targetFile = workingDir + File.separator + filename;   //Full filepath.

        //System.out.println(targetFile); //Debug code, prints the filepath. 

        Scanner fileScan = new Scanner(new File(targetFile)); 

        while(fileScan.hasNext()){
            String word = fileScan.next();
            //Remove every character that is not a letter or white space.
            String alpha = word.replaceAll("[^a-zA-Z\\s]", "");
            System.out.print(alpha.toUpperCase());
        }

        fileScan.close();

    }
}

file.txt has one line, reading "Testing, testing, 1, one, 2, two, 3, three." My goal is for the output to read "Testing Testing One Two Three". Am I doing something wrong in the regular expression, or is there something else I need to do? If it's relevant, I'm working in 32-bit Eclipse 2.0.2.2.

System.out.println(str.replaceAll("\\p{P}", ""));         //Removes Unicode punctuation only
System.out.println(str.replaceAll("[^a-zA-Z]", ""));      //Removes everything except ASCII letters (white space, special characters, digits)
System.out.println(str.replaceAll("[^a-zA-Z\\s]", ""));   //Removes special characters and digits, keeps white space
System.out.println(str.replaceAll("\\s+", ""));           //Removes white space only
System.out.println(str.replaceAll("\\p{Punct}", ""));     //Removes ASCII punctuation only
System.out.println(str.replaceAll("\\W", ""));            //Removes white space and special characters, keeps digits and underscores
System.out.println(str.replaceAll("\\p{Punct}+", ""));    //Same result as \\p{Punct}, matching runs of punctuation at once
System.out.println(str.replaceAll("\\p{Punct}|\\d", "")); //Removes ASCII punctuation and digits
I was able to get the output you were looking for using this. I wasn't sure whether you needed runs of multiple spaces collapsed into a single space, so I added a second replaceAll call that does that.

public class RemovePunctuation {
    public static void main(String[] args) {
        String input = "Testing, testing, 1, one, 2, two, 3, three.";
        String alpha = input.replaceAll("[^a-zA-Z\\s]", "").replaceAll("\\s+", " ");
        System.out.println(alpha);
    }
}

This outputs:

Testing testing one two three

If you wanted the first character of each word capitalized (like you showed in your question) then you could do this:

public class Foo {
    public static void main(String[] args) {
        String input = "Testing, testing, 1, one, 2, two, 3, three.";
        String alpha = input.replaceAll("[^a-zA-Z\\s]", "").replaceAll("\\s+", " ");
        System.out.println(alpha);

        StringBuilder upperCaseWords = new StringBuilder();
        String[] words = alpha.split("\\s");

        for(String word : words) {
            String upperCase = Character.toUpperCase(word.charAt(0)) + word.substring(1) + " ";
            upperCaseWords.append(upperCase);
        }
        System.out.println(upperCaseWords.toString());
    }
}

Which outputs:

Testing testing one two three
Testing Testing One Two Three
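One likely reason the loop in the question never prints spaces, whichever regex is used, is that Scanner.next() splits its input on white space by default, so each token arrives with the spaces already stripped. A minimal sketch of keeping the Scanner loop and re-inserting the separators by hand follows. The class name ScannerWordsDemo and the cleanWords helper are hypothetical, and the Scanner reads from a String here instead of file.txt so the example is self-contained.

```java
import java.util.Scanner;
import java.util.StringJoiner;

class ScannerWordsDemo {
    // Hypothetical helper: reads whitespace-delimited tokens, keeps only the
    // ASCII letters of each one, capitalizes the first letter, and joins the
    // surviving words with single spaces.
    static String cleanWords(Scanner scanner) {
        StringJoiner joined = new StringJoiner(" ");
        while (scanner.hasNext()) {
            // next() never returns white space, so only letters need to be kept.
            String letters = scanner.next().replaceAll("[^a-zA-Z]", "");
            if (letters.isEmpty()) {
                continue; // token was only digits or punctuation, e.g. "1,"
            }
            joined.add(Character.toUpperCase(letters.charAt(0)) + letters.substring(1));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        Scanner scanner = new Scanner("Testing, testing, 1, one, 2, two, 3, three.");
        System.out.println(cleanWords(scanner)); // Testing Testing One Two Three
        scanner.close();
    }
}
```

Swapping the String for new File(targetFile) should give the same result for the file described in the question, assuming it contains that single line.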

I think Java supports

\p{Punct}

which matches any ASCII punctuation character, so passing it to replaceAll with an empty replacement removes all of them.
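As a rough sketch of how this could be applied to the question's input: \p{Punct} on its own leaves the digits in place, so the example below puts it in a character class together with \d and then collapses the leftover spaces. The class name PunctClassDemo and the lettersAndSpaces helper are invented for illustration. The last two lines show that \p{Punct} covers ASCII characters only by default, while \p{P} is the Unicode punctuation category.

```java
class PunctClassDemo {
    // Hypothetical helper: drops ASCII punctuation and digits, then collapses
    // runs of white space into single spaces and trims the ends.
    static String lettersAndSpaces(String input) {
        return input.replaceAll("[\\p{Punct}\\d]", "")
                    .replaceAll("\\s+", " ")
                    .trim();
    }

    public static void main(String[] args) {
        // prints: Testing testing one two three
        System.out.println(lettersAndSpaces("Testing, testing, 1, one, 2, two, 3, three."));

        // The sample below contains '+' and the Unicode dash U+2014.
        String sample = "a+b\u2014c";
        // \p{Punct} is ASCII-only by default: '+' is removed, the U+2014 dash is kept.
        System.out.println(sample.replaceAll("\\p{Punct}", ""));
        // \p{P} is Unicode punctuation: the dash is removed, '+' (a math symbol) is kept.
        System.out.println(sample.replaceAll("\\p{P}", ""));
    }
}
```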
