
How can I remove all Non-Alphabetic characters from a String using Regex in Java

I want to remove all non-alphabetic characters from a String.

Input :

"-Hello, 1 world$!"

Output:

"Helloworld"

But instead I'm getting: "Hello1world"

How can I fix it?

My code:

public class LabProgram {
    public static String removeNonAlpha (String userString) {
    String[] stringArray = userString.split("\\W+");
        String result = new String();
        
        for(int i = 0; i < stringArray.length;i++){
            result = result+ stringArray[i];
        }
        
        return result;
    }
    
    public static void main(String args[]) {
        Scanner scnr = new Scanner(System.in);
        String str = scnr.nextLine();
        String result = removeNonAlpha(str);
        System.out.println(result);
    }
}

This should work:

import java.util.Scanner;

public class LabProgram {
    public static String removeNonAlpha (String userString) {
        // Equivalent alternative with an explicit ASCII range (upper- and lowercase A to Z):
        //return userString.replaceAll("[^A-Za-z]+", "");
        return userString.replaceAll("[^\\p{Alpha}]+", "");
    }
    
    public static void main(String args[]) {
        Scanner scnr = new Scanner(System.in);
        String str = scnr.nextLine();
        String result = removeNonAlpha(str);
        System.out.println(result);
    }
}

Take a look at replaceAll(), which expects a regular expression as the first argument and a replacement string as the second:

return userString.replaceAll("[^\\p{Alpha}]", "");

For more information on regular expressions, take a look at this tutorial.

You could use:

 public static String removeNonAlpha (String userString) {
    return userString.replaceAll("[^a-zA-Z]+",  "");
}

The issue is that \W treats more than just letters as word characters: digits and the underscore also count, so split("\\W+") never splits on them and they survive in the result. Replacing the pattern fixes the issue:

String[] stringArray = userString.split("\\P{Alpha}+");

Per the Pattern Javadocs, \W matches any non-word character, where a word character is defined by \w as [a-zA-Z_0-9]. This means that \w matches the upper and lowercase ASCII letters A - Z and a - z, the digits 0 - 9, and the underscore character ("_"), and \W matches everything else.
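To make that concrete, here is a small sketch of what split does with the input from the question. The class name SplitDemo and the helper splitAndJoin are made up for illustration; only String.split, String.join and Arrays.toString come from the JDK.

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: splits on the regex and glues the pieces back together.
    static String splitAndJoin(String input, String regex) {
        return String.join("", input.split(regex));
    }

    public static void main(String[] args) {
        String input = "-Hello, 1 world$!";

        // "1" is a word character, so \W+ never matches it and it stays as its own piece.
        // The leading "-" produces an empty first element.
        System.out.println(Arrays.toString(input.split("\\W+"))); // [, Hello, 1, world]

        System.out.println(splitAndJoin(input, "\\W+"));        // Hello1world
        System.out.println(splitAndJoin(input, "[^a-zA-Z]+"));  // Helloworld
    }
}
```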

The solution is to use a regex pattern that matches exactly the characters you want removed. Per the Pattern documentation, you could use [^a-zA-Z] or \P{Alpha} to match everything except the 26 ASCII letters in upper and lower case. If you want to keep letters beyond those 26 (e.g. letters in non-Latin alphabets), you could use \P{IsAlphabetic}.

\p{prop} matches if the input has the property prop, while \P{prop} matches if the input does not have that property.
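As a rough illustration of the difference between the two properties, the sketch below runs both patterns over a string that contains non-ASCII letters. The class and method names are invented for this example, and it assumes the default regex flags (without UNICODE_CHARACTER_CLASS, \p{Alpha} is ASCII-only). The non-ASCII letters are written as Unicode escapes so the source does not depend on file encoding.

```java
public class PropertyDemo {
    // Removes everything that is not an ASCII letter (default behaviour of \p{Alpha}).
    static String keepAsciiLetters(String s) {
        return s.replaceAll("\\P{Alpha}+", "");
    }

    // Removes everything that does not have the Unicode Alphabetic property.
    static String keepUnicodeLetters(String s) {
        return s.replaceAll("\\P{IsAlphabetic}+", "");
    }

    public static void main(String[] args) {
        // \u00F1 is n with tilde, \u03A9 is the Greek capital omega.
        String input = "A\u00F1b1 \u03A9c_d!";

        System.out.println(keepAsciiLetters(input));   // Abcd
        System.out.println(keepUnicodeLetters(input)); // A, n-tilde, b, omega, c, d
    }
}
```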

As other answers have pointed out, there are other issues with your code that make it non-idiomatic, but those aren't affecting the correctness of your solution.
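The answer does not say which issues it has in mind. Assuming it means things like new String() and string concatenation inside a loop, a more conventional version might look like the sketch below. The class name LabProgramSketch and the constant NON_ALPHA are hypothetical; the precompiled Pattern is just one reasonable choice, not something the original code requires.

```java
import java.util.regex.Pattern;

public class LabProgramSketch {
    // Compiled once and reused; matches runs of anything that is not an ASCII letter.
    private static final Pattern NON_ALPHA = Pattern.compile("[^a-zA-Z]+");

    // Regex replacement, no intermediate array needed.
    static String removeNonAlpha(String userString) {
        return NON_ALPHA.matcher(userString).replaceAll("");
    }

    // Same result while keeping the split approach; StringBuilder avoids
    // creating a new String on every loop iteration.
    static String removeNonAlphaWithSplit(String userString) {
        StringBuilder result = new StringBuilder();
        for (String part : NON_ALPHA.split(userString)) {
            result.append(part);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeNonAlpha("-Hello, 1 world$!"));          // Helloworld
        System.out.println(removeNonAlphaWithSplit("-Hello, 1 world$!")); // Helloworld
    }
}
```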

\W is equivalent to [^a-zA-Z_0-9], so it does not match digits (or the underscore), and splitting on it leaves them in the result.

Just replace it with "[^a-zA-Z]+", as in the example below:

import java.util.Arrays;

class Scratch {
    public static void main(String[] args) {
        String input = "-Hello, 1    world$!";
        System.out.println("Input : " + input);
        String[] split = input.split("[^a-zA-Z]+");
        StringBuilder builder = new StringBuilder();
        Arrays.stream(split).forEach(builder::append);
        System.out.println("Output : " + builder);
    }
}

Output:

Input : -Hello, 1    world$!
Output : Helloworld

You can have a look at this article for more details about regular expressions: https://www.vogella.com/tutorials/JavaRegularExpressions/article.html#meta-characters
