
How to remove only punctuation but leave accented letters?

I am trying to remove only the punctuation from my text data but leave the accented letters. I do not want to replace the accented letters with English equivalents. I cannot figure out how to adapt my existing code to allow for characters outside the ASCII range.

    while (input.hasNext()){
        String phrase = input.nextLine();
        String[] words = phrase.split(" ");
        for(String word: words){
              String strippedInput = word.replaceAll("[^0-9a-zA-Z\\s]", ""); 
        }
     }

If original input is: O sal, ou o sódio, também é contraindicado em pacientes hipotensos?

Expected output should be: O sal ou o sódio também é contraindicado em pacientes hipotensos

Any ideas? Thanks!

Consider using Unicode categories, as "A-Z" is very English-centric and, as you discovered, does not cope with accented letters.

For example, the following would remove everything, including punctuation, except "any letter, any language" (\\p{L}) or "whitespace" (\\s). If you want to keep digits, add them back as an additional exclusion (for example 0-9 or \\p{N}).

replaceAll("[^\\p{L}\\s]", "")

Here is an ideone demo.
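
As a quick sketch of that character class applied to the sample sentence from the question: the class name UnicodeLetterStrip and the method stripPunctuation are made up for illustration, and \\p{N} is included on the assumption that digits should be kept.

```java
class UnicodeLetterStrip {
    // Keeps letters (\p{L}), digits (\p{N}) and whitespace (\s); removes everything else.
    static String stripPunctuation(String text) {
        return text.replaceAll("[^\\p{L}\\p{N}\\s]", "");
    }

    public static void main(String[] args) {
        // Sample sentence from the question. Accented letters are written as
        // Unicode escapes so the file compiles regardless of source encoding.
        String input = "O sal, ou o s\u00f3dio, tamb\u00e9m \u00e9 contraindicado em pacientes hipotensos?";
        System.out.println(stripPunctuation(input));
    }
}
```

Running the regex over the whole line, rather than splitting into words first, is enough here because whitespace is preserved by the character class.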

Try this.

public class PunctuationRemove {

    // Characters treated as punctuation; extend as needed.
    private static final char[] PUNC = "',.;!?(){}[]<>%".toCharArray();

    public static void main(String[] args) {
        String s = "Hello!, how are you?";
        System.out.println(removePunctuation(s));
    }

    public static String removePunctuation(String s) {
        // Local builder, so repeated calls do not accumulate earlier output.
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < s.length(); i++) {
            char strChar = s.charAt(i);
            boolean keep = true;
            for (char badChar : PUNC) {
                if (badChar == strChar) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                sb.append(strChar);
            }
        }
        return sb.toString();
    }
}

Replace a-zA-Z in the regex string with \\p{L} (any kind of letter from any language):

while (input.hasNext()){
    String phrase = input.nextLine();
    String[] words = phrase.split(" ");
    for(String word: words){
          String strippedInput = word.replaceAll("[^0-9\\p{L}\\s]", ""); 
    }
 }
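
One caveat, offered tentatively: \\p{L} matches precomposed letters such as é (U+00E9), but if the input arrives in decomposed form, the accent is a separate combining mark (category \\p{M}) and the pattern above would strip it. A minimal sketch, assuming the text may be decomposed, normalizes to NFC first with java.text.Normalizer and also keeps \\p{M}; the class and method names are illustrative only.

```java
import java.text.Normalizer;

class AccentSafeStrip {
    static String strip(String text) {
        // Compose base letter + combining accent into a single code point where possible.
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        // Keep digits, letters, any remaining combining marks, and whitespace.
        return composed.replaceAll("[^0-9\\p{L}\\p{M}\\s]", "");
    }

    public static void main(String[] args) {
        // "o" and "e" followed by U+0301 (combining acute accent): decomposed input.
        String decomposed = "so\u0301dio, tambe\u0301m!";
        // Compare against the precomposed spelling rather than printing it.
        System.out.println(strip(decomposed).equals("s\u00f3dio tamb\u00e9m")); // prints true
    }
}
```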

Maybe I'm missing the point, but something like...

String text = "O sal, ou o sódio, também é contraindicado em pacientes hipotensos?";
System.out.println(text);
System.out.println(text.replaceAll("[\\?,.:!\\(\\){}\\[\\]<>%]", ""));

Outputs

O sal, ou o sódio, também é contraindicado em pacientes hipotensos?
O sal ou o sódio também é contraindicado em pacientes hipotensos

Or, based on your example...

while (input.hasNext()){
    String phrase = input.nextLine();
    String[] words = phrase.split(" ");
    for(String word: words){
          String strippedInput = word.replaceAll("[\\?,.:!\\(\\){}\\[\\]<>%]", ""); 
    }
 }
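
Rather than enumerating punctuation by hand, one variation is the Unicode punctuation category \\p{P}. This is only a sketch, under the assumption that just the characters classified as punctuation should go: symbols such as <, >, + or $ belong to the symbol categories (\\p{S}) and stay in place unless you add them explicitly. The class name is hypothetical.

```java
class PunctuationCategoryStrip {
    static String strip(String text) {
        // \p{P} matches any Unicode punctuation character (comma, period, question
        // mark, brackets, quotes, ...). Symbols (\p{S}) are deliberately not included.
        return text.replaceAll("\\p{P}", "");
    }

    public static void main(String[] args) {
        // Accented letters are written as Unicode escapes so the file compiles
        // regardless of source encoding.
        System.out.println(strip("O sal, ou o s\u00f3dio, tamb\u00e9m \u00e9 contraindicado?"));
        // '<' is a math symbol, not punctuation, so it survives.
        System.out.println(strip("a<b, c!"));
    }
}
```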

It may be inefficient, and I'm sure the idea can be improved upon, but you could create a method that loops through the string, building a buffer of each character that is not punctuation.

private String replacePunctuation(String s){
    String output = "";

    for(int i = 0; i < s.length(); i++){
        char c = s.charAt(i);
        // Add other punctuation values you're concerned about. Perhaps a regex would be useful here, but I am not as familiar with it as I would like.
        if(c != '.' && c != ',' && c != '!'){
            output += c;
        }
    }
    return output;
}

Again, probably not the cleanest or most efficient, but it's the best I can come up with at the moment.
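
For the loop-based idea, a sketch that avoids listing punctuation at all could lean on Character.isLetterOrDigit and Character.isWhitespace, which are Unicode-aware, and collect the result in a StringBuilder instead of concatenating strings. The names below are illustrative, and it assumes that everything other than letters, digits and whitespace should be dropped.

```java
class LoopStrip {
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        // Walk by code point so characters outside the BMP are handled as a unit.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetterOrDigit(cp) || Character.isWhitespace(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The accented letter is written as a Unicode escape so the file
        // compiles regardless of source encoding.
        System.out.println(strip("Ol\u00e1, mundo! 42?"));
    }
}
```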

