
Java: Regex to identify punctuation marks in a sentence and delete them

I have the following string:

String input = "Remove from em?ty sentence 1? Remove from sentence 2! But not from ip address 190.168.10.110!";

I want to remove punctuation marks at the right places. My output needs to be:

String str = "Remove from em?ty sentence 1 Remove from sentence 2 But not from ip address 190.168.10.110";

I am using the following code:

while (stream.hasNext()) { 
    token = stream.next();
    char[] tokenArray = token.toCharArray();
    token = token.trim();

    if(token.matches(".*?[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}[\\.\\?!]+")){
        System.out.println("case2");
        stream.previous();
        int len = token.length()-1;
        for(int i = token.length()-1; i>7; i--){
            if(tokenArray[i]=='.'||tokenArray[i]=='?'||tokenArray[i]=='!'){
                --len;
            }
            else
                break;
        }
        stream.set(token.substring(0, len+1));
    }
    else if(token.matches(".*?\\b[a-zA-Z_0-9]+\\b[\\.\\?!]+")){
        System.out.println("case1");
        stream.previous();
        str = token.replaceAll("[\\.\\?!]+", "");
        stream.set(str);

        System.out.println(stream.next());                          
    }
}

The tokens come from the 'input' string. Can you please point out what I am doing wrong in the regex or the logic?

A character counts as punctuation to remove only when it ends a sentence, that is, when it is followed by a space or the end of the string. Punctuation inside an IP address or inside words such as !true or emp?ty should be left alone.

You can use this pattern:

\\p{Punct}(?=\\s|$)

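and replace it with nothing. The lookahead (?=\\s|$) restricts the match to a punctuation character that is followed by whitespace or the end of the string, so punctuation inside words and IP addresses is left alone.

Example: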

String subject = "Remove from em?ty sentence 1? Remove from sentence 2! But not from ip address 190.168.10.110!";
String regex = "\\p{Punct}(?=\\s|$)";
String result = subject.replaceAll(regex, "");
System.out.println(result);
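
For reuse, the same pattern can be wrapped in a small helper. This is a minimal sketch rather than part of the answer above: the class name PunctuationStripper and the method stripSentencePunctuation are hypothetical, and the + quantifier is an added assumption so that a run such as "?!" before a space is removed as a whole.

```java
import java.util.regex.Pattern;

public class PunctuationStripper {
    // One or more ASCII punctuation characters followed by whitespace or the
    // end of the input. The lookahead does not consume the whitespace.
    // The "+" is an assumption added here; the answer's pattern has no quantifier.
    private static final Pattern TRAILING_PUNCT =
            Pattern.compile("\\p{Punct}+(?=\\s|$)");

    public static String stripSentencePunctuation(String text) {
        return TRAILING_PUNCT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "Remove from em?ty sentence 1? Remove from sentence 2! "
                + "But not from ip address 190.168.10.110!";
        System.out.println(stripSentencePunctuation(input));
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which String.replaceAll would otherwise do.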

String input = "Remove from em?ty sentence 1? Remove from sentence 2! But not from ip address 190.168.10.110!";
System.out.println(input.replaceAll("[?!]", ""));

This gave the output:

Remove from emty sentence 1 Remove from sentence 2 But not from ip address 190.168.10.110

Note that this also removes the ? inside em?ty, which the question wants to keep.

Why not use:

string.replaceAll("[?!] ", " ");

I would do it the other way around.

// Remove a punctuation mark that is directly followed by a space, keeping the space.
token = token.replaceAll("[.!:?;] ", " ");

Here I am assuming that the punctuation marks have a trailing space. This leaves only the final punctuation mark at the end of the string, which you can remove separately.
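
The two-step idea above can be sketched as follows. This is only an illustration under stated assumptions: the class name TwoStepStrip and the method strip are hypothetical, and the character set [.!:?;] is taken from the snippet above.

```java
public class TwoStepStrip {
    public static String strip(String text) {
        // Step 1: drop a punctuation mark that is directly followed by a space,
        // keeping the space so that neighbouring words do not run together.
        String withoutInner = text.replaceAll("[.!:?;] ", " ");
        // Step 2: drop a run of punctuation at the very end of the string.
        return withoutInner.replaceAll("[.!:?;]+$", "");
    }

    public static void main(String[] args) {
        String input = "Remove from em?ty sentence 1? Remove from sentence 2! "
                + "But not from ip address 190.168.10.110!";
        System.out.println(strip(input));
    }
}
```

Dots inside the IP address and the ? inside em?ty are untouched because neither is followed by a space or sits at the end of the string.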

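Something like this might work. It first matches everything you want to keep (words, including any punctuation inside them) into group 1, and otherwise matches the punctuation that is significant to you: [,.!?]

Just replace each match with $1.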

    # ([^\pL\pN\s]*[\pL\pN](?:[\pL\pN_-]|\pP(?=[\pL\pN\pP_-]))*)|[,.!?]
    # "([^\\pL\\pN\\s]*[\\pL\\pN](?:[\\pL\\pN_-]|\\pP(?=[\\pL\\pN\\pP_-]))*)|[,.!?]"

    (                              # (1 start)
         [^\pL\pN\s]* [\pL\pN] 
         (?:
              [\pL\pN_-] 
           |  \pP 
              (?= [\pL\pN\pP_-] )
         )*
    )                              # (1 end)
 |  
    [,.!?] 
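
A minimal Java sketch of how this pattern might be applied, assuming the escaped string form shown above. The class name KeepWordsDemo and the method clean are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

public class KeepWordsDemo {
    // Group 1 captures a "word" that may contain inner punctuation;
    // the second alternative matches a leftover , . ! or ?
    private static final Pattern WORD_OR_PUNCT = Pattern.compile(
            "([^\\pL\\pN\\s]*[\\pL\\pN](?:[\\pL\\pN_-]|\\pP(?=[\\pL\\pN\\pP_-]))*)|[,.!?]");

    public static String clean(String text) {
        // "$1" re-inserts the captured word. When only the second alternative
        // matched, group 1 did not participate and nothing is appended for it.
        return WORD_OR_PUNCT.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "Remove from em?ty sentence 1? Remove from sentence 2! "
                + "But not from ip address 190.168.10.110!";
        System.out.println(clean(input));
    }
}
```

Because the lookahead also accepts a following punctuation character, a doubled mark such as "!!" before a space may leave one "!" behind, so the character classes may need adjusting for such input.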
