
Java: Regex to identify punctuation in a sentence and delete it

I have the following string:

String input = "Remove from em?ty sentence 1? Remove from sentence 2! But not from ip address 190.168.10.110!";

I want to remove punctuation marks only where they end a sentence. My output needs to be:

String str = "Remove from em?ty sentence 1 Remove from sentence 2 But not from ip address 190.168.10.110";

I am using the following code:

while (stream.hasNext()) { 
    token = stream.next();
    char[] tokenArray = token.toCharArray();
    token = token.trim();

    if(token.matches(".*?[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}[\\.\\?!]+")){
        System.out.println("case2");
        stream.previous();
        int len = token.length()-1;
        for(int i = token.length()-1; i>7; i--){
            if(tokenArray[i]=='.'||tokenArray[i]=='?'||tokenArray[i]=='!'){
                --len;
            }
            else
                break;
        }
        stream.set(token.substring(0, len+1));
    }
    else if(token.matches(".*?\\b[a-zA-Z_0-9]+\\b[\\.\\?!]+")){
        System.out.println("case1");
        stream.previous();
        str = token.replaceAll("[\\.\\?!]+", "");
        stream.set(str);

        System.out.println(stream.next());                          
    }
}

The tokens come from the 'input' string. Can you please point out what I am doing wrong in the regex or the logic?

A punctuation mark should be removed only when it ends a sentence, that is, when it is followed by a space or by the end of the string. Marks inside an IP address or inside words such as !true or emp?ty must be left alone.

You can use this pattern:

\\p{Punct}(?=\\s|$)

and replace each match with the empty string.

Example:

String subject = "Remove from em?ty sentence 1? Remove from sentence 2! But not from ip address 190.168.10.110!";
String regex = "\\p{Punct}(?=\\s|$)";
String result = subject.replaceAll(regex, "");
System.out.println(result);
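
Alternatively, remove every ? and ! regardless of position:

String input = "Remove from em?ty sentence 1? Remove from sentence 2! But not from ip address 190.168.10.110!";
System.out.println(input.replaceAll("[?!]", ""));

This gave the output (note that it also strips the ? inside em?ty):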

Remove from emty sentence 1 Remove from sentence 2 But not from ip address 190.168.10.110
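
To compare the two suggestions above, here is a small self-contained sketch that runs both on the question's input. The class and method names are made up for illustration, and the `+` variant is an assumption that is not part of either answer: it only shows how a run of marks such as "?!" could be stripped in one go.

```java
import java.util.regex.Pattern;

class TrailingPunctuationDemo {
    // Pattern from the answer: one punctuation character followed by whitespace or end of input.
    private static final Pattern SINGLE = Pattern.compile("\\p{Punct}(?=\\s|$)");
    // Assumed variant (not in the original answers): a whole run of punctuation before whitespace/end.
    private static final Pattern RUN = Pattern.compile("\\p{Punct}+(?=\\s|$)");

    static String stripSingle(String s) {
        return SINGLE.matcher(s).replaceAll("");
    }

    static String stripRun(String s) {
        return RUN.matcher(s).replaceAll("");
    }

    // The character-class suggestion: removes every ? and ! wherever it occurs.
    static String stripEverywhere(String s) {
        return s.replaceAll("[?!]", "");
    }

    public static void main(String[] args) {
        String input = "Remove from em?ty sentence 1? Remove from sentence 2! "
                + "But not from ip address 190.168.10.110!";
        System.out.println(stripSingle(input));     // keeps em?ty and the IP address
        System.out.println(stripEverywhere(input)); // turns em?ty into emty
        System.out.println(stripRun("Really?! Yes."));
    }
}
```

The lookahead `(?=\\s|$)` is what protects em?ty and the dots in the IP address: a mark is only removed when whitespace or the end of the input follows it.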

Why not use the following, which replaces the mark and its trailing space with a single space:

string.replaceAll("[?!] ", " ");

I would do it the other way around.

// Remove a punctuation mark only when a space follows it (the lookahead keeps the space).
token = token.replaceAll("[.!:?;](?= )", "");

Here I am assuming that the punctuation marks to remove have a trailing space. That leaves only the last punctuation mark in the string untouched, which you can remove separately.
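
The two steps described above could be sketched as follows. This is only an illustration under the answer's assumption that sentence-ending marks are followed by a space; the class and method names are hypothetical.

```java
class TwoStepPunctuationDemo {
    static String strip(String sentence) {
        // Step 1: drop a mark when a space follows it; the lookahead leaves the space in place.
        String withoutInner = sentence.replaceAll("[.!:?;](?= )", "");
        // Step 2: drop whatever marks remain at the very end of the string.
        return withoutInner.replaceAll("[.!:?;]+$", "");
    }

    public static void main(String[] args) {
        String input = "Remove from em?ty sentence 1? Remove from sentence 2! "
                + "But not from ip address 190.168.10.110!";
        System.out.println(strip(input));
    }
}
```

Because neither step matches a mark that is followed by a letter or digit, em?ty and the dots of the IP address are not touched.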

Something like this might work. It first matches (and captures) everything that should be kept, then matches whatever punctuation is significant to you, here [,.!?].

Just replace each match with $1.

    # ([^\pL\pN\s]*[\pL\pN](?:[\pL\pN_-]|\pP(?=[\pL\pN\pP_-]))*)|[,.!?]
    # "([^\\pL\\pN\\s]*[\\pL\\pN](?:[\\pL\\pN_-]|\\pP(?=[\\pL\\pN\\pP_-]))*)|[,.!?]"

    (                              # (1 start)
         [^\pL\pN\s]* [\pL\pN] 
         (?:
              [\pL\pN_-] 
           |  \pP 
              (?= [\pL\pN\pP_-] )
         )*
    )                              # (1 end)
 |  
    [,.!?] 
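
A minimal sketch of how this pattern could be applied from Java, using the string form shown above. The class and method names are hypothetical. It relies on `Matcher.replaceAll` substituting nothing for `$1` when group 1 did not take part in the match, which is how the stray marks disappear.

```java
import java.util.regex.Pattern;

class KeepTokensDemo {
    // Group 1 captures a "word" that may carry embedded punctuation (em?ty, 190.168.10.110, !true);
    // the second alternative matches a leftover , . ! or ? that is not part of such a word.
    private static final Pattern TOKEN_OR_MARK = Pattern.compile(
            "([^\\pL\\pN\\s]*[\\pL\\pN](?:[\\pL\\pN_-]|\\pP(?=[\\pL\\pN\\pP_-]))*)|[,.!?]");

    static String strip(String s) {
        // Words are replaced by themselves; leftover marks are replaced by an empty group 1.
        return TOKEN_OR_MARK.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "Remove from em?ty sentence 1? Remove from sentence 2! "
                + "But not from ip address 190.168.10.110!";
        System.out.println(strip(input));
        System.out.println(strip("!true, emp?ty."));
    }
}
```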
