Help with regular expressions

I have a small piece of code that takes an input string, cleans it up (removes special characters such as ', ’ and ., and replaces every other character that is not a letter, digit, hyphen or space with a space), and then generates a new string.

public class Example
{
    public static void main(String... args)
    {
        charFilter("I.T rocks. It's time to get a job.Come on");

    }

    public static String charFilter(String inText) { 

        String outText="";

        inText = inText.replaceAll("['’\\.]", "");
        outText = inText.replaceAll("[^a-zA-Z0-9- ]", " ");
        System.out.println(outText);
        return outText;
    }

}

The output of the above code is "IT rocks Its time to get a jobCome on". But I need the output to be "IT rocks Its time to get a job Come on" ("job" and "Come" should appear as separate words, but "IT" should stay as "IT"), because users entering the data may forget to add a space after a full stop.

Can someone suggest what approach I should follow?

You're removing the . in the first regular expression, so it is no longer there to be replaced by a space in the second regex.
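To illustrate the point, here is a minimal sketch of the question's charFilter with the dot dropped from the first character class, so that the second pass turns it into a space. The class name DotToSpaceSketch is made up for this example, and the output has not been normalised in any way.

```java
class DotToSpaceSketch {
    // Variant of the question's charFilter: the '.' is NOT removed in pass 1.
    static String charFilter(String inText) {
        // Pass 1: delete straight and curly apostrophes only (\u2019 is the ’ character).
        String stripped = inText.replaceAll("['\u2019]", "");
        // Pass 2: every character that is not a letter, digit, hyphen or space,
        // including the surviving '.', becomes a single space.
        return stripped.replaceAll("[^a-zA-Z0-9 -]", " ");
    }

    public static void main(String[] args) {
        System.out.println(charFilter("I.T rocks. It's time to get a job.Come on"));
        // prints: I T rocks  Its time to get a job Come on
    }
}
```

This separates "job" and "Come", but it also splits "I.T" into "I T" and leaves a double space after "rocks", so on its own it does not give the output asked for.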

You will need to use information about the semantics, which is why AI is more complicated than regex. Without additional information, a simple regex cannot distinguish between what humans consider an abbreviation and the end or start of a sentence.

One possible suggestion, though not a 100% solution, would be to look for single characters followed or separated by a dot. While I can imagine sentences that end on a single character with the next one starting with one, it could be a valid solution for many cases. Maybe you can come up with similar workarounds for other special characters, using some knowledge of the input language or subject domain (if any).
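The single-character idea could be sketched roughly as follows. This is one possible reading of the suggestion, not a definitive implementation: the class name AbbreviationSketch, the exact lookbehind pattern, and the final whitespace clean-up are assumptions added for the example.

```java
class AbbreviationSketch {
    static String charFilter(String inText) {
        // Heuristic (assumption): a dot directly after a lone letter, i.e. a letter
        // not preceded by another letter or digit, belongs to an abbreviation.
        // Delete that dot and keep the letter.
        String text = inText.replaceAll("(?<![A-Za-z0-9])([A-Za-z])\\.", "$1");
        // Delete straight and curly apostrophes, as in the question (\u2019 is ’).
        text = text.replaceAll("['\u2019]", "");
        // Every remaining character that is not a letter, digit, hyphen or space,
        // including sentence-ending dots, becomes a space.
        text = text.replaceAll("[^a-zA-Z0-9 -]", " ");
        // Added clean-up (not in the question): collapse runs of spaces and trim.
        return text.replaceAll(" +", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(charFilter("I.T rocks. It's time to get a job.Come on"));
        // prints: IT rocks Its time to get a job Come on
    }
}
```

As the answer warns, the heuristic fails when a sentence really does end on a single letter: with this sketch, "Plan B.Next step" comes out as "Plan BNext step".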

A completely generic solution would be to have a human re-read the text and correct the errors by hand. A regex or other automated substitution will not come close to 100% for all possible text input.
