
Help with regular expressions

I have a small piece of code which takes an input string, cleans it up (removes special characters such as ', ’ and ., and replaces any other character that is not a letter, digit, hyphen or space with a space), and then generates a new string.

public class Example
{
    public static void main(String... args)
    {
        charFilter("I.T rocks. It's time to get a job.Come on");

    }

    public static String charFilter(String inText) { 

        String outText="";

        inText = inText.replaceAll("['’\\.]", "");
        outText = inText.replaceAll("[^a-zA-Z0-9- ]", " ");
        System.out.println(outText);
        return outText;
    }

}

The output of the above code is "IT rocks Its time to get a jobCome on". But I need the output to be "IT rocks Its time to get a job Come on" ("job" and "Come" should appear as separate words, but "I.T" should still become "IT"), because we can expect users entering the data to forget the space after a full stop.

Can someone suggest what approach I should follow?

You're removing the . in the first regular expression, so it is no longer there to be replaced by a space in the second regex.
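To make that concrete, here is a minimal sketch of what happens if the period is simply left out of the first character class so that the second replacement turns it into a space. The class name `PeriodAsSpace` is made up for this illustration; the rest follows the code in the question.

```java
class PeriodAsSpace {
    // Same as the question's charFilter, except the first regex no longer
    // deletes periods, so the second regex replaces them with spaces.
    static String charFilter(String inText) {
        // Delete apostrophes only (the period is intentionally not listed here).
        String noApostrophes = inText.replaceAll("['’]", "");
        // Replace everything except letters, digits, spaces and hyphens with a space.
        return noApostrophes.replaceAll("[^a-zA-Z0-9 -]", " ");
    }

    public static void main(String[] args) {
        System.out.println(charFilter("I.T rocks. It's time to get a job.Come on"));
    }
}
```

This does separate "job" and "Come", but it also splits "I.T" into "I T" and leaves a double space after "rocks", so on its own it is not enough.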

You will need information about the semantics of the text, which is why AI is more complicated than regex. Without additional information, a simple regex cannot distinguish between what humans consider an abbreviation and the end of one sentence followed by the start of the next.

One possible suggestion, though not a 100% solution, is to look for single characters followed or separated by a dot and treat those as abbreviations. I can imagine sentences that end with a single character while the next one starts with one, but it could still be a valid solution for many cases. Maybe you can come up with similar workarounds for other special characters, using some knowledge of the input language or subject domain (if any).
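A rough sketch of that heuristic, assuming an "abbreviation" means single ASCII letters separated by dots (such as "I.T" or "U.S.A"). The class name `AbbreviationFilter`, the `ABBREVIATION` pattern and the final whitespace-collapsing step are assumptions added for this example, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AbbreviationFilter {
    // Assumed definition of an abbreviation: single letters separated by dots,
    // e.g. "I.T" or "U.S.A". A trailing dot is deliberately left unmatched so
    // that it is turned into a space later on.
    private static final Pattern ABBREVIATION =
            Pattern.compile("\\b[A-Za-z](?:\\.[A-Za-z])+\\b");

    static String charFilter(String inText) {
        // Step 1: drop the dots inside abbreviation-like runs ("I.T" -> "IT").
        Matcher m = ABBREVIATION.matcher(inText);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String collapsed = m.group().replace(".", "");
            m.appendReplacement(sb, Matcher.quoteReplacement(collapsed));
        }
        m.appendTail(sb);

        // Step 2: delete apostrophes, as in the original code.
        String text = sb.toString().replaceAll("['’]", "");
        // Step 3: every remaining disallowed character, including '.', becomes a space.
        text = text.replaceAll("[^a-zA-Z0-9 -]", " ");
        // Step 4: collapse runs of whitespace left behind by step 3.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Prints: IT rocks Its time to get a job Come on
        System.out.println(charFilter("I.T rocks. It's time to get a job.Come on"));
    }
}
```

The limitation described above still applies: in a text like "See plan B.A new idea", the run "B.A" matches the pattern and is collapsed to "BA", even though a human would read it as the end of one sentence and the start of the next.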

A completely generic solution would be to have a human re-read the text and correct the errors by hand. A regex or other automated substitution will not come close to 100% for all possible text input.
