
Remove stopwords from a string in Java

I have a string with a lot of words that I need to count.

But I want to ignore some words that carry no significance in this context.

So, I have a file with all the words I will ignore. I open this file and create a list I call

ArrayList<String> stopWordsList;

Now I have the string and need to clean it, eliminating the stopWords from the list.

I've tried like this:

String example = "Job in a software factory. Work with Agile, Spring, Hibernate, GWT, etc.";

for(String stopWord : stopWordsList){
    example = example.replaceAll(" "+ stopWord + " ", " ");
}

After this, string example should be:

"Job software factory. Work Agile, Spring, Hibernate, GWT, ."

The problem is that "etc." was not removed, because of the dot after the word.

Then I tried:

for(String stopWord : stopWordsList){
    example = example.replaceAll(" "+ stopWord + " ", " ");    
    example = example.replaceAll(" "+ stopWord + ",", ",");     
    example = example.replaceAll(" "+ stopWord + ".", ".");
}

But this is not right either; it does not do what I need.

Can anybody help me find a way to clean this string, including stop words that come right before punctuation or whitespace?

PS: I cannot just do

 example = example.replaceAll(stopWord, " ");   

because this can break some words like "initial". It will remove "in" and leave me "itial".

The easiest way could be to split the String along word boundaries and add back everything but stop words.

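Set<String> stopWordsSet = new HashSet<>(stopWordsList); // set gives fast contains() lookups
StringBuilder result = new StringBuilder(example.length());
// split("\\b") yields alternating word and non-word tokens
for (String s : example.split("\\b")) {
    if (!stopWordsSet.contains(s)) result.append(s);
}
example = result.toString();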
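
As a self-contained sketch of the split-on-word-boundaries idea above: the class name StopWordSplitter and the hard-coded stop word set are hypothetical, chosen only for illustration, and the matching is case-sensitive as in the snippet above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class StopWordSplitter {
    // Rebuilds the text from its word and non-word tokens, skipping stop words.
    static String removeStopWords(String text, Set<String> stopWords) {
        StringBuilder result = new StringBuilder(text.length());
        // "\\b" splits at every transition between word and non-word characters,
        // so punctuation and spaces survive as their own tokens.
        for (String token : text.split("\\b")) {
            if (!stopWords.contains(token)) {
                result.append(token);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stop word set; in the question it is loaded from a file.
        Set<String> stopWords = new HashSet<>(Arrays.asList("in", "a", "with", "etc"));
        String example = "Job in a software factory. Work with Agile, Spring, Hibernate, GWT, etc.";
        System.out.println(removeStopWords(example, stopWords));
    }
}
```

Because only whole tokens are compared, "initial" is left alone even though "in" is a stop word. The spaces that surrounded a removed word stay in the output, so runs of spaces may need collapsing afterwards.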

It looks like you just want to replace the word when it has non-word characters on both sides. It's pretty straightforward to just have both a lookahead and a lookbehind for this.

There's a potential issue with leftover double spaces, a comma directly after a period, and things along those lines. It doesn't sound like that is relevant to your application, and if it is, there's some ambiguity in how you would want to resolve it.

Something along the lines of this should work:

example = example.replaceAll("(?<![a-zA-Z])" + stopWord + "(?![a-zA-Z])", "");

Where (?<![a-zA-Z]) is a negative lookbehind that rejects a match preceded by a letter, and (?![a-zA-Z]) is the forward-looking equivalent, a negative lookahead that rejects a match followed by a letter. Spaces, punctuation, and the start or end of the string are all accepted as neighbours.

Hope that helps, let me know if you have any more questions, or if this is non-ideal for your application.

This will not remove punctuation. Since those are lookaheads and lookbehinds they don't actually match the punctuation in question.
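
A minimal sketch of the lookaround approach as a complete class, under a few assumptions that are not in the answer itself: the class name StopWordRegex is hypothetical, all stop words are joined into one alternation so the text is scanned once, and each word goes through Pattern.quote so that regex metacharacters in a stop word are matched literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StopWordRegex {
    // Removes every stop word that has no ASCII letter directly before or after it.
    static String removeStopWords(String text, List<String> stopWords) {
        if (stopWords.isEmpty()) {
            return text;
        }
        // Quote each word so characters like '.' or '+' are not treated as regex syntax.
        String alternation = stopWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern pattern = Pattern.compile(
                "(?<![a-zA-Z])(?:" + alternation + ")(?![a-zA-Z])");
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("in", "a", "with", "etc");
        String example = "Job in a software factory. Work with Agile, Spring, Hibernate, GWT, etc.";
        String cleaned = removeStopWords(example, stopWords);
        // Optional tidy-up: collapse the runs of spaces left behind by removed words.
        System.out.println(cleaned.replaceAll(" {2,}", " "));
    }
}
```

The lookarounds consume nothing, so the punctuation next to a removed word is kept, which is why the final "." survives after "etc" is dropped.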

If you want this to work with accented characters as well, you can replace the ASCII-only [a-zA-Z] class with the Unicode letter property \p{L}. (Java's regex engine does not support the POSIX bracket form [:alpha:], and its \p{Alpha} class is ASCII-only by default.)

example = example.replaceAll("(?<!\\p{L})" + stopWord + "(?!\\p{L})", "");
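
To see why the choice of letter class matters for accented text, here is a small comparison sketch. The class and method names are hypothetical, and the accented character is written as a Unicode escape so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

class UnicodeStopWords {
    // ASCII-only neighbours: an accented letter does NOT count as a letter here.
    static String removeAscii(String text, String stopWord) {
        return text.replaceAll(
                "(?<![a-zA-Z])" + Pattern.quote(stopWord) + "(?![a-zA-Z])", "");
    }

    // Unicode-aware neighbours: \p{L} matches any letter in any script.
    static String removeUnicode(String text, String stopWord) {
        return text.replaceAll(
                "(?<!\\p{L})" + Pattern.quote(stopWord) + "(?!\\p{L})", "");
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 caf"; // "café caf"
        // The ASCII version also strips "caf" out of "café", leaving a stray "é".
        System.out.println(removeAscii(text, "caf"));
        // The Unicode version only removes the standalone "caf".
        System.out.println(removeUnicode(text, "caf"));
    }
}
```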

I created a small utility library that removes stop words and stemmer words from a given text. It is available in the Maven repository and on GitHub:

exude library
