Remove stopwords from a string in Java

I have a string with a lot of words that I need to count.

But I want to skip some words that carry no meaning in the context.

So, I have a file with all the words I will ignore. I open this file and create a list I call

ArrayList<String> stopWordsList;

Now I have the string and need to clean it by eliminating the stop words in the list.

I've tried it like this:

String example = "Job in a software factory. Work with Agile, Spring, Hibernate, GWT, etc.";

for(String stopWord : stopWordsList){
    example = example.replaceAll(" "+ stopWord + " ", " ");
}

After this, the string example should be:

"Job software factory. Work Agile, Spring, Hibernate, GWT, ."

The problem is that "etc." was not removed, because of the dot after the word.

Then I tried:

for(String stopWord : stopWordsList){
    example = example.replaceAll(" "+ stopWord + " ", " ");    
    example = example.replaceAll(" "+ stopWord + ",", ",");     
    example = example.replaceAll(" "+ stopWord + ".", ".");
}

But this is not right; it does not do what I need.

Can anybody help me find a way to clean this string, including words that come before punctuation or whitespace?

PS: I cannot just do

 example = example.replaceAll(stopWord, " ");   

because this can break words like "initial": it would remove "in" and leave me with "itial".

The easiest way could be to split the String along word boundaries and add back everything but the stop words.

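// stopWordsSet is assumed to be a Set<String> built from the list,
// e.g. new HashSet<>(stopWordsList), so that contains() is a fast lookup.
StringBuilder result = new StringBuilder(example.length());
// Split the input string (not the builder) on word boundaries.
for (String s : example.split("\\b")) {
    if (!stopWordsSet.contains(s)) result.append(s);
}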
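For reference, here is a minimal, self-contained sketch of the split-on-word-boundaries idea. The class name `SplitStopWordFilter`, the method `removeStopWords`, and the four sample stop words are made up for illustration, and the lower-casing step is an added assumption (that the stop word file is all lower case), not something stated in the question.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SplitStopWordFilter {

    // Rebuilds the text from its word-boundary tokens, skipping stop words.
    static String removeStopWords(String text, Set<String> stopWords) {
        StringBuilder result = new StringBuilder(text.length());
        for (String token : text.split("\\b")) {
            // Each token is either a word or a run of separators
            // (spaces, punctuation); separators never match a stop word.
            // Assumption: stop words are stored in lower case.
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                result.append(token);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stop word list; the real one comes from the file.
        Set<String> stopWords = new HashSet<>(Arrays.asList("in", "a", "with", "etc"));
        String example = "Job in a software factory. Work with Agile, Spring, Hibernate, GWT, etc.";
        System.out.println(removeStopWords(example, stopWords));
    }
}
```

Because the separators around a removed word are kept, the result contains runs of spaces where words used to be. If that matters for counting, collapsing whitespace afterwards (for example with `replaceAll("\\s+", " ")`) is one option.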

It looks like you just want to replace the word when it has non-word characters on both sides. It's pretty straightforward to use both a lookahead and a lookbehind for this.

There's a potential issue with things like double spaces, commas after periods, and things along those lines, but it doesn't sound like that is relevant to your application, and if it is, there's some ambiguity in how you would resolve it.

Something along these lines should work:

example = example.replaceAll("(?<![a-zA-Z])" + stopWord + "(?![a-zA-Z])", "");

Where (?<![a-zA-Z]) is a negative lookbehind that rejects a match preceded by a letter, and (?![a-zA-Z]) is the negative lookahead equivalent for the character that follows. A stop word at the very start or end of the string still matches, because there is no letter next to it.
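As a rough sketch of how the lookaround approach could be applied to the whole list at once, the example below joins the stop words into one alternation and compiles a single pattern instead of calling replaceAll once per word. The class and method names are hypothetical, and `Pattern.quote` is added on the assumption that a stop word might contain regex metacharacters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class RegexStopWordFilter {

    // Builds one pattern that matches any stop word that is neither
    // preceded nor followed by an ASCII letter.
    static Pattern buildPattern(List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote) // treat each stop word as literal text
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?<![a-zA-Z])(?:" + alternation + ")(?![a-zA-Z])");
    }

    static String removeStopWords(String text, List<String> stopWords) {
        if (stopWords.isEmpty()) {
            return text; // an empty alternation would match everywhere
        }
        return buildPattern(stopWords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical stop word list; the real one comes from the file.
        List<String> stopWords = Arrays.asList("in", "a", "with", "etc");
        String example = "Job in a software factory. Work with Agile, Spring, Hibernate, GWT, etc.";
        System.out.println(removeStopWords(example, stopWords));
    }
}
```

Words such as "initial" are left alone because the "in" inside them is followed by a letter, so the lookahead fails.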

Hope that helps. Let me know if you have any more questions, or if this is not ideal for your application.

This will not remove punctuation. Since those are lookaheads and lookbehinds, they don't actually match the punctuation in question.

If you want this to work with accented characters as well, note that Java's regex engine does not support the POSIX bracket syntax [:alpha:]. You can replace the a-zA-Z ranges with the Unicode letter class \\p{L} instead.

example = example.replaceAll("(?<!\\p{L})" + stopWord + "(?!\\p{L})", "");
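To illustrate why the letter class matters, here is a small sketch using the Portuguese word "ação" (written with Unicode escapes in the code to avoid source-encoding problems). With ASCII-only lookarounds, the stop word "a" at the start of that word would be removed, because the next character "ç" is not in a-zA-Z. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class UnicodeStopWordFilter {

    // Removes stopWord only where it is not adjacent to any Unicode letter.
    static String removeStopWord(String text, String stopWord) {
        String regex = "(?<!\\p{L})" + Pattern.quote(stopWord) + "(?!\\p{L})";
        return text.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // The escapes spell a word that begins with a plain 'a' followed
        // by an accented letter (c-cedilla), then a standalone "a".
        String text = "a\u00e7\u00e3o a tempo";
        // Only the standalone "a" is removed; the first word stays intact.
        System.out.println(removeStopWord(text, "a"));
    }
}
```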

I created a small utility library that removes stop words and stems words in a given text. It is available in the Maven repository and on GitHub:

exude library
