
How to remove stopwords from a sentence?

I have a list of stopwords, and I want to remove every word on that list from a sentence. I'm currently using a regex. The sentence also has to be converted to lower case to meet my requirements.

However, the problem is that stopwords still exist in the sentence.

// List of stopwords
List<String> stopwords = new ArrayList<>();
stopwords.add("is");
stopwords.add("a");
// the stopword list goes on ....

// Sentence
String sentence = "autism    autism is a neurodevelopmental";

// Remove stop words in the sentence
String stopwordsRegex = stopwords.stream().collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));
String removedSW = sentence.toLowerCase().replaceAll(stopwordsRegex, "");

System.out.println(removedSW);
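
// Case-insensitive variant: (?i) makes the pattern match stopwords in any case,
// so the sentence itself does not have to be lower-cased first
String stopwordsRegex = stopwords.stream()
        .map(String::toLowerCase)
        .collect(Collectors.joining("|", "(?i)\\b(", ")\\b\\s?"));
String removedSW = sentence.replaceAll(stopwordsRegex, "");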

Everything is fine; the only change is that (?i) makes the match case-insensitive, so the sentence can keep its original upper case. This matters when the sentence contains an upper-case stop word such as "I". The snippet also shows how to lower-case the words in the stream, although that step is not necessary once (?i) is present.
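
For reference, here is a self-contained sketch of the case-insensitive regex approach described above. The class name `StopwordRegexDemo`, the method names, and the sample sentences are illustrative assumptions and do not come from the original post. It also wraps each stopword in `Pattern.quote`, an addition of mine, so that a stopword containing regex metacharacters is matched literally. It assumes Java 9+ for `List.of`.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StopwordRegexDemo {

    // Builds one case-insensitive pattern that matches any stopword as a whole
    // word, followed by at most one whitespace character.
    static Pattern buildPattern(List<String> stopwords) {
        String regex = stopwords.stream()
                .map(Pattern::quote) // assumption: stopwords are literals, not regexes
                .collect(Collectors.joining("|", "\\b(?:", ")\\b\\s?"));
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Removes every stopword occurrence; the remaining text keeps its case.
    static String removeStopwords(String sentence, List<String> stopwords) {
        if (stopwords.isEmpty()) {
            return sentence; // an empty alternation would match at every word boundary
        }
        return buildPattern(stopwords).matcher(sentence).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopwords = List.of("is", "a", "i");
        System.out.println(removeStopwords("Autism is a neurodevelopmental condition", stopwords));
        System.out.println(removeStopwords("I think autism IS a spectrum", stopwords));
    }
}
```

If the lower-case requirement from the question still applies, calling `toLowerCase()` on the result should be enough; the pattern itself does not need it.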

This works as well:

for (String stopword : stopwords) {
    sentence = sentence.replaceAll("\\b" + stopword + "\\b", "");
}
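
Note that the loop above is case-sensitive and leaves the surrounding spaces in place, so removing a word produces a run of blanks. A different technique, not taken from either answer, is to split the sentence on whitespace and filter the tokens against a `Set`. The sketch below assumes that words are separated only by whitespace, so a token with attached punctuation such as "is," would not be removed. The class and method names are hypothetical, and `List.of` assumes Java 9+.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordTokenFilter {

    // Lower-cases the sentence, drops stopword tokens, and joins the remaining
    // tokens with single spaces (which also collapses repeated whitespace).
    static String removeStopwords(String sentence, List<String> stopwords) {
        Set<String> stopSet = stopwords.stream()
                .map(w -> w.toLowerCase(Locale.ROOT))
                .collect(Collectors.toCollection(HashSet::new));
        return Arrays.stream(sentence.trim().split("\\s+"))
                .map(t -> t.toLowerCase(Locale.ROOT))
                .filter(t -> !t.isEmpty() && !stopSet.contains(t))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<String> stopwords = List.of("is", "a");
        // prints: autism autism neurodevelopmental
        System.out.println(removeStopwords("autism    autism is a neurodevelopmental", stopwords));
    }
}
```

The set lookup avoids compiling one regex per stopword, which may matter when the stopword list is long.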
