
Remove stop words in Java

I have a list of around 30 stop words and a set of articles.

I want to parse each article and remove those stop words from it.

I am not sure what the most efficient way to do this is.

For instance, I could loop through the stop list and replace each stop word found in the article with whitespace, but that does not seem like a good approach.

Thanks

  • Put the stop words into a java.util.Set
  • Split the input into words
  • For each word in the input, check whether it is contained in the set of stop words, and write it to the output if it is not
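The three steps above can be sketched roughly as follows. This is only a minimal illustration: the class name `StopWordFilter`, the sample stop list, and the choice to split on whitespace and compare in lower case are all assumptions, not part of the original answer. Note that a plain whitespace split leaves punctuation attached, so "the," would not match "the".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {
    // Hypothetical sample list; the question's actual ~30 words are not given.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    // Splits on whitespace and keeps only the words that are not in the set.
    static String removeStopWords(String text, Set<String> stopWords) {
        StringBuilder out = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            // Set.contains is a constant-time hash lookup on average.
            if (word.isEmpty() || stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("The cat and the hat", STOP_WORDS)); // prints "cat hat"
    }
}
```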

Replacing the words will be inefficient. Your best bet is probably to parse the article word by word and copy each word to a new StringBuffer, unless it is a stop word, in which case you copy whatever you want in its place. StringBuffer is much more efficient than String here.
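One possible reading of "copy each word, or whatever you want in its place" is sketched below. It is an assumption-laden example, not the answerer's code: the class `StopWordReplacer`, the `[A-Za-z']+` definition of a word, and the `replacement` parameter are all hypothetical, and it uses StringBuilder, the unsynchronized counterpart of StringBuffer. Text between words (punctuation, whitespace) is copied through unchanged.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StopWordReplacer {
    // Assumed definition of a "word": a run of ASCII letters and apostrophes.
    private static final Pattern WORD = Pattern.compile("[A-Za-z']+");

    static String replaceStopWords(String text, Set<String> stopWords, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        Matcher m = WORD.matcher(text);
        int last = 0;
        while (m.find()) {
            // Copy whatever sits between the previous word and this one.
            out.append(text, last, m.start());
            String word = m.group();
            boolean isStopWord = stopWords.contains(word.toLowerCase(Locale.ROOT));
            out.append(isStopWord ? replacement : word);
            last = m.end();
        }
        // Copy any trailing text after the final word.
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "and"));
        System.out.println(replaceStopWords("The cat, and the hat.", stop, "_"));
    }
}
```

Passing an empty string as the replacement removes the stop words but leaves the surrounding spaces in place, so a whitespace clean-up pass may still be wanted.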

How you store the stop words is probably unimportant if there are only thirty or so. A Set is probably a good bet.

According to the Sun Java Tutorials, you can use the Perl-compatible \b word-boundary delimiter in your regular expressions (written as \\b inside a Java string literal). If you surround the word with them, the pattern will match only that whole word, whether it is preceded or followed by a punctuation character or whitespace.
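A rough sketch of the \b approach, assuming the stop words are joined into a single alternation and matched case-insensitively. The class name `RegexStopWordFilter` and the whitespace clean-up step are illustrative choices, not something the answer specifies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexStopWordFilter {
    // Builds one pattern of the form \b(?:w1|w2|...)\b from the stop list.
    static Pattern compile(List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote) // treat each stop word as a literal
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    static String removeStopWords(String text, Pattern stopWordPattern) {
        String stripped = stopWordPattern.matcher(text).replaceAll("");
        // Removing words leaves runs of spaces behind; collapse them.
        return stripped.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        Pattern p = compile(Arrays.asList("the", "and")); // hypothetical stop list
        System.out.println(removeStopWords("The cat, and the hat.", p)); // prints "cat, hat."
    }
}
```

Compiling the pattern once and reusing it for every article avoids rebuilding the regex per call. One caveat: \b also treats hyphens and apostrophes as boundaries, so a stop word can match inside a token like "the-end".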

Read a word from the input, and copy it to your StringBuilder (or wherever you're putting the result) if and only if it's not in the list of stop words. You'll be able to search for them faster if you put the stop words into something like a HashTable.

Edit: oops, don't know what I was thinking, but you want a set, not a HashTable (or any other Dictionary).
