Removing stopwords from a String in Java
I have a string containing a large number of words, and a text file containing stopwords that I need to remove from that string. Suppose I have the string
s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."
After removing the stopwords, the string should be:
"love phone, super fast much cool jelly bean....but recently bugs."
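I have been able to achieve this, but the problem I am facing is that when the string contains adjacent stopwords, only the first one is removed, and I get this result: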
"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"
This is my stopwordslist.txt file: stopwords
How can I solve this problem? Here is what I have done so far:
int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
FileReader fr=new FileReader("F:\\stopwordslist.txt");
BufferedReader br= new BufferedReader(fr);
while ((sCurrentLine = br.readLine()) != null){
stopwords[k]=sCurrentLine;
k++;
}
String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
StringBuilder builder = new StringBuilder(s);
String[] words = builder.toString().split("\\s");
for (String word : words){
wordsList.add(word);
}
for(int ii = 0; ii < wordsList.size(); ii++){
for(int jj = 0; jj < k; jj++){
if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
wordsList.remove(ii);
break;
}
}
}
for (String str : wordsList){
System.out.print(str+" ");
}
}catch(Exception ex){
System.out.println(ex);
}
Here is a more elegant solution (IMHO), using only regular expressions:
// instead of the ".....", add all your stopwords, separated by "|"
// "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
// the "\\s?" is to suppress optional trailing white space
Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
String s = m.replaceAll("");
System.out.println(s);
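Typing every stopword into the pattern by hand gets tedious for a long list. The following is a minimal sketch, not the answerer's code, of building the same kind of pattern from a collection. The class name StopwordRegex and its methods are made up for illustration, and matching is assumed to be case-insensitive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StopwordRegex {
    // Builds one pattern of the form \b(?:w1|w2|...)\b\s? from the given words.
    static Pattern compile(List<String> stopwords) {
        String alternation = stopwords.stream()
                .map(Pattern::quote)               // escape regex metacharacters in each word
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?", Pattern.CASE_INSENSITIVE);
    }

    // Removes every whole-word occurrence of a stopword, plus one trailing whitespace character.
    static String remove(String text, List<String> stopwords) {
        if (stopwords.isEmpty()) {
            return text;                           // an empty alternation would match at every word boundary
        }
        return compile(stopwords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopwords = Arrays.asList("I", "this", "its", "and");
        // prints "love phone, super fast cool"
        System.out.println(remove("I love this phone, its super fast and cool", stopwords));
    }
}
```

Note that \b treats an apostrophe as a boundary, so a stopword "I" also matches the "I" in "I've".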
Try the program below.
String s="I love this phone, its super fast and there's so" +
" much new and cool things with jelly bean....but of recently I've seen some bugs.";
String[] words = s.split(" ");
ArrayList<String> wordsList = new ArrayList<String>();
Set<String> stopWordsSet = new HashSet<String>();
stopWordsSet.add("I");
stopWordsSet.add("THIS");
stopWordsSet.add("AND");
stopWordsSet.add("THERE'S");
for(String word : words)
{
String wordCompare = word.toUpperCase();
if(!stopWordsSet.contains(wordCompare))
{
wordsList.add(word);
}
}
for (String str : wordsList){
System.out.print(str+" ");
}
Output: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.
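The set in the program above is hard-coded, whereas the question reads its stopwords from a text file. Below is a minimal sketch of loading such a file into a set, assuming one stopword per line and UTF-8 encoding. The class name StopwordFile and the method load are hypothetical, and the words are stored in lower case here, so the lookup would need toLowerCase() rather than toUpperCase().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StopwordFile {
    // Reads one stopword per line, trimming whitespace and skipping blank lines.
    static Set<String> load(Path file) throws IOException {
        Set<String> stopwords = new HashSet<>();
        for (String line : Files.readAllLines(file)) {   // readAllLines(Path) decodes as UTF-8
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }
}
```

A HashSet gives constant-time lookups on average, which avoids scanning a 2000-element array for every word.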
You can use the replaceAll function like this:
String yourString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString = yourString.replaceAll("stop", ""); // removes every match of the regex "stop"; substitute your own stopword
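The bug occurs because you are removing elements from the list while iterating over it. Say wordsList contains |word0|word1|word2|. If ii equals 1 and the if test is true, you call wordsList.remove(1);. After that your list is |word0|word2|. ii is then incremented to 2, which is no longer less than the size of the list, so word2 is never tested.
From there, there are several solutions. For example, instead of removing the value you could set it to "". Or you could build a separate "result" list.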
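The "result list" idea could look roughly like this. It is a sketch under the assumption that the stopwords are stored in lower case in a Set; the class name ResultListFilter and the method filter are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ResultListFilter {
    // Copies every non-stopword into a new list; the input list is never modified,
    // so no index can be skipped.
    static List<String> filter(List<String> words, Set<String> stopwords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!stopwords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("and", "so", "new"));
        List<String> words = Arrays.asList("fast", "and", "so", "much", "new", "and", "cool");
        // prints "fast much cool"; the adjacent stopwords "and so" and "new and" are all removed
        System.out.println(String.join(" ", filter(words, stopwords)));
    }
}
```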
Try it the following way:
String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords[]={"love","this","cool"};
for(int i=0;i<stopWords.length;i++){
if(s.contains(stopWords[i])){
s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
}
}
System.out.println(s);
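This way your final output will not contain the words you do not want. Just put the stopword list in an array and replace each one as required.
Output with my stopwords: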
I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.
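Alternatively, why not use the approach below? It is easier to read and understand:
// words is assumed to hold the stopwords
for (String word : words) {
    // replaceAll takes a regex; String.replace would search for the literal text "\s*"
    s = s.replaceAll(word + "\\s*", "");
}
System.out.println(s); // prints the string with the words removed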
Try using String's replaceAll API, like:
String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);
OUTPUT:
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.
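Try storing the stopwords in a Set, then tokenize the string into a List. After that you can simply use 'removeAll' to get the result.
Set<String> stopwords = new HashSet<>(); // Set is an interface, so instantiate a HashSet
// fill in the set with your file
String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList returns a fixed-size list; copy it so that removeAll can modify it
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));
listOfStrings.removeAll(stopwords);
String result = String.join(" ", listOfStrings); // or StringUtils.join from Apache Commons Lang
No explicit loops are needed; they usually mean trouble.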
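One limitation of comparing raw tokens against the set is that the match is case-sensitive and that a token such as "phone," (with its comma) never equals a stopword "phone". A possible variation using streams is sketched below. It assumes ASCII words and lower-case stopwords; StreamStopwordFilter, normalize and remove are illustrative names, not part of any library.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class StreamStopwordFilter {
    // Lookup key only: lower-cased, with leading and trailing characters
    // other than a-z and the apostrophe stripped off.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT).replaceAll("^[^a-z']+|[^a-z']+$", "");
    }

    // Keeps the original tokens (with their punctuation) unless their key is a stopword.
    static String remove(String text, Set<String> stopwords) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !stopwords.contains(normalize(token)))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("i", "this", "its"));
        // prints "love phone, super fast"
        System.out.println(remove("I love this phone, its super fast", stopwords));
    }
}
```

A removed token takes its attached punctuation with it, so "this," would disappear together with its comma.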
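It seems that you stop after one stopword has been removed and move on to the next word: you need to remove all the stopwords in each sentence.
You should try changing your code from: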
for(int ii = 0; ii < wordsList.size(); ii++){
for(int jj = 0; jj < k; jj++){
if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
wordsList.remove(ii);
break;
}
}
}
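to:
for(int ii = 0; ii < wordsList.size(); ii++)
{
    // the extra bounds check guards against having just removed the last element
    for(int jj = 0; ii < wordsList.size() && jj < k; jj++)
    {
        // contains() matches substrings; use equals() to match whole words only
        if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
        {
            wordsList.remove(ii);
            jj = -1; // the next word has shifted into index ii: test it against every stopword
        }
    }
}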
Note that the break is removed, and stopword.contains(word) is changed to word.contains(stopword).
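If the list really has to be modified in place, two common ways to avoid skipping elements are to iterate from the end or to use Collection.removeIf (Java 8 and later). This is a sketch rather than the answerer's code; the class name InPlaceRemoval and both method names are made up, and the stopwords are assumed to be lower case in a Set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class InPlaceRemoval {
    // Walks from the last index to the first, so remove(i) only shifts
    // elements that have already been visited.
    static void removeBackwards(List<String> words, Set<String> stopwords) {
        for (int i = words.size() - 1; i >= 0; i--) {
            if (stopwords.contains(words.get(i).toLowerCase())) {
                words.remove(i);
            }
        }
    }

    // Same effect, leaving the index bookkeeping to the collection.
    static void removeWithPredicate(List<String> words, Set<String> stopwords) {
        words.removeIf(word -> stopwords.contains(word.toLowerCase()));
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("and", "so", "new"));
        List<String> words = new ArrayList<>(Arrays.asList("fast", "and", "so", "much", "new", "and", "cool"));
        removeBackwards(words, stopwords);
        System.out.println(words); // prints [fast, much, cool]
    }
}
```

Both methods need a modifiable list such as ArrayList; a list returned directly by Arrays.asList would throw UnsupportedOperationException on removal.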
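Recently, after going through some blogs and articles, one of my projects needed functionality to filter stop words, stemming words and swear words from a given text or file. I created a simple library to filter data/files and made it available in Maven. Hope this may help someone.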
https://github.com/uttesh/exude
<dependency>
<groupId>com.uttesh</groupId>
<artifactId>exude</artifactId>
<version>0.0.2</version>
</dependency>