
Removing stopwords from a String in Java

I have a string containing a lot of words, and a text file containing some stopwords that I need to remove from my string. Say I have the string

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stopwords, the string should be:

"love phone, super fast much cool jelly bean....but recently bugs."

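I have been able to achieve this, but the problem I am facing is that whenever there are adjacent stopwords in the string, only the first one is removed, and I get this result: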

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"  

Here is my stopwordslist.txt file: Stopwords

How can I solve this problem? Here is what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
        FileReader fr=new FileReader("F:\\stopwordslist.txt");
        BufferedReader br= new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null){
            stopwords[k]=sCurrentLine;
            k++;
        }
        String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words){
            wordsList.add(word);
        }
        for(int ii = 0; ii < wordsList.size(); ii++){
            for(int jj = 0; jj < k; jj++){
                if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                    wordsList.remove(ii);
                    break;
                }
             }
        }
        for (String str : wordsList){
            System.out.print(str+" ");
        }   
    }catch(Exception ex){
        System.out.println(ex);
    }

Here is a more elegant solution (IMHO), using only regular expressions:

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
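
When the stopwords come from a list rather than being typed into the pattern by hand, the same regex can be assembled programmatically. The following is a minimal sketch, not the answer author's code: the class name `StopwordRegex`, its method names, and the short stopword list are hypothetical, and it assumes a non-empty list and Java 9+ (for `List.of`).

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: builds the \b(...)\b\s? pattern from a list of stopwords.
final class StopwordRegex {
    // Pattern.quote keeps characters such as "." or "'" from being read as regex syntax.
    // Assumes the list is non-empty.
    static Pattern compile(List<String> stopwords) {
        String alternation = stopwords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes every whole-word occurrence of a stopword, plus one trailing whitespace character.
    static String remove(String text, List<String> stopwords) {
        return compile(stopwords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopwords = List.of("I", "this", "its"); // illustrative list only
        // prints "love phone, super fast"
        System.out.println(remove("I love this phone, its super fast", stopwords));
    }
}
```

Note that `\b` also treats an apostrophe as a boundary, so a stopword "I" would still match the "I" in "I've".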

Try the program below.

String s="I love this phone, its super fast and there's so" +
            " much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    for(String word : words)
    {
        String wordCompare = word.toUpperCase();
        if(!stopWordsSet.contains(wordCompare))
        {
            wordsList.add(word);
        }
    }

    for (String str : wordsList){
        System.out.print(str+" ");
    }

Output: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.

You can use the replaceAll function like this:

String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString=yourString.replaceAll("stop" ,"");

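The bug occurs because you remove elements from the list while you are iterating over it. Say wordsList contains |word0|word1|word2|. If ii equals 1 and the if test is true, wordsList.remove(1) is called, and afterwards your list is |word0|word2|. ii is then incremented to 2, which is no longer less than the size of the list, so word2 is never tested.

From there, several solutions are possible. For example, you could set the value to "" instead of removing it, or build a separate "result" list.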
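
The "result list" idea can be sketched as follows. This is only an illustration under assumptions: the class name `ResultListFilter` and the sample words are hypothetical, and the stopwords are assumed to be stored in lower case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: copies the words to keep into a new list,
// so the list being iterated is never modified.
final class ResultListFilter {
    static List<String> filter(List<String> words, Set<String> stopwords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            // Exact, case-insensitive match against the lower-case stopword set.
            if (!stopwords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("and", "there's", "so"));
        List<String> words = Arrays.asList("fast", "and", "there's", "so", "much");
        // Adjacent stopwords are all dropped; prints [fast, much]
        System.out.println(filter(words, stopwords));
    }
}
```

Because no element is removed during iteration, no index is skipped, which is exactly the case that fails for adjacent stopwords in the question.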

Try it the following way:

   String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
   String stopWords[]={"love","this","cool"};
   for(int i=0;i<stopWords.length;i++){
       if(s.contains(stopWords[i])){
           s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
       }
   }
   System.out.println(s);

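This way your final output will not contain the words you don't want. Just put the list of stopwords in an array and replace each one with the required string.
Output with my stopwords: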

I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.

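Instead, why not use the approach below? It is easier to read and understand:

// words holds the stopwords to remove
for(String word : words){
    // replaceAll interprets its first argument as a regex; replace would look for the literal text "\s*"
    s = s.replaceAll(word+"\\s*", "");
}
System.out.println(s); // prints the string with the stopwords removed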

Try using String's replaceAll API, like:

String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);

OUTPUT: 
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.

Try storing the stopwords in a Set collection, then tokenize your string into a list. After that you can simply use 'removeAll' to get the result.

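Set<String> stopwords = new HashSet<>();
//fill in the set with your file

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList alone returns a fixed-size list, so copy it into an ArrayList before calling removeAll
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));

listOfStrings.removeAll(stopwords);
String result = StringUtils.join(listOfStrings, " "); // StringUtils is from Apache Commons Lang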

No loops needed - they usually mean trouble.
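
For the "fill in the set with your file" step, one possible sketch is shown below. It assumes the file holds one stopword per line, as the `readLine` loop in the question suggests; the class name `StopwordLoader` and its methods are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: turns the lines of a stopword file into a lower-case Set.
final class StopwordLoader {
    // Trims each line, lower-cases it, and skips blank lines.
    static Set<String> parse(List<String> lines) {
        Set<String> stopwords = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    // Reads the whole file as UTF-8 (the default of Files.readAllLines).
    static Set<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file));
    }
}
```

A call such as `StopwordLoader.load(Paths.get("F:\\stopwordslist.txt"))` would then replace the fixed-size `String[2000]` array from the question. Since the set is lower-case, tokens need to be lower-cased before they are compared against it.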

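It seems that once one stopword has been removed, you stop and move on to the next word: you need to remove all of the stopwords in the sentence.

You should try changing your code

from: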

for(int ii = 0; ii < wordsList.size(); ii++){
    for(int jj = 0; jj < k; jj++){
        if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
            wordsList.remove(ii);
            break;
        }
    }
}

to something like this:

for(int ii = 0; ii < wordsList.size(); ii++)
{
    for(int jj = 0; jj < k; jj++)
    {
        if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
        {
            wordsList.remove(ii);
        }
    }
}

Note that break is removed, and stopword.contains(word) is changed to word.contains(stopword).
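
As an alternative to index-based removal, `List.removeIf` (Java 8+) handles the iteration itself, so no element is skipped after a removal. This is a sketch under assumptions rather than the answer author's code: the class name `SafeRemoval` is hypothetical, and it swaps the substring test `contains` for an exact match against a lower-case `Set`, since a substring test would also remove words that merely contain a stopword.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: removes stopwords from the list in place.
final class SafeRemoval {
    static void removeStopwords(List<String> wordsList, Set<String> stopwords) {
        // Set.contains is an exact match, unlike String.contains (substring match).
        wordsList.removeIf(word -> stopwords.contains(word.toLowerCase()));
    }

    public static void main(String[] args) {
        // The list must be modifiable, hence the ArrayList copy.
        List<String> words = new ArrayList<>(
                Arrays.asList("fast", "and", "there's", "so", "much"));
        removeStopwords(words, new HashSet<>(Arrays.asList("and", "there's", "so")));
        System.out.println(words); // prints [fast, much]
    }
}
```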

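Recently, one of my projects needed functionality to filter stop words, stemming words and swear words from a given text or file. After going through a few blogs and write-ups, I created a simple library to filter data/files and made it available in Maven. Hope this may help someone.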

https://github.com/uttesh/exude

     <dependency>
        <groupId>com.uttesh</groupId>
        <artifactId>exude</artifactId>
        <version>0.0.2</version>
    </dependency>

