Removing stop words from a String in Java

Question

I have a string containing many words, and a text file containing stop words that need to be removed from that string. Suppose I have the string

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stop words, the string should be:

"love phone, super fast much cool jelly bean....but recently bugs."

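I have been able to do this, but the problem is that when the string contains adjacent stop words, only the first one is removed, and I get this result: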

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"

This is my stopwordslist.txt file: Stopwords

How can I fix this? Here is what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
        FileReader fr=new FileReader("F:\\stopwordslist.txt");
        BufferedReader br= new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null){
            stopwords[k]=sCurrentLine;
            k++;
        }
        String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words){
            wordsList.add(word);
        }
        for(int ii = 0; ii < wordsList.size(); ii++){
            for(int jj = 0; jj < k; jj++){
                if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                    wordsList.remove(ii);
                    break;
                }
             }
        }
        for (String str : wordsList){
            System.out.print(str+" ");
        }   
    }catch(Exception ex){
        System.out.println(ex);
    }

Answer 1

Here is a more elegant solution (IMHO), using only regular expressions:

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
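
When the stop words come from a list or a file rather than being typed into the pattern by hand, the alternation can be built programmatically. The following is a minimal sketch of that idea, not the answerer's code: the class name `StopWordRegex` and the method `removeStopWords` are made up for illustration, `Pattern.quote` is used so that words containing regex metacharacters are matched literally, and `CASE_INSENSITIVE` is an assumption about the desired matching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StopWordRegex {

    // Builds \b(?:w1|w2|...)\b\s? from the given words and removes every match.
    // Hypothetical helper, not part of any library.
    static String removeStopWords(String text, List<String> stopWords) {
        if (stopWords.isEmpty()) {
            return text; // an empty alternation would match at every word boundary
        }
        String alternation = stopWords.stream()
                .map(Pattern::quote)               // treat each word literally
                .collect(Collectors.joining("|"));
        Pattern p = Pattern.compile("\\b(?:" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("I", "this", "its", "and");
        System.out.println(
                removeStopWords("I love this phone, its super fast and cool", stopWords));
    }
}
```

Compiling the pattern once and reusing it is worthwhile if many strings are filtered with the same stop word list.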

Answer 2

Try the program below.

String s="I love this phone, its super fast and there's so" +
            " much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    for(String word : words)
    {
        String wordCompare = word.toUpperCase();
        if(!stopWordsSet.contains(wordCompare))
        {
            wordsList.add(word);
        }
    }

    for (String str : wordsList){
        System.out.print(str+" ");
    }

Output: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.

Answer 3

You can use the replaceAll function like this:

String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString=yourString.replaceAll("stop" ,"");

Answer 4

From there, there are several solutions. For example, you can set the value to "" instead of removing it, or build a separate "result" list.
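
Adjacent stop words survive in the question's loop because `wordsList.remove(ii)` shifts the next word into index `ii`, and the loop's `ii++` then steps over it. A sketch of the "result list" idea follows. It is an illustration only: the class `ResultListFilter`, the method `filter`, and the small stop word set are assumptions, and a word counts as a stop word only when the whole lower-cased token matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ResultListFilter {

    // Copies every non-stop word into a new list instead of removing in place,
    // so no index is ever skipped. Hypothetical helper for illustration.
    static List<String> filter(String[] words, Set<String> stopWords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("and", "there's", "so", "new"));
        String[] words = "fast and there's so much new and cool".split("\\s+");
        System.out.println(String.join(" ", filter(words, stopWords)));
    }
}
```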

Answer 5

Try it this way:

   String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
   String stopWords[]={"love","this","cool"};
   for(int i=0;i<stopWords.length;i++){
       if(s.contains(stopWords[i])){
           s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
       }
   }
   System.out.println(s);

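This way your final output will not contain the words you don't want. Just put the stop words in an array and replace each of them with the required string.
Output for my stop words: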

I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.

Answer 6

Instead, why not use the approach below? It is easier to read and understand:

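// words holds the stop words; replaceAll (not replace) is needed because the argument is a regex
for(String word : words){
    s = s.replaceAll(word+"\\s*", "");
}
System.out.println(s); // prints the string with the stop words removed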

Answer 7

Try using String's replaceAll API, like:

String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);

OUTPUT: 
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.
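
The output above shows a side effect: the `I` in `I've` is matched too, leaving `'ve` behind. One possible variation, sketched here under assumptions, uses lookarounds so that a stop word matches only when it is a complete whitespace-delimited token. The class `TokenBoundaryRegex` and the method `strip` are hypothetical names, and tokens with attached punctuation (such as `with.`) are deliberately left alone by this pattern.

```java
import java.util.regex.Pattern;

public class TokenBoundaryRegex {

    // stopWords is a "|"-separated alternation, as in the answer above.
    // (?<!\S) and (?!\S) require whitespace or the string edge on both sides.
    static String strip(String text, String stopWords) {
        Pattern p = Pattern.compile("(?<!\\S)(?:" + stopWords + ")(?!\\S)\\s*");
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("I love its phone but I've bugs", "I|its|with|but"));
    }
}
```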

Answer 8

Try storing the stop words in a Set, then tokenize the string into a list. After that you can simply use 'removeAll' to get the result.

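// requires java.util.ArrayList, Arrays, HashSet, List, Set
Set<String> stopwords = new HashSet<>();
//fill in the set with your file

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList returns a fixed-size list, so copy it into an ArrayList before removing
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));

listOfStrings.removeAll(stopwords); // case-sensitive, whole-token match
// String.join (Java 8+) is used here in place of Apache Commons StringUtils.join
String result = String.join(" ", listOfStrings);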

No loops needed; they usually mean trouble.
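
Filling the set from the stop word file could look something like the sketch below. It rests on assumptions: one stop word per line, UTF-8 encoding, and hypothetical helper names (`toStopWordSet`, `loadStopWords`, `removeStopWords`). Unlike `removeAll`, this version lower-cases each token before the lookup, so `I` is removed when the file lists `i`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFile {

    // Trims and lower-cases each line and skips blank lines.
    static Set<String> toStopWordSet(List<String> lines) {
        Set<String> stopWords = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    // Assumes one stop word per line in a UTF-8 text file.
    static Set<String> loadStopWords(Path file) throws IOException {
        return toStopWordSet(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    // Splits on whitespace and drops every token whose lower-cased form is in the set.
    static String removeStopWords(String text, Set<String> stopWords) {
        List<String> words = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        words.removeIf(w -> stopWords.contains(w.toLowerCase()));
        return String.join(" ", words);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java StopWordFile <stop-word-file>");
            return;
        }
        Set<String> stopWords = loadStopWords(Paths.get(args[0]));
        System.out.println(removeStopWords("I love this phone, its super fast", stopWords));
    }
}
```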

Answer 9

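It seems that you stop after removing one stop word and then move on to the next word. You need to remove all the stop words in each sentence.

You should try changing your code:

From: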

for(int ii = 0; ii < wordsList.size(); ii++){
    for(int jj = 0; jj < k; jj++){
        if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
            wordsList.remove(ii);
            break;
        }
    }
}

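To something like this:

for(int ii = 0; ii < wordsList.size(); ii++)
{
    // the extra bounds check guards against having just removed the last word
    for(int jj = 0; ii < wordsList.size() && jj < k; jj++)
    {
        if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
        {
            wordsList.remove(ii);
            jj = -1; // the next word shifted into index ii: re-check it from the first stop word
        }
    }
}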

Note that the break is removed, and stopword.contains(word) is changed to word.contains(stopword).
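
Another way to remove in place, offered only as a sketch: walk the list from the end, so removing index `ii` never shifts a word that has not been examined yet. This variant compares with `equals` rather than `contains`, because `contains` also matches substrings (for example, "this" contains "his"). The class and method names are hypothetical, and the stop words are assumed to be lower-case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BackwardRemoval {

    // Iterates backward so that removals only shift elements already visited.
    static void removeStopWords(List<String> wordsList, String[] stopwords) {
        for (int ii = wordsList.size() - 1; ii >= 0; ii--) {
            String word = wordsList.get(ii).toLowerCase();
            for (String stopword : stopwords) {
                if (word.equals(stopword)) {
                    wordsList.remove(ii);
                    break; // this word is gone; continue with the previous index
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> wordsList = new ArrayList<>(
                Arrays.asList("fast", "and", "there's", "so", "much", "new", "and", "cool"));
        removeStopWords(wordsList, new String[] {"and", "there's", "so", "new"});
        System.out.println(String.join(" ", wordsList));
    }
}
```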

Answer 10

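Recently one of my projects needed to filter stop words, stemming words, and swear words from a given text or file. After going through a few blogs and articles, I created a simple library to filter data or files and made it available on Maven. Hope this helps someone.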

https://github.com/uttesh/exude

     <dependency>
        <groupId>com.uttesh</groupId>
        <artifactId>exude</artifactId>
        <version>0.0.2</version>
    </dependency>

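Answer 1: score 5, posted 2014-12-29 08:58:20
Answer 2: score 4, posted 2014-12-29 09:18:22
Answer 3: score 3, posted 2014-12-29 10:17:43
Answer 4: score 2, accepted, posted 2014-12-29 09:11:31
Answer 5: score 1, posted 2014-12-29 08:56:28
Answer 6: score 1, posted 2014-12-29 08:56:41
Answer 7: score 1, posted 2014-12-29 09:05:13
Answer 8: score 0, posted 2014-12-29 09:31:39
Answer 9: score 0, posted 2015-10-13 00:50:35
Answer 10: score 0, posted 2016-01-07 15:23:24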
