Removing stop words from a String in Java

Question

I have a string containing a lot of words, and I have a text file with some stop words that I need to remove from my string. Suppose I have the string

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stop words, the string should be:

"love phone, super fast much cool jelly bean....but recently bugs."

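I have been able to achieve this, but the problem I am facing is that whenever there are adjacent stop words in the string, only the first one is removed, and I get this result: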

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"

Here is my stopwordslist.txt file: Stopwords

How can I fix this? Here is what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
        FileReader fr=new FileReader("F:\\stopwordslist.txt");
        BufferedReader br= new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null){
            stopwords[k]=sCurrentLine;
            k++;
        }
        String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words){
            wordsList.add(word);
        }
        for(int ii = 0; ii < wordsList.size(); ii++){
            for(int jj = 0; jj < k; jj++){
                if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                    wordsList.remove(ii);
                    break;
                }
             }
        }
        for (String str : wordsList){
            System.out.print(str+" ");
        }   
    }catch(Exception ex){
        System.out.println(ex);
    }

Answer 1

Here is a more elegant solution (IMHO), using only regular expressions:

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
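
If the stop words come from a list or a file rather than being typed in by hand, the alternation can be assembled programmatically. The following is a minimal sketch of that idea, not the answerer's own code: the class name StopWordRegex and the helper removeStopWords are hypothetical, and the case-insensitive flag is an added assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StopWordRegex {

    // Hypothetical helper: builds \b(?:w1|w2|...)\b\s? from the given stop words.
    public static String removeStopWords(String text, List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)               // escape regex metacharacters in each word
                .collect(Collectors.joining("|"));
        // CASE_INSENSITIVE is an assumption; drop it for exact-case matching.
        Pattern p = Pattern.compile("\\b(?:" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("I", "this", "its", "and");
        // prints: love phone, super fast cool
        System.out.println(removeStopWords("I love this phone, its super fast and cool", stopWords));
    }
}
```

Note that "\\b" treats an apostrophe as a word boundary, so with "I" in the list the input "I've seen" becomes "'ve seen". Whether that is acceptable depends on how contractions should be handled.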

Answer 2

Try the program below.

    String s="I love this phone, its super fast and there's so" +
            " much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    for(String word : words)
    {
        String wordCompare = word.toUpperCase();
        if(!stopWordsSet.contains(wordCompare))
        {
            wordsList.add(word);
        }
    }

    for (String str : wordsList){
        System.out.print(str+" ");
    }

Output: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.
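
The question keeps its stop words in a text file, so the set can be filled from that file instead of with hard-coded add calls. Below is a rough sketch under a few assumptions: the file holds one stop word per line, comparison is done in lower case, and the names StopWordFileFilter, loadStopWords and filter are made up for illustration. The main method writes a small temporary file as a stand-in for stopwordslist.txt so that it runs anywhere.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

public class StopWordFileFilter {

    // Reads one stop word per line (assumed file layout) into a lower-cased set.
    public static Set<String> loadStopWords(Path file) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    // Keeps every whitespace-separated token that is not in the stop word set.
    public static String filter(String text, Set<String> stopWords) {
        StringJoiner kept = new StringJoiner(" ");
        for (String token : text.trim().split("\\s+")) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for stopwordslist.txt so the sketch runs anywhere.
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, Arrays.asList("I", "this", "and", "there's"));
        Set<String> stopWords = loadStopWords(file);
        // prints: love phone more
        System.out.println(filter("I love this phone and there's more", stopWords));
        Files.delete(file);
    }
}
```

Tokens keep their punctuation, so a token such as "its," would not match the stop word "its". Stripping punctuation before the lookup is a separate design decision.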

Answer 3

You can use the replaceAll function like this:

String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// "stop" is a placeholder for the stop word (or regex) you want removed
yourString=yourString.replaceAll("stop" ,"");

Answer 4

From there, there are several solutions. For example, instead of removing a value you can set it to "". Or create a separate "result" list.
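
Removing an element by index shifts the remaining elements one position to the left, so the loop's next increment jumps over the word that moved into the current slot. That is why the second of two adjacent stop words survives. Here is a minimal sketch of the "result list" option; the class name ResultListFilter and the inline stop word list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResultListFilter {

    // Copies non-stop words into a new list instead of removing from the list being iterated.
    public static List<String> withoutStopWords(List<String> words, List<String> stopWords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("fast", "and", "there's", "so", "much");
        List<String> stopWords = Arrays.asList("and", "there's", "so");
        System.out.println(withoutStopWords(words, stopWords)); // prints [fast, much]
    }
}
```

Because the source list is never modified, no index can be skipped, and the three adjacent stop words in the example are all dropped.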

Answer 5

Try it the following way:

   String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
   String stopWords[]={"love","this","cool"};
   for(int i=0;i<stopWords.length;i++){
       if(s.contains(stopWords[i])){
           s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
       }
   }
   System.out.println(s);

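This way your final output will not contain the words you don't want. Just put the list of stop words in an array and replace each one with the required string.
Output for my stop words: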

I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.

Answer 6

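Instead, why not use the approach below? It is easier to read and understand:

// words holds the stop words; replaceAll (not replace) is needed because the argument is a regex
for(String word : words){
    s = s.replaceAll(word+"\\s*", "");
}
System.out.println(s); // prints the string with the stop words removed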

Answer 7

Try using String's replaceAll API, like:

String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);

OUTPUT: 
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.

Answer 8

Try storing the stop words in a Set collection, then tokenize your string into a list. After that you can simply use 'removeAll' to get the result.

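import java.util.*;
import org.apache.commons.lang3.StringUtils;

Set<String> stopwords = new HashSet<>();
//fill in the set with your file

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList returns a fixed-size list, so copy it into an ArrayList before removing
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));

listOfStrings.removeAll(stopwords);
// StringUtils is from Apache Commons Lang; String.join(" ", listOfStrings) also works on Java 8+
String result = StringUtils.join(listOfStrings, " ");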

No loops needed; they usually mean trouble.
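
One thing to watch with removeAll is that a plain set lookup is case-sensitive, so "I" in the text would not match "i" in the set. A possible workaround, sketched here as an assumption rather than as part of the answer, is to keep the stop words in a TreeSet ordered by String.CASE_INSENSITIVE_ORDER. The class name CaseInsensitiveRemoveAll and the helper filter are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CaseInsensitiveRemoveAll {

    public static String filter(String text, List<String> stopWordList) {
        // A TreeSet with CASE_INSENSITIVE_ORDER makes contains() ignore case.
        Set<String> stopWords = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        stopWords.addAll(stopWordList);

        // Copy into an ArrayList: the list returned by Arrays.asList cannot shrink.
        List<String> tokens = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        // ArrayList.removeAll asks stopWords.contains(token) for each token.
        tokens.removeAll(stopWords);
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // prints: love phone screen
        System.out.println(filter("I love THIS phone and its screen",
                Arrays.asList("i", "this", "and", "its")));
    }
}
```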

Answer 9

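It seems that you stop as soon as one stop word has been removed from the sentence and then move on: you need to remove all stop words in each sentence.

You should try changing your code:

From: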

for(int ii = 0; ii < wordsList.size(); ii++){
    for(int jj = 0; jj < k; jj++){
        if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
            wordsList.remove(ii);
            break;
        }
    }
}

To something like this:

for(int ii = 0; ii < wordsList.size(); ii++)
{
    for(int jj = 0; jj < k; jj++)
    {
        if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
        {
            wordsList.remove(ii);
        }
    }
}

Note that the break is removed, and stopword.contains(word) is changed to word.contains(stopword).
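
Removing by index inside nested loops is easy to get wrong: after a removal the next word slides into position ii, and a further get(ii) can run past the end of the list. An alternative that avoids manual index bookkeeping is List.removeIf (Java 8+). The sketch below is only one way to do it; the class name RemoveIfFilter is hypothetical, and it uses exact, lower-cased equality rather than contains, which is an assumption about the intended matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RemoveIfFilter {

    // Removes every stop word in place; the list itself handles the index shifting.
    public static void removeStopWords(List<String> wordsList, Set<String> stopWords) {
        wordsList.removeIf(word -> stopWords.contains(word.toLowerCase()));
    }

    public static void main(String[] args) {
        List<String> wordsList = new ArrayList<>(
                Arrays.asList("fast", "and", "there's", "so", "much"));
        Set<String> stopWords = new HashSet<>(Arrays.asList("and", "there's", "so"));
        removeStopWords(wordsList, stopWords);
        System.out.println(wordsList); // prints [fast, much]
    }
}
```

Exact matching also avoids a side effect of substring tests, where a short stop word such as "is" would match inside "this".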

Answer 10

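Recently one of my projects needed to filter stop words, stemming words and swear words from a given text or file. After going through a few blogs and articles on the subject, I created a simple library that filters such data/files and made it available in Maven. Hope this helps someone.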

https://github.com/uttesh/exude

    <dependency>
        <groupId>com.uttesh</groupId>
        <artifactId>exude</artifactId>
        <version>0.0.2</version>
    </dependency>

10 answers

Answer 1  5  2014-12-29 08:58:20
Answer 2  4  2014-12-29 09:18:22
Answer 3  3  2014-12-29 10:17:43
Answer 4  2  accepted  2014-12-29 09:11:31
Answer 5  1  2014-12-29 08:56:28
Answer 6  1  2014-12-29 08:56:41
Answer 7  1  2014-12-29 09:05:13
Answer 8  0  2014-12-29 09:31:39
Answer 9  0  2015-10-13 00:50:35
Answer 10  0  2016-01-07 15:23:24
