
Removing stopwords from a String in Java

I have a string containing many words, and a text file containing stopwords that I need to remove from the string. Say I have the string

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stopwords, the string should be:

"love phone, super fast much cool jelly bean....but recently bugs."

I have mostly got this working, but when the string contains adjacent stopwords only the first one is removed, and I get this result:

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"  

Here is my stopwordslist.txt file: Stopwords

How can I fix this? Here is what I have done so far:

int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
        FileReader fr=new FileReader("F:\\stopwordslist.txt");
        BufferedReader br= new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null){
            stopwords[k]=sCurrentLine;
            k++;
        }
        String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words){
            wordsList.add(word);
        }
        for(int ii = 0; ii < wordsList.size(); ii++){
            for(int jj = 0; jj < k; jj++){
                if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                    wordsList.remove(ii);
                    break;
                }
             }
        }
        for (String str : wordsList){
            System.out.print(str+" ");
        }   
    }catch(Exception ex){
        System.out.println(ex);
    }

Here is a more elegant solution (IMHO), using only regular expressions:

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
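
Typing every stopword into the pattern by hand gets tedious for a long list. The sketch below shows one way the alternation could be built from a list instead. The class name `StopwordRegex`, the method names, and the case-insensitive flag are assumptions for illustration and not part of the answer above; it also assumes the list is non-empty.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StopwordRegex {

    // Builds a single pattern of the form \b(?:w1|w2|...)\b\s? from the word list.
    // Assumes stopwords is non-empty.
    static Pattern buildPattern(List<String> stopwords) {
        String alternation = stopwords.stream()
                .map(Pattern::quote)                 // escape regex metacharacters in each word
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?", Pattern.CASE_INSENSITIVE);
    }

    // Deletes every whole-word match plus one optional trailing whitespace character.
    static String removeStopwords(String text, List<String> stopwords) {
        return buildPattern(stopwords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopwords = Arrays.asList("I", "this", "its", "and");
        // prints: love phone, super fast cool
        System.out.println(removeStopwords("I love this phone, its super fast and cool", stopwords));
    }
}
```

Because `\b` treats an apostrophe as a boundary, a stopword such as "I" would also match inside "I've" and leave "'ve" behind, so contractions may need their own entries in the list.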

Try the program below.

String s="I love this phone, its super fast and there's so" +
            " much new and cool things with jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    for(String word : words)
    {
        String wordCompare = word.toUpperCase();
        if(!stopWordsSet.contains(wordCompare))
        {
            wordsList.add(word);
        }
    }

    for (String str : wordsList){
        System.out.print(str+" ");
    }

Output: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.

You can use the replaceAll function like this:

String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
yourString=yourString.replaceAll("stop" ,"");

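The error occurs because you remove elements from the list while you are iterating over it. Say wordsList contains |word0|word1|word2|. If ii equals 1 and the if test is true, you call wordsList.remove(1); and the list becomes |word0|word2|. ii is then incremented to 2, which is no longer below the size of the list, so word2 is never tested.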

There are several ways to fix this. For example, you can set the value to "" instead of removing it, or build a separate "result" list.
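
A minimal sketch of the separate "result" list idea, assuming the stopwords are held in a lowercase `Set<String>`. The class name `StopwordFilter` and the method `filter` are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordFilter {

    // Copies every non-stopword into a new list, so nothing is removed
    // from a list while it is being iterated and no index is skipped.
    static List<String> filter(String text, Set<String> stopwords) {
        List<String> result = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (!stopwords.contains(word.toLowerCase())) {   // exact, case-insensitive match
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the stopword set is stored in lowercase.
        Set<String> stopwords = new HashSet<>(Arrays.asList("i", "this", "its", "and", "so"));
        String s = "I love this phone, its super fast and so cool";
        // prints: love phone, super fast cool
        System.out.println(String.join(" ", filter(s, stopwords)));
    }
}
```

The adjacent stopwords "and so" are both dropped here, because each word is tested exactly once regardless of what happened to its neighbour.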

Here is one approach to try:

   String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
   String stopWords[]={"love","this","cool"};
   for(int i=0;i<stopWords.length;i++){
       if(s.contains(stopWords[i])){
           s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end
       }
   }
   System.out.println(s);

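This way your final output will not contain the words you don't want. Just put your stopwords in an array and replace them with the required string.
Output with my stopwords: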

I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.

Alternatively, why not use the approach below? It is easier to read and understand:

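// words: the array of stopwords to remove
for(String word : words){
    // replaceAll takes a regex; replace() would search for the literal text word + "\s*"
    s = s.replaceAll(word+"\\s*", "");
}
System.out.println(s); // prints the string with the stopwords removed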

Try using String's replaceAll API, like this:

String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
String stopWords = "I|its|with|but";
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
System.out.println(afterStopWords);

OUTPUT: 
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.

Try storing the stopwords in a Set collection, then tokenize your string into a list. After that you can simply use 'removeAll' to get the result.

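// needs java.util.ArrayList, Arrays, HashSet, List and Set
Set<String> stopwords = new HashSet<>(); // Set is an interface, so use a concrete class
//fill in the set with your file

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
// Arrays.asList returns a fixed-size list; copy it so that removeAll is allowed
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));

listOfStrings.removeAll(stopwords);
// String.join is in the JDK; StringUtils.join from Apache Commons Lang also works
String result = String.join(" ", listOfStrings);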

No loops needed; they usually mean trouble.

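It seems that once one stopword has been removed from the sentence, you stop and move on to the next word: you need to remove all the stopwords in each sentence.

You should try changing your code:

From: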

for(int ii = 0; ii < wordsList.size(); ii++){
    for(int jj = 0; jj < k; jj++){
        if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
            wordsList.remove(ii);
            break;
        }
    }
}

To something like this:

for(int ii = 0; ii < wordsList.size(); ii++)
{
    for(int jj = 0; jj < k; jj++)
    {
        if(wordsList.get(ii).toLowerCase().contains(stopwords[jj]))
        {
            wordsList.remove(ii);
        }
    }
}

Note that the break is removed, and stopword.contains(word) is changed to word.contains(stopword).
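
For comparison, here is a sketch of a different way to keep the index-based, in-place removal without skipping elements: walk the list from the end. This is not the code from the answer above. It keeps the break, uses an exact equals comparison rather than contains, and assumes a fully populated lowercase stopword array; the class name `InPlaceRemoval` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InPlaceRemoval {

    // Removes stopwords from wordsList in place. Iterating backwards means a
    // removal only shifts elements that have already been checked.
    static void removeStopwords(List<String> wordsList, String[] stopwords) {
        for (int ii = wordsList.size() - 1; ii >= 0; ii--) {
            String word = wordsList.get(ii).toLowerCase();
            for (String stopword : stopwords) {
                if (word.equals(stopword)) {   // exact match, so "his" does not hit "this"
                    wordsList.remove(ii);
                    break;                     // index ii is gone; continue with ii - 1
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> wordsList =
                new ArrayList<>(Arrays.asList("I", "love", "this", "and", "so", "cool"));
        String[] stopwords = {"i", "this", "and", "so"};   // assumption: lowercase, no null entries
        removeStopwords(wordsList, stopwords);
        System.out.println(wordsList);   // prints: [love, cool]
    }
}
```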

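Recently one of my projects needed to filter stop words, stemming words, and swear words from a given text or file. After going through a few blogs and write-ups, I created a simple library for filtering text or files and made it available via Maven. Hopefully it helps someone.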

https://github.com/uttesh/exude

    <dependency>
        <groupId>com.uttesh</groupId>
        <artifactId>exude</artifactId>
        <version>0.0.2</version>
    </dependency>
