
Removing stop words after tokenizing a string in Java

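I want to remove stop words after tokenizing a string. I have an external .txt file of stop words that I read in and then compare against the tokenized string. If a token equals a stop word, it should be removed.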

Here is the tokenizing code:

try {
    while ((msg = readBufferData.readLine()) != null) {
        int numberOfTokens;

        System.out.println("Before: " + msg);
        StringTokenizer tokens = new StringTokenizer(msg);

        numberOfTokens = tokens.countTokens();
        System.out.println("Tokens: " + numberOfTokens);

        System.out.print("After : ");
        while (tokens.hasMoreTokens()) {
            msg = tokens.nextToken();
            String msgLower = msg.toLowerCase();
            String punctuationremove = punctuationRemover(msgLower);
            // buffWriter.write(punctuationremove+" "); --> write into file .txt
            System.out.print(punctuationremove + " ");
            removingStopWord(punctuationremove, readStopWordsFile());
            numberOfTotalTokens++;
        }
        // buffWriter.newLine(); make a new line after tokening new message
        System.out.println("\n");
        numberOfMessages++;
    }
    // write close    buffWriter.close();
    System.out.println("Total Tokens: " + numberOfTotalTokens);
    System.out.println("Total Messages: " + numberOfMessages);
}
catch (Exception e) {
    System.out.println("Error Exception: " + e.getMessage());
}

Then I have code that reads the stop words file:

public static Set<String> readStopWordsFile() throws FileNotFoundException, IOException{
    String fileStopWords = "\\stopWords.txt";

    Set<String> stopWords = new LinkedHashSet<String>();
    FileReader readFileStopWord = new FileReader(fileStopWords);
    BufferedReader stopWordsFile = new BufferedReader(readFileStopWord);

    String line;

    while((line = stopWordsFile.readLine())!=null){
        line = line.trim();
        stopWords.add(line);
    }
    stopWordsFile.close();
    return stopWords;
}

How can I compare the tokens against the set of stop words and remove the tokens that match a stop word? Can you help me? Thanks.

You can simply read the stop words first, and then check whether each token is a stop word.

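// Read the stop words once, before the loop, not once per token
Set<String> stopWords = readStopWordsFile();

// some file reading logic
while (tokens.hasMoreTokens()) {
    // Normalize the token first, using punctuationRemover from the question.
    // This assumes the stop words file holds lowercase words.
    String token = punctuationRemover(tokens.nextToken().toLowerCase());
    if (stopWords.contains(token)) {
        continue; // skip over a stop word token
    }
    System.out.print(token + " "); // any token that reaches here is not a stop word
}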
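
For reference, here is a minimal, self-contained sketch of the same idea. The class name StopWordFilter and the helpers readStopWords, stripPunctuation and removeStopWords are made up for this illustration and are not from the question. stripPunctuation is only a guess at what punctuationRemover does, and a StringReader stands in for the stop words file so the example runs without one. It also assumes one stop word per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

class StopWordFilter {

    // Reads one stop word per line, trimmed and lowercased; blank lines are skipped.
    static Set<String> readStopWords(Reader source) throws IOException {
        Set<String> stopWords = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    stopWords.add(word);
                }
            }
        }
        return stopWords;
    }

    // Hypothetical stand-in for the question's punctuationRemover:
    // drops every character that is not a letter or a digit.
    static String stripPunctuation(String token) {
        return token.replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Tokenizes the message on whitespace, normalizes each token,
    // and keeps only the tokens that are not in the stop word set.
    static List<String> removeStopWords(String message, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(message);
        while (tokens.hasMoreTokens()) {
            String token = stripPunctuation(tokens.nextToken().toLowerCase(Locale.ROOT));
            if (token.isEmpty() || stopWords.contains(token)) {
                continue; // skip stop words and tokens that were only punctuation
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // In real code this could be new FileReader(...) pointing at the stop words file.
        Set<String> stopWords = readStopWords(new StringReader("the\nis\na\n"));
        System.out.println(removeStopWords("The cat is on a mat.", stopWords)); // prints [cat, on, mat]
    }
}
```

The stop words are loaded once into a HashSet, so each contains check is a constant-time lookup on average. The token is normalized before the lookup, otherwise "The" or "mat." would never match an entry in the set.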

