
Removing Stopwords stored in a Set from an ArrayList of Strings

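I have been following the post here on removing stop words from an ArrayList (this one more than the others). However, I have run into some problems while tailoring that code to my needs.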

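My code reads two files: a text file of stop words and a text file of data collected from Twitter. I store the stop words in a HashSet and ultimately want to remove them from the Twitter data, which is stored in an ArrayList. Everything else works (reading the files, appending the output to a file), but the stop words are not removed.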

The files I am currently using for testing are here.

public static void main(String[] args) {

    ArrayList<String> listOfWords = new ArrayList<String>();

    try {

        // Read in the stop words text file as well as the text file to edit
        Scanner stopWordsFile = new Scanner(new File("stopwords_twitter.txt"));
        Scanner textFile = new Scanner(new File("LiverpoolTest.txt"));

        // Create a set for the stop words
        Set<String> stopWords = new HashSet<String>();

        // Read each whitespace-delimited stop word, trim it and add it to the set
        while (stopWordsFile.hasNext()) {
            stopWords.add(stopWordsFile.next().trim());
        }

        // Create an empty list for the text file's contents
        ArrayList<String> words = new ArrayList<String>();
        /* For each word in the file correct (removing words between the delimiters) 
           them and add them to the ArrayList */
        while (textFile.hasNextLine()) {
            for (String word : textFile.nextLine().trim().toLowerCase()
                    .replaceAll("/-/-/.*?/-/-/\\s*","").split("/")) {
                words.add(word);
            }
        }

        // Iterate over the ArrayList 
        for(String word : words) {
            String wordCompare = word.toLowerCase();
            // If the word isn't a stop word, add to listOfWords ArrayList
            if (!stopWords.contains(wordCompare)) {
                listOfWords.add(word);
            }
        }

        stopWordsFile.close();
        textFile.close();

    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }

    try {

        File fileName;
        FileWriter fw;

        // Create a new textfile for listOfWords
        fileName = new File("LiverpoolNoStopWords.txt");
        fw = new FileWriter(fileName, true);

        // Output listOfWords to a new textfile 
        for (String str : listOfWords) {
            String word = str + "\n";
            System.out.print(word);
            fw.write(word);
        }

        fw.close();

    } catch(IOException e){
        System.err.println("Error. Cannot open file for writing.");
        System.exit(1);
    }
}

All it took was to debug the program, which is something the OP could have done.

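  1. To test the loading of the stop words, I printed the contents of stopWords. They were correct.

  2. To test the parsing of the Twitter words, I printed wordCompare immediately after it was set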

      String wordCompare = word.toLowerCase();
      System.out.println("|" + wordCompare + "|");

and got:

            |the redmen tv : chris sat down with spanish journalist guillem balague to talk through liverpool’s season as a whole, how real madrid have been playing and how they are likely to play against liverpool tonight!|
            ||
            |watch now:https:|
            ||
            |t.co|
            |oqmcx3zs9c|
            |subscribe: https:|
            ||
            |t.co|
            |tbybrgabge https:|
            ||
            |t.co|
            |s6010yicen|
            ||
            |the redmen tv : real madrid v liverpool | https:|
            ||              
            |t.co|
            |jlwbp8q7bf|
            ||
            |we have hours of build up content including interviews with;|
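
  3. Clearly, the problem is that split() does not break the line into words: the call uses the forward slash "/" as the delimiter. I changed it to .split("\\s+") to split on whitespace.

  4. I added a print for the case where a stop word is found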

      if (!stopWords.contains(wordCompare)) {
          listOfWords.add(word);
      } else {
          System.out.println("@@##$$");
      }

and got:

            |the|               
            @@##$$
            |redmen|
            |tv|
            |:|
            |chris|
            |sat|
            @@##$$
            |down|
            @@##$$
            |with|
            @@##$$
            |spanish|
