Removing stop words stored in a Set from an ArrayList of Strings

Question

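I have been following the posts here about removing stop words from an ArrayList (one of them more so than the others), but I ran into some problems while adapting that code to my needs.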

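My code reads two files: a text file of stop words and a text file of data collected from Twitter. I store the stop words in a HashSet and ultimately want to remove them from the Twitter data (stored in an ArrayList). The problem is that everything works (for example, reading the files and appending the output to a file) except the removal of the stop words.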

The files I am currently using for testing are here.

public static void main(String[] args) {

    ArrayList<String> listOfWords = new ArrayList<String>();

    try {

        // Read in the stop words text file as well as the text file to edit
        Scanner stopWordsFile = new Scanner(new File("stopwords_twitter.txt"));
        Scanner textFile = new Scanner(new File("LiverpoolTest.txt"));

        // Create a set for the stop words
        Set<String> stopWords = new HashSet<String>();

        // Read each stop word and trim surrounding whitespace (no lowercasing is done here)
        while (stopWordsFile.hasNext()) {
            stopWords.add(stopWordsFile.next().trim());
        }

        // Create an empty list for the text file's contents
        ArrayList<String> words = new ArrayList<String>();
        /* For each line in the file, strip the text between the /-/-/ delimiters,
           split what remains, and add each piece to the ArrayList */
        while (textFile.hasNextLine()) {
            for (String word : textFile.nextLine().trim().toLowerCase()
                    .replaceAll("/-/-/.*?/-/-/\\s*","").split("/")) {
                words.add(word);
            }
        }

        // Iterate over the ArrayList 
        for(String word : words) {
            String wordCompare = word.toLowerCase();
            // If the word isn't a stop word, add to listOfWords ArrayList
            if (!stopWords.contains(wordCompare)) {
                listOfWords.add(word);
            }
        }

        stopWordsFile.close();
        textFile.close();

    } catch(FileNotFoundException e){
        e.printStackTrace();
    }

    try {

        File fileName;
        FileWriter fw;

        // Create a new textfile for listOfWords
        fileName = new File("LiverpoolNoStopWords.txt");
        fw = new FileWriter(fileName, true);

        // Output listOfWords to a new textfile 
        for (String str : listOfWords) {
            String word = str + "\n";
            System.out.print(word);
            fw.write(word);
        }

        fw.close();

    } catch(IOException e){
        System.err.println("Error. Cannot open file for writing.");
        System.exit(1);
    }
}

Answer 1

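All this takes is debugging the program, which the OP could have done.

To test the loading of the stop words, I printed the contents of stopWords. They were correct.

To test the parsing of the tweet words, I printed wordCompare immediately after setting it: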

    String wordCompare = word.toLowerCase();
    System.out.println("|" + wordCompare + "|");

and got:

            |the redmen tv : chris sat down with spanish journalist guillem balague to talk through liverpool’s season as a whole, how real madrid have been playing and how they are likely to play against liverpool tonight!|
            ||
            |watch now:https:|
            ||
            |t.co|
            |oqmcx3zs9c|
            |subscribe: https:|
            ||
            |t.co|
            |tbybrgabge https:|
            ||
            |t.co|
            |s6010yicen|
            ||
            |the redmen tv : real madrid v liverpool | https:|
            ||              
            |t.co|
            |jlwbp8q7bf|
            ||
            |we have hours of build up content including interviews with;|

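Clearly, the problem is that split() does not split the text into words. As written, the split expects a forward slash "/" as the delimiter. Change it to .split("\\s+") to split on whitespace.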
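
The difference between the two delimiters can be seen in a small standalone sketch. The class name TokenizeSketch and the helper tokenize are made up for illustration, the sample line is loosely based on the output above, and the replaceAll step from the question is left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {

    // Hypothetical helper: splits one line of text into lowercase tokens
    // on runs of whitespace instead of on "/".
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        String cleaned = line.trim().toLowerCase();
        if (cleaned.isEmpty()) {
            // "".split("\\s+") would return a single empty string, so skip blank lines.
            return tokens;
        }
        for (String token : cleaned.split("\\s+")) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "Watch now: https://t.co/oqmcx3zs9c";
        // Splitting on "/" cuts the URL apart and keeps "watch now: https:" as one piece.
        System.out.println(String.join("|", line.toLowerCase().split("/")));
        // Splitting on whitespace gives one token per whitespace-separated word.
        System.out.println(tokenize(line));
    }
}
```

Because whole phrases never equal a single stop word, the contains() check in the question could not match anything while the text was split on "/".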

I then added a print for when a stop word is found:

    if (!stopWords.contains(wordCompare)) {
        listOfWords.add(word);
    } else {
        System.out.println("@@##$$");
    }

and got:

            |the|               
            @@##$$
            |redmen|
            |tv|
            |:|
            |chris|
            |sat|
            @@##$$
            |down|
            @@##$$
            |with|
            @@##$$
            |spanish|
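
Once the text is split on whitespace, the filtering loop from the question behaves as intended. The following is a minimal sketch of that step in isolation, not the OP's full program: the class name StopWordFilterSketch, the helper removeStopWords, and the three sample stop words are assumptions made for illustration, whereas the real list would come from stopwords_twitter.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilterSketch {

    // Hypothetical helper: returns the words that are not in stopWords,
    // comparing in lowercase as the loop in the question does.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word.toLowerCase())) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative stop words only (an assumption, not the real file contents).
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "down", "with"));
        String line = "the redmen tv : chris sat down with spanish journalist";
        List<String> words = Arrays.asList(line.split("\\s+"));
        System.out.println(removeStopWords(words, stopWords));
    }
}
```

Two caveats apply to this sketch. The lookup lowercases each word, so the entries in the stop word set need to be lowercase as well. Tokens with punctuation attached, such as "tonight!", will not match a plain stop word entry, and stripping punctuation would be a separate step.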
