Removing Stopwords stored in a Set from an ArrayList of Strings
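I have been following a post here about removing stop words from an ArrayList (this one more than others), but I ran into some problems while customizing that code to fit my needs.
My code reads two files: a text file of stop words and a text file of data collected from Twitter. I store the stop words in a HashSet and ultimately want to remove them from the Twitter data (stored in an ArrayList). The problem is that everything else works (for example, reading the files and appending the output to a file) except the removal of the stop words.
The files I am currently using for testing are here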
public static void main(String[] args) {
    ArrayList<String> listOfWords = new ArrayList<String>();
    try {
        // Read in the stop words text file as well as the text file to edit
        Scanner stopWordsFile = new Scanner(new File("stopwords_twitter.txt"));
        Scanner textFile = new Scanner(new File("LiverpoolTest.txt"));
        // Create a set for the stop words
        Set<String> stopWords = new HashSet<String>();
        // Trim each stop word and add it to the set
        while (stopWordsFile.hasNext()) {
            stopWords.add(stopWordsFile.next().trim());
        }
        // Create an empty list for the text file's contents
        ArrayList<String> words = new ArrayList<String>();
        /* For each word in the file, correct it (removing words between the delimiters)
           and add it to the ArrayList */
        while (textFile.hasNextLine()) {
            for (String word : textFile.nextLine().trim().toLowerCase()
                    .replaceAll("/-/-/.*?/-/-/\\s*","").split("/")) {
                words.add(word);
            }
        }
        // Iterate over the ArrayList
        for (String word : words) {
            String wordCompare = word.toLowerCase();
            // If the word isn't a stop word, add it to the listOfWords ArrayList
            if (!stopWords.contains(wordCompare)) {
                listOfWords.add(word);
            }
        }
        stopWordsFile.close();
        textFile.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    try {
        File fileName;
        FileWriter fw;
        // Create a new text file for listOfWords
        fileName = new File("LiverpoolNoStopWords.txt");
        fw = new FileWriter(fileName, true);
        // Output listOfWords to the new text file
        for (String str : listOfWords) {
            String word = str + "\n";
            System.out.print(word);
            fw.write(word);
        }
        fw.close();
    } catch (IOException e) {
        System.err.println("Error. Cannot open file for writing.");
        System.exit(1);
    }
}
All it takes is debugging the program. The OP could have done this.
To test the loading of the stop words, I printed the contents of stopWords. They were correct.
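A check like this can be reproduced without the real files. The following is only a minimal sketch: the class name StopWordLoadCheck, the helper loadStopWords, and the three sample stop words are assumptions for illustration, and the Scanner reads from an in-memory string instead of stopwords_twitter.txt. Unlike the question's loop, it also lowercases each entry, and it uses a TreeSet purely so the printed order is predictable.

```java
import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;

public class StopWordLoadCheck {
    // Hypothetical helper: reads whitespace-separated stop words from any Scanner.
    static Set<String> loadStopWords(Scanner source) {
        // TreeSet is used only to get a sorted, repeatable printout.
        Set<String> stopWords = new TreeSet<>();
        while (source.hasNext()) {
            stopWords.add(source.next().trim().toLowerCase());
        }
        return stopWords;
    }

    public static void main(String[] args) {
        // Assumed sample data standing in for the stop words file.
        try (Scanner source = new Scanner("The\nwith\ndown\n")) {
            System.out.println(loadStopWords(source)); // prints [down, the, with]
        }
    }
}
```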
To test the parsing of the Twitter words, I printed wordCompare immediately after setting it:
String wordCompare = word.toLowerCase();
System.out.println("|"+wordCompare+"|");
and got:
|the redmen tv : chris sat down with spanish journalist guillem balague to talk through liverpool’s season as a whole, how real madrid have been playing and how they are likely to play against liverpool tonight!|
||
|watch now:https:|
||
|t.co|
|oqmcx3zs9c|
|subscribe: https:|
||
|t.co|
|tbybrgabge https:|
||
|t.co|
|s6010yicen|
||
|the redmen tv : real madrid v liverpool | https:|
||
|t.co|
|jlwbp8q7bf|
||
|we have hours of build up content including interviews with;|
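Clearly, the problem is that split() does not break the text into words. The call as written uses the forward slash "/" as its delimiter, so whole sentences stay in one piece and only the URLs get cut up. Change it to .split("\\s+") to split on whitespace.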
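The difference between the two delimiters can be seen on a single line of text. This is a minimal sketch, not the OP's code: the class SplitDemo, the helper tokenize, and the sample tweet text are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class SplitDemo {
    // Hypothetical helper: lowercases a line and splits it on the given regex.
    static List<String> tokenize(String line, String delimiterRegex) {
        return Arrays.asList(line.trim().toLowerCase().split(delimiterRegex));
    }

    public static void main(String[] args) {
        String line = "Watch now: https://t.co/abc"; // assumed sample text

        // Splitting on "/" leaves the sentence in one piece and chops up the URL.
        System.out.println(tokenize(line, "/"));     // prints [watch now: https:, , t.co, abc]

        // Splitting on runs of whitespace yields one entry per word.
        System.out.println(tokenize(line, "\\s+"));  // prints [watch, now:, https://t.co/abc]
    }
}
```

Note that whitespace splitting leaves punctuation attached to words ("now:"), so a stop word followed by a colon or comma would still not match the set.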
I then added a print for the case where a stop word is found:
if (!stopWords.contains(wordCompare)) {
    listOfWords.add(word);
} else {
    System.out.println("@@##$$");
}
and got:
|the|
@@##$$
|redmen|
|tv|
|:|
|chris|
|sat|
@@##$$
|down|
@@##$$
|with|
@@##$$
|spanish|
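Putting the pieces together, the filtering step with the whitespace split could look roughly like the sketch below. The class StopWordFilter, the method removeStopWords, and the three-word stop list are assumptions for illustration only; the real stop list comes from the OP's file and evidently contains more entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilter {
    // Hypothetical helper: splits each line on whitespace and keeps non-stop words.
    static List<String> removeStopWords(List<String> lines, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().toLowerCase().split("\\s+")) {
                // Skip the empty token produced by a blank line, and any stop word.
                if (!word.isEmpty() && !stopWords.contains(word)) {
                    kept.add(word);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed stop list; it must be lowercase to match the lowercased words.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "down", "with"));
        List<String> lines = Arrays.asList("The Redmen TV : Chris sat down with Spanish");
        System.out.println(removeStopWords(lines, stopWords));
        // prints [redmen, tv, :, chris, sat, spanish]
    }
}
```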