Removing stop words without using the nltk corpus

Question

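I am trying to remove stop words from a text file without using nltk. I have three text files: f1, f2 and f3. f1 contains the text, line by line; f2 contains the list of stop words; f3 is an empty file. I want to read f1 line by line and then word by word, checking whether each word is in f2 (the stop words). If a word is not a stop word, it should be written to f3. In the end, f3 should contain the same text as f1, but with the words from f2 (the stop words) removed from every line.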

f1 = open("file1.txt","r")
f2 = open("stop.txt","r")
f3 = open("file2.txt","w")

for line in f1:
    words = line.split()
    for word in words:
        t=word

for line in f2:
    w = line.split()
    for word in w:
        t1=w
        if t!=t1:
            f3.write(word)

f1.close()
f2.close()
f3.close()

This code is wrong. Could anyone change it so that it does this task?

Thanks in advance.

Answer 1

You can use sed on Linux to remove the stop words:

sed -f <(sed 's/.*/s|\\\<&\\\>||g/' stopwords.txt) all_lo.txt > all_remove1.txt
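The one-liner above turns each line of stopwords.txt into a substitution of the form s|\<word\>||g, so a stop word is deleted wherever it appears as a whole word. For comparison, here is a minimal Python sketch of the same whole-word idea using the standard re module. The function name strip_stop_words and the sample data are made up for illustration, and the sketch assumes the stop words are plain words made of letters or digits.

```python
import re

def strip_stop_words(text, stop_words):
    """Delete every whole-word occurrence of a stop word from text."""
    if not stop_words:
        return text
    # \b marks a word boundary, playing the role of \< and \> in GNU sed.
    # re.escape keeps characters such as '.' from acting as regex syntax.
    alternatives = "|".join(re.escape(word) for word in stop_words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return pattern.sub("", text)

sample = "the cat sat on the mat"
cleaned = strip_stop_words(sample, ["the", "on"])
print(cleaned.split())
```

Like the sed command, this deletes only the words themselves and leaves the surrounding spaces in place, which is why the example splits the result before printing it.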

Answer 2

Personally, I would loop through the stop word list (f2) and append each word to a list in the script. For example:

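stoplist = []
file1 = open('f1.txt', 'r')
file2 = open('f2.txt', 'r')
file3 = open('f3.txt', 'a')  # append mode: new text is added at the end of the file

# Collect every stop word from f2 into a list
for line in file2:
    w = line.split()
    for word in w:
        stoplist.append(word)

# Copy f1 to f3 line by line, skipping the stop words
for line in file1:
    kept = []
    for word in line.split():
        if word in stoplist:
            continue
        kept.append(word)
    # Rejoin the remaining words so the output keeps one line per input line
    file3.write(' '.join(kept) + '\n')

file1.close()
file2.close()
file3.close()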
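A variation on the same approach, offered as a sketch rather than a definitive version: storing the stop words in a set makes each membership test constant time on average, which matters for long stop word lists. The function name filter_stop_words and the in-memory io.StringIO objects standing in for the three files are assumptions made for the example. Matching is exact and case-sensitive, so a word with punctuation attached will not match.

```python
import io

def filter_stop_words(text_lines, stop_lines, out):
    """Write each line of text_lines to out with the stop words removed."""
    # A set gives fast membership tests; split() copes with one or
    # several stop words per line.
    stop_words = {word for line in stop_lines for word in line.split()}
    for line in text_lines:
        kept = [word for word in line.split() if word not in stop_words]
        # Rejoin with single spaces and keep one output line per input line.
        out.write(" ".join(kept) + "\n")

# In-memory stand-ins for f1, f2 and f3. With real files one would write
# `with open(...) as f1, open(...) as f2, open(..., "w") as f3:` so the
# files are closed even if an error occurs.
f1 = io.StringIO("this is a test\nanother line of text\n")
f2 = io.StringIO("is\na\nof\n")
f3 = io.StringIO()
filter_stop_words(f1, f2, f3)
print(f3.getvalue())
```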

Answer 3

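Your first for loop is wrong: with for word in words: t=word, t does not hold all of the words. words is already a list, and you can work with it directly. Also, if the file contains several lines, that list does not contain all of the words. You need to do it like this: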

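f1 = open("a.txt", "r")
f2 = open("b.txt", "r")
f3 = open("c.txt", "w")
first_words = []
second_words = []

# Collect every word of the text file
for line in f1:
    words = line.split()
    for w in words:
        first_words.append(w)

# Collect every stop word
for line in f2:
    w = line.split()
    for i in w:
        second_words.append(i)

# Keep only the words that are not stop words. Building a new list avoids
# removing items from first_words while looping over it, which skips elements.
kept_words = [word for word in first_words if word not in second_words]

# Write the remaining words separated by spaces (line breaks are not kept)
for word in kept_words:
    f3.write(word)
    f3.write(' ')

f1.close()
f2.close()
f3.close()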
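One thing to watch for in this kind of filtering loop: calling remove() on a list while a for loop is iterating over it shifts the remaining items to the left, so the element right after a removed one is skipped. A small sketch with made-up data that shows the effect next to a filtering alternative:

```python
stop_words = ["a", "the"]

# Removing during iteration: "the" slides into the slot that was just
# visited, so the loop never looks at it.
words = ["a", "the", "cat"]
for word in words:
    if word in stop_words:
        words.remove(word)
print(words)

# Building a new list leaves the list being iterated undisturbed.
original = ["a", "the", "cat"]
filtered = [word for word in original if word not in stop_words]
print(filtered)
```

The first loop leaves one stop word behind, while the list comprehension removes both.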

3 answers

Answer 1
1 2016-06-25 12:19:47

Answer 2
0 2014-07-06 07:00:43

Answer 3
0 2014-07-06 07:04:29
