
Removing stop words without using the nltk corpus

I am trying to remove stop words from a text file without using nltk. I have three text files: f1, f2 and f3. f1 contains the text, line by line; f2 contains the list of stop words; f3 is an empty file. I want to read f1 line by line, and each line word by word, and check whether each word is in f2 (the stop words). If the word is not a stop word, it should be written to f3. At the end, f3 should contain the same text as f1, except that in each line the words listed in f2 (the stop words) have been removed.

f1 = open("file1.txt","r")
f2 = open("stop.txt","r")
f3 = open("file2.txt","w")

for line in f1:
    words = line.split()
    for word in words:
        t=word

for line in f2:
    w = line.split()
    for word in w:
        t1=w
        if t!=t1:
            f3.write(word)

f1.close()
f2.close()
f3.close()

This code is wrong. Can anyone complete this task by changing the code?

Thanks in advance.

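You can remove the stop words on Linux with sed. The command below relies on bash process substitution and on GNU sed's \< and \> word-boundary anchors: the inner sed turns each line of stopwords.txt into a substitution command that deletes that word, and the outer sed applies all of those commands to all_lo.txt.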

sed -f <(sed 's/.*/s|\\\<&\\\>||g/' stopwords.txt) all_lo.txt > all_remove1.txt
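
For anyone who would rather stay in Python, here is a rough equivalent of the same whole-word idea using the re module. It is only a sketch: the function names build_stopword_pattern and strip_stopwords are made up for illustration, and unlike the sed command it also collapses the extra spaces left behind where a word was deleted.

```python
import re

def build_stopword_pattern(stopwords):
    # Escape each word so characters such as "." are matched literally,
    # then wrap the alternation in \b...\b so only whole words match.
    alternation = "|".join(re.escape(w) for w in stopwords)
    return re.compile(r"\b(?:" + alternation + r")\b")

def strip_stopwords(line, pattern):
    # Delete the stop words, then rejoin the remaining words with single spaces.
    return " ".join(pattern.sub("", line).split())

# Hypothetical stop word list, used only for this demonstration.
pattern = build_stopword_pattern(["the", "is", "a"])
print(strip_stopwords("the cat is on a mat", pattern))
```

Because of the \b anchors, a word such as "theme" is left alone even though it starts with "the".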

What I would personally do is loop through the file of stop words (f2) and append each word to a list in your script, then check every word of f1 against that list. For example:

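stoplist = []
file1 = open('f1.txt', 'r')
file2 = open('f2.txt', 'r')
file3 = open('f3.txt', 'a')  # append mode: writes go to the end of f3, nothing is truncated
# collect every stop word from f2
for line in file2:
    for word in line.split():
        stoplist.append(word)
# copy f1 to f3 line by line, skipping the stop words
for line in file1:
    kept = []
    for word in line.split():
        if word not in stoplist:
            kept.append(word)
    # put the spaces and the line break back, since split() dropped them
    file3.write(' '.join(kept) + '\n')
file1.close()
file2.close()
file3.close()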
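
A slightly more idiomatic take on the same approach, as a sketch rather than the one true way: keep the stop words in a set (membership tests on a set are fast no matter how many stop words there are) and open the files with with-statements so they are closed automatically. The helper names load_stopwords, filter_line and remove_stopwords are my own, not from the question.

```python
def load_stopwords(lines):
    # Accepts any iterable of lines (an open file works) and returns a set.
    return {word for line in lines for word in line.split()}

def filter_line(line, stopwords):
    # Keep only the words that are not stop words, preserving their order.
    return " ".join(word for word in line.split() if word not in stopwords)

def remove_stopwords(text_path, stop_path, out_path):
    # The with-statements close the files even if an error occurs.
    with open(stop_path) as stop_file:
        stopwords = load_stopwords(stop_file)
    with open(text_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(filter_line(line, stopwords) + "\n")

# Small in-memory demonstration, no files needed.
stopwords = load_stopwords(["the is\n", "a\n"])
print(filter_line("the cat is on a mat\n", stopwords))
```

With the file names from the question, the call would be remove_stopwords("file1.txt", "stop.txt", "file2.txt"). Note that the comparison is case-sensitive, so "The" is not removed unless it is in the stop list too.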

Your first for loop is wrong: with for word in words: t=word, t never holds all the words, only the last one it was assigned. words is already a list, so you can work with it directly. Also, if your files contain multiple lines, t still does not contain all the words. Do it like this instead; it works correctly:

f1 = open("a.txt", "r")
f2 = open("b.txt", "r")
f3 = open("c.txt", "w")
first_words = []
second_words = []
# collect every word of the text
for line in f1:
    for w in line.split():
        first_words.append(w)

# collect every stop word
for line in f2:
    for i in line.split():
        second_words.append(i)

# build a new list rather than removing from first_words while
# looping over it, which would skip some words
kept_words = []
for word1 in first_words:
    if word1 not in second_words:
        kept_words.append(word1)

# note: all the kept words end up on one line, separated by spaces
for word in kept_words:
    f3.write(word)
    f3.write(' ')

f1.close()
f2.close()
f3.close()
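
One pitfall worth spelling out with this kind of list-based filtering: calling remove() on a list while a for loop is iterating over that same list shifts the later items to the left, so the loop skips the element right after each removal. The small demonstration below uses made-up function names (remove_in_place, filter_copy) and a made-up sample list to show the difference.

```python
def remove_in_place(words, stopwords):
    # Buggy on purpose: the list is modified while it is being iterated.
    words = list(words)  # work on a copy so the caller's list is untouched
    for w in words:
        if w in stopwords:
            words.remove(w)
    return words

def filter_copy(words, stopwords):
    # Safe: builds a new list and never mutates the one being read.
    return [w for w in words if w not in stopwords]

sample = ["the", "the", "cat"]
print(remove_in_place(sample, {"the"}))  # ['the', 'cat'] (one stop word survives)
print(filter_copy(sample, {"the"}))      # ['cat']
```

Two stop words in a row are enough to trigger the problem, which is why building a new list is the safer habit.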
