Removing stop words without using the NLTK corpus

Question

I am trying to remove stop words from a text file without using NLTK. I have three text files: f1, f2, and f3. f1 contains the text, line by line; f2 contains the list of stop words; f3 is empty. I want to read f1 line by line, and each line word by word, and check whether each word is in f2 (the stop words). If the word is not a stop word, it should be written to f3. In the end, f3 should contain the same text as f1, but with the stop words from f2 removed from each line.

f1 = open("file1.txt","r")
f2 = open("stop.txt","r")
f3 = open("file2.txt","w")

for line in f1:
    words = line.split()
    for word in words:
        t=word

for line in f2:
    w = line.split()
    for word in w:
        t1=w
        if t!=t1:
            f3.write(word)

f1.close()
f2.close()
f3.close()

This code is wrong. Can anyone show how to do this task by changing the code?

Thanks in Advance.

Answer 1

You can use the Linux sed command to remove the stop words:

sed -f <(sed 's/.*/s|\\\<&\\\>||g/' stopwords.txt) all_lo.txt > all_remove1.txt
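
For readers who want the same idea in Python, here is a minimal sketch of what the sed one-liner does: it turns every stop word into a whole-word pattern and deletes the matches. The helper names build_stopword_pattern and strip_stopwords and the sample words are hypothetical, and Python's \b is only a rough stand-in for the \< and \> word-boundary operators used above (those are a GNU sed extension, and the <(...) part needs a shell such as bash).

```python
import re

def build_stopword_pattern(stopwords):
    # Escape each word so regex metacharacters in it are matched literally,
    # then join everything into one whole-word alternation.
    alternatives = "|".join(re.escape(word) for word in stopwords)
    return re.compile(r"\b(?:" + alternatives + r")\b")

def strip_stopwords(line, pattern):
    # Delete every whole-word match; surrounding spaces are left untouched,
    # just as in the sed version.
    return pattern.sub("", line)

# Hypothetical sample data, standing in for stopwords.txt and one input line.
pattern = build_stopword_pattern(["the", "is", "a"])
result = strip_stopwords("the cat is on a mat", pattern)
print(repr(result))
```

As with the sed command, the deleted words leave their surrounding spaces behind, so a follow-up step may be needed if single spacing matters.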

Answer 2

What I would personally do is loop through the list of stop words (f2) and append each word to a list in your script. Ex:

stoplist = []
file1 = open('f1.txt', 'r')
file2 = open('f2.txt', 'r')
file3 = open('f3.txt', 'a')  # append mode: adds to the end of f3.txt if it already exists
# Collect every stop word from file2 into stoplist.
for line in file2:
    w = line.split()
    for word in w:
        stoplist.append(word)
# Copy file1 to file3 line by line, skipping the stop words.
for line in file1:
    kept = [word for word in line.split() if word not in stoplist]
    file3.write(' '.join(kept) + '\n')
file1.close()
file2.close()
file3.close()
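
A variation on this approach, offered only as a sketch: keep the stop words in a set instead of a list, so each membership check does not scan the whole collection, and open the files in with blocks so they are closed automatically. The function names load_stop_words, filter_line, and remove_stop_words are hypothetical, and the file paths are whatever you pass in.

```python
def load_stop_words(lines):
    # Accepts any iterable of text lines (an open file works) and
    # returns the set of all whitespace-separated words in it.
    return {word for line in lines for word in line.split()}

def filter_line(line, stop_words):
    # Keep the words of one line that are not stop words, in order.
    kept = [word for word in line.split() if word not in stop_words]
    return " ".join(kept)

def remove_stop_words(in_path, stop_path, out_path):
    # Read the stop words once, then copy the input line by line.
    with open(stop_path) as stop_file:
        stop_words = load_stop_words(stop_file)
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(filter_line(line, stop_words) + "\n")
```

Assuming the file names from the question, it could be called as remove_stop_words("file1.txt", "stop.txt", "file2.txt"). The comparison is exact, so "The" and "the" are treated as different words.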

Answer 3

Your first for loop is wrong: with for word in words: t=word, t never holds all the words, only the most recently assigned one. words is already a list, so you can work with it directly. Also, if your files contain multiple lines, your list will not contain all the words. You can do it like this. Note that it writes all the remaining words on a single line, separated by spaces:

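f1 = open("a.txt", "r")
f2 = open("b.txt", "r")
f3 = open("c.txt", "w")
first_words = []
second_words = []
# Collect every word of the text file.
for line in f1:
    words = line.split()
    for w in words:
        first_words.append(w)

# Collect every stop word.
for line in f2:
    w = line.split()
    for i in w:
        second_words.append(i)

# Keep only the words that are not stop words. A new list is built because
# calling remove() on first_words while looping over it skips items.
kept_words = []
for word1 in first_words:
    if word1 not in second_words:
        kept_words.append(word1)

for word in kept_words:
    f3.write(word)
    f3.write(' ')

f1.close()
f2.close()
f3.close()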
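
One pitfall worth knowing when filtering a word list like this: removing items from a list while a for loop is iterating over that same list can skip elements, so repeated stop words may survive. The small sketch below contrasts that with building a new list; the function names and sample data are hypothetical.

```python
def remove_in_place(words, stop_words):
    # Mutates the list while iterating over it. After each remove() the
    # remaining items shift left, so the loop skips the next element.
    for word in words:
        if word in stop_words:
            words.remove(word)
    return words

def filter_to_new_list(words, stop_words):
    # Builds a separate list, so the iteration is never disturbed.
    return [word for word in words if word not in stop_words]

sample = ["the", "the", "cat"]
buggy = remove_in_place(list(sample), {"the"})
clean = filter_to_new_list(sample, {"the"})
print(buggy)  # ['the', 'cat']  (the second "the" was skipped)
print(clean)  # ['cat']
```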

3 answers

solution1
1 2016-06-25 12:19:47

solution2
0 2014-07-06 07:00:43

solution3
0 2014-07-06 07:04:29
