Filter words from one text file in another text file?

Question

I have a file, filterlist.txt, that is a list of words, one word on each line. The other file, text.txt, is a giant string of text.

I want to find all the instances of the words from filterlist.txt in text.txt and delete them.

Here is what I have so far:

text = open('ttext.txt').read().split()
filter_words = open('filterlist.txt').readline()

for line in text:
    for word in filter_words:
        if word == filter_words:
            text.remove(word)

Answer 1

Store the filter words in a set, iterate over the words from the line in ttext.txt, and keep only the words that are not in the set of filter words.

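with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # One filter word per line; rstrip removes the trailing newline.
    st = set(map(str.rstrip, filter_words))
    # next(text) reads only the first line of ttext.txt.
    txt = next(text).split()
    out = [word for word in txt if word not in st]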

If you want to ignore case and remove punctuation, you will need to call lower on each line and strip the punctuation:

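from string import punctuation

with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # Lower-case each filter word and strip trailing punctuation and the newline.
    st = set(word.lower().rstrip(punctuation + "\n") for word in filter_words)
    # Lower-case the first line of the text so the comparison ignores case.
    txt = next(text).lower().split()
    out = [word for word in txt if word not in st]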
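Note that the snippet above strips punctuation only from the filter words. If the text itself has punctuation attached to words (for example apple,), one possible variation is to compare a stripped, lower-cased copy of each word while keeping the original word in the output. This is only a sketch: the helper name filter_ignoring_punctuation and the sample strings are made up for illustration.

```python
from string import punctuation


def filter_ignoring_punctuation(text, filter_words):
    """Drop words whose lower-cased, punctuation-stripped form is a filter word.

    Hypothetical helper, not part of the original answer.
    """
    # Normalise the filter words once: no newline, no punctuation, lower case.
    st = {word.strip().strip(punctuation).lower() for word in filter_words}
    # Compare the normalised form, but keep the word as it appeared in the text.
    return [word for word in text.split()
            if word.strip(punctuation).lower() not in st]


# Sample data (assumed), standing in for the contents of the two files.
cleaned = filter_ignoring_punctuation("Hello, foo! Apple, car. water",
                                      ["apple\n", "Car\n"])
print(cleaned)
```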

If ttext.txt had multiple lines, using (word for line in text for word in line.split()) would be a more memory-efficient approach.
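As a rough sketch of the multi-line case, the generator expression can be dropped into the same set-based filter. The helper name filter_lines and the sample data are assumptions for illustration; any iterable of lines, such as an open file object, should work in place of the lists.

```python
def filter_lines(lines, filter_words):
    """Return the words from lines that are not in filter_words (hypothetical helper)."""
    st = set(word.strip() for word in filter_words)
    # Generator expression: words are produced one line at a time,
    # so the whole text never has to be split into one big list first.
    words = (word for line in lines for word in line.split())
    return [word for word in words if word not in st]


# Sample data standing in for ttext.txt and filterlist.txt (assumed contents).
sample_text = ["hello foo apple\n", "car water cat\n"]
sample_filter = ["apple\n", "car\n"]
result = filter_lines(sample_text, sample_filter)
print(result)
```

With real files, the two lists could be replaced by the file objects opened in the with statement.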

Answer 2

Using Padraic Cunningham's principle, I coded this as a function:

def vocab_filter(text, filter_vocab):
    """Return the lower-cased words of text that are not in filter_vocab."""
    # split() with no arguments already splits on newlines as well as spaces.
    txt = text.lower().split()
    out = [word for word in txt if word not in filter_vocab]
    return out

It is very important to pass a set, not a list, as the second argument. Lookups in a list are O(n), while lookups in a set (as in a dictionary) are O(1) on average, so for big files the set is much faster.
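To illustrate the point about the second argument, here is a minimal sketch that builds the set once from the lines of the filter file and then filters a string. The helper load_filter_set and the sample strings are assumptions for this example, not part of the answer.

```python
def load_filter_set(lines):
    """Build a lower-cased set of filter words from an iterable of lines (hypothetical helper)."""
    # Blank lines are skipped so the empty string never ends up in the set.
    return {line.strip().lower() for line in lines if line.strip()}


def vocab_filter(text, filter_vocab):
    # Same idea as the function above: keep words not found in filter_vocab.
    return [word for word in text.lower().split() if word not in filter_vocab]


# Sample data (assumed); in practice the lines would come from filterlist.txt.
filter_vocab = load_filter_set(["Apple\n", "car\n", "\n"])
kept = vocab_filter("Hello foo\napple CAR water cat", filter_vocab)
print(kept)
```

Each "word not in filter_vocab" test is a hash lookup here, so the total work grows with the number of words in the text rather than with the product of text length and filter length.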

Answer 3

Suppose this is what you have in the text.txt file: 'hello foo apple car water cat', and this is what you have in the filterlist.txt file: apple car

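text = open('text.txt').read().strip().strip("'").split()
filter_words = open('filterlist.txt').readline().split()
for i in filter_words:
    # list.remove deletes only the first match, so loop until none are left.
    while i in text:
        text.remove(i)
# Join once, after all filter words have been removed.
new_text = ' '.join(text)
print(new_text)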

The output will be:

hello foo water cat

3 answers

solution1
1 ACCEPTED 2015-07-10 14:37:28

solution2
0 2016-12-01 19:38:32

solution3
-1 2015-07-10 15:27:39
