I have a file, filterlist.txt, that is a list of words, one word per line. The other file, text.txt, is a giant string of text.
I want to find all the instances of the words from filterlist.txt in text.txt and delete them.
Here is what I have so far:
text = open('ttext.txt').read().split()
filter_words = open('filterlist.txt').readline()
for line in text:
    for word in filter_words:
        if word == filter_words:
            text.remove(word)
Store the filter words in a set, iterate over the words from the line in ttext.txt, and only keep the words that are not in the set of filter words.
with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    st = set(map(str.rstrip, filter_words))
    txt = next(text).split()
    out = [word for word in txt if word not in st]
If you want to ignore case and remove punctuation, you will need to call lower() on each line and strip the punctuation:
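from string import punctuation

with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # Normalize the filter words: lowercase, no trailing punctuation or newline.
    st = set(word.lower().rstrip(punctuation + "\n") for word in filter_words)
    txt = next(text).lower().split()
    # Strip punctuation from each text word before the lookup.
    out = [word for word in txt if word.strip(punctuation) not in st]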
If you had multiple lines in ttext.txt, using (word for line in text for word in line.split()) would be a more memory-efficient approach.
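A minimal sketch of that multi-line variant, assuming the filter list has one word per line as in the question. The function name filter_words_from_lines and the in-memory sample data are hypothetical and only there for illustration; any iterable of lines, such as an open file, should work in their place.

```python
import io


def filter_words_from_lines(lines, filter_lines):
    """Return a generator of the words in `lines` that are not filtered out."""
    # Build the set once; membership tests against it are O(1) on average.
    blocked = {line.strip() for line in filter_lines if line.strip()}
    # The generator expression walks the text one line at a time,
    # so the whole text is never split into one big list.
    return (word for line in lines for word in line.split() if word not in blocked)


# Stand-ins for open('ttext.txt') and open('filterlist.txt').
text = io.StringIO("hello foo apple\ncar water cat\n")
filter_list = io.StringIO("apple\ncar\n")

kept = list(filter_words_from_lines(text, filter_list))
print(" ".join(kept))
```

Calling list() here is only for the demonstration; when writing the result to another file, the generator can be consumed word by word instead.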
Using Padraic Cunningham's principle, I wrote this as a function:
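def vocab_filter(text, filter_vocab):
    # Treat newlines as spaces, lowercase everything, then split into words.
    txt = text.replace('\n', ' ').lower().split()
    # filter_vocab should be a set so that each lookup is fast.
    out = [word for word in txt if word not in filter_vocab]
    return out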
It is very important to pass a set, not a list, as the second argument. Lookups in a list are O(n), while lookups in a set (as in a dictionary) are O(1) on average. So for big files the set is much faster.
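To see the difference, here is a rough timing sketch. The vocabulary size, the sample words, and the helper name keep are all made up for illustration, and the actual timings will vary from machine to machine; only the relative gap matters.

```python
import timeit

# Hypothetical filter vocabulary of 10,000 words, as a list and as a set.
filter_list = [f"word{i}" for i in range(10_000)]
filter_set = set(filter_list)

# Sample text: one word at the very end of the list, one not in it at all.
txt = ["word9999", "missing"] * 100


def keep(words, vocab):
    # Same comprehension as in vocab_filter; only the container type differs.
    return [w for w in words if w not in vocab]


list_time = timeit.timeit(lambda: keep(txt, filter_list), number=5)
set_time = timeit.timeit(lambda: keep(txt, filter_set), number=5)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both calls return the same words; the list version just has to scan the vocabulary for every word it checks.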
Let's say this is what you have in the text.txt file:
'hello foo apple car water cat'
and this is what you have in the filterlist.txt file:
apple car
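# Drop surrounding whitespace and the quote characters, then split into words.
text = open('text.txt').read().strip().strip("'").split()
filter_words = open('filterlist.txt').readline().split()
for i in filter_words:
    # list.remove() deletes only the first match, so loop until none are left.
    while i in text:
        text.remove(i)
new_text = ' '.join(text)
print(new_text)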
The output will be:
hello foo water cat