Filter out words listed in one text file from another text file?

I have a file, filterlist.txt, that is a list of words, one word per line. The other file, text.txt, is one giant string of text.

I want to find all the instances of the words from filterlist.txt in text.txt and delete them.

Here is what I have so far:

text = open('ttext.txt').read().split()
filter_words = open('filterlist.txt').readline()

for line in text:
    for word in filter_words:
        if word == filter_words:
            text.remove(word)

Store the filter words in a set, iterate over the words from the line in ttext.txt, and keep only the words that are not in the set of filter words.

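with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # One filter word per line; rstrip removes the trailing newline.
    st = set(map(str.rstrip, filter_words))
    # ttext.txt is a single line of text, so read just the first line.
    txt = next(text).split()
    out = [word for word in txt if word not in st]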

If you want to ignore case and remove punctuation you will need to call lower on each line and strip the punctuation:

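from string import punctuation

with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # Lowercase each filter word and strip trailing punctuation and the newline.
    st = set(word.lower().rstrip(punctuation + "\n") for word in filter_words)
    txt = next(text).lower().split()
    # Strip punctuation before the lookup so that "apple," still matches "apple".
    out = [word for word in txt if word.strip(punctuation) not in st]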

If ttext.txt had multiple lines, using the generator expression (word for line in text for word in line.split()) would be a more memory-efficient approach.
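A minimal sketch of that multi-line variant, written as a generator function so the text file is read one line at a time. The function name iter_kept_words and its two path parameters are made up for illustration. It assumes one filter word per line, as in the question.

```python
def iter_kept_words(text_path, filter_path):
    """Yield the words from text_path that are not listed in filter_path."""
    with open(filter_path) as filter_words:
        # One filter word per line; rstrip removes the trailing newline.
        st = set(map(str.rstrip, filter_words))
    with open(text_path) as text:
        # Iterating over the file object reads it lazily, line by line.
        for line in text:
            for word in line.split():
                if word not in st:
                    yield word
```

Calling ' '.join(iter_kept_words('ttext.txt', 'filterlist.txt')) would then give the filtered text as a single string.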

Using Padraic Cunningham's approach, I wrapped this in a function:

def vocab_filter(text, filter_vocab):
    # filter_vocab should be a set of lowercase words.
    txt = text.replace('\n', ' ').lower().split()
    out = [word for word in txt if word not in filter_vocab]
    return out

It is very important to pass a set, not a list, as the second argument. Lookups in a list are O(n), while lookups in a set (as in a dictionary) are O(1) on average. For big files this makes a large difference.
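A short usage sketch of the point about sets. The sample string and the names filter_list, filter_set and sample are made up for illustration, and the function is repeated so the snippet runs on its own. Both containers give the same result; only the lookup cost differs.

```python
def vocab_filter(text, filter_vocab):
    # Same logic as the function above, repeated so this snippet is self-contained.
    txt = text.replace('\n', ' ').lower().split()
    return [word for word in txt if word not in filter_vocab]

filter_list = ['apple', 'car']   # list: every lookup scans the elements in turn
filter_set = set(filter_list)    # set: every lookup is a hash probe

sample = 'Hello foo apple\ncar water cat'
result = vocab_filter(sample, filter_set)
print(result)  # ['hello', 'foo', 'water', 'cat']
```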

Suppose text.txt contains 'hello foo apple car water cat' and filterlist.txt contains: apple car

text = open('text.txt').read().strip().strip("'").split(' ')
filter_words = open('filterlist.txt').readline().split()
for i in filter_words:
    # list.remove() deletes only the first match, so loop until none are left.
    while i in text:
        text.remove(i)
new_text = ' '.join(text)
print(new_text)

The output will be:

hello foo water cat
