
Filter words from one text file in another text file?

I have a file that is a list of words, one word on each line: filterlist.txt. The other file is a giant string of text: text.txt.

I want to find all the instances of the words from filterlist.txt in text.txt and delete them.

Here is what I have so far:

text = open('ttext.txt').read().split()
filter_words = open('filterlist.txt').readline()  # readline() returns only the first line, as a single string

for line in text:
    for word in filter_words:       # iterating over a string yields characters, not words
        if word == filter_words:    # compares one character against the whole string
            text.remove(word)       # mutating a list while iterating over it skips elements

Store the filter words in a set, iterate over the words from the line in ttext.txt, and only keep the words that are not in the set of filter words.

with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    st = set(map(str.rstrip, filter_words))          # one filter word per line, newline stripped
    txt = next(text).split()                         # the single line of text, split into words
    out = [word for word in txt if word not in st]   # keep only words not in the filter set
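out is a list of the surviving words; if you want the filtered text back as a single string, rejoining it is straightforward:

filtered_text = ' '.join(out)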

If you want to ignore case and remove punctuation, you will need to call lower() on each line and strip the punctuation:

from string import punctuation
with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # lowercase each filter word and strip any trailing punctuation and the newline
    st = set(word.lower().rstrip(punctuation + "\n") for word in filter_words)
    txt = next(text).lower().split()
    out = [word for word in txt if word not in st]

If you had multiple lines in ttext.txt, using (word for line in text for word in line.split()) would be a more memory-efficient approach, as sketched below.
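A minimal sketch of that multi-line variant, reusing the same file names (assuming, as above, that the filter file holds one word per line):

with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    st = set(map(str.rstrip, filter_words))
    # the generator streams one word at a time instead of materializing every line's words
    words = (word for line in text for word in line.split())
    out = [word for word in words if word not in st]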

Using Padraic Cunningham's approach, I coded this as a function:

from string import punctuation

def vocab_filter(text, filter_vocab):
    txt = text.replace('\n', ' ').lower().split()
    out = [word for word in txt if word not in filter_vocab]
    return out

It is very important to use a set, not a list, for the second argument. Lookups in lists are O(n); lookups in sets, like dictionaries, are amortized O(1). So for big files this is optimal.
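For example, a quick hypothetical call (the sample strings here are invented for illustration):

sample = 'Hello foo\napple car water cat'
filter_vocab = {'apple', 'car'}             # a set, so each membership test is O(1)
print(vocab_filter(sample, filter_vocab))   # ['hello', 'foo', 'water', 'cat']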

Let's say this is what you have in the text.txt file: 'hello foo apple car water cat', and this is what you have in the filterlist.txt file: apple car

text = open('text.txt').read().strip("'").split(' ')
filter_words = open('filterlist.txt').readline().split()
for i in filter_words:
    if i in text:
        text.remove(i)        # note: removes only the first occurrence of each filter word
new_text = ' '.join(text)
print(new_text)

The output will be:

hello foo water cat
