
Filter words from one text file in another text file?

I have a file that is a list of words, one word on each line: filterlist.txt. The other file is a giant string of text: text.txt.

I want to find all the instances of the words from filterlist.txt in text.txt and delete them.

Here is what I have so far:

text = open('ttext.txt').read().split()
filter_words = open('filterlist.txt').readline()  # readline() returns only the first line, as a single string

for line in text:
    for word in filter_words:       # iterating over a string yields characters, not words
        if word == filter_words:    # compares one character against the whole string
            text.remove(word)       # mutating a list while iterating over it skips elements

Store the filter words in a set, iterate over the words from the line in ttext.txt, and only keep the words that are not in the set of filter words.

with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    st = set(map(str.rstrip, filter_words))          # one filter word per line, newline stripped
    txt = next(text).split()                         # the single line of text, split into words
    out = [word for word in txt if word not in st]   # keep only words not in the filter set
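out is a list of the surviving words; if you want the filtered text back as a single string, rejoining it is straightforward:

filtered_text = ' '.join(out)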

If you want to ignore case and remove punctuation, you will need to call lower() on each line and strip the punctuation:

from string import punctuation
with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    # lowercase each filter word and strip any trailing punctuation and the newline
    st = set(word.lower().rstrip(punctuation + "\n") for word in filter_words)
    txt = next(text).lower().split()
    out = [word for word in txt if word not in st]

If you had multiple lines in ttext.txt, using (word for line in text for word in line.split()) would be a more memory-efficient approach, as sketched below.
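A minimal sketch of that multi-line variant, reusing the same file names (assuming, as above, that the filter file holds one word per line):

with open('ttext.txt') as text, open('filterlist.txt') as filter_words:
    st = set(map(str.rstrip, filter_words))
    # the generator streams one word at a time instead of materializing every line's words
    words = (word for line in text for word in line.split())
    out = [word for word in words if word not in st]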

Using Padraic Cunningham's approach, I coded this as a function:

from string import punctuation

def vocab_filter(text, filter_vocab):
    txt = text.replace('\n', ' ').lower().split()
    out = [word for word in txt if word not in filter_vocab]
    return out

It is very important to use a set, not a list, for the second argument. Lookups in lists are O(n); lookups in sets, like dictionaries, are amortized O(1). So for big files this is optimal.
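For example, a quick hypothetical call (the sample strings here are invented for illustration):

sample = 'Hello foo\napple car water cat'
filter_vocab = {'apple', 'car'}             # a set, so each membership test is O(1)
print(vocab_filter(sample, filter_vocab))   # ['hello', 'foo', 'water', 'cat']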

Let's say this is what you have in the text.txt file: 'hello foo apple car water cat', and this is what you have in the filterlist.txt file: apple car

text = open('text.txt').read().strip("'").split(' ')
filter_words = open('filterlist.txt').readline().split()
for i in filter_words:
    if i in text:
        text.remove(i)        # note: removes only the first occurrence of each filter word
new_text = ' '.join(text)
print(new_text)

The output will be:

hello foo water cat
