How do I check if words in a list are contained in sentences in another list?
I'm web scraping and trying to filter out sentences that contain certain terms. Suppose I have this list of sentences:
z = ['a privacy policy', 'there are many standard challenges that face every business']
And I want to filter out the sentences that contain any of the terms in this list:
junk_terms = ['privacy policy', 'cookie policy', 'copyright']
So I do:
for sentence in z:
    if all(term not in sentence for term in junk_terms):
        print(sentence)
It prints: there are many standard challenges that face every business
So far so good. However, I noticed that it's not matching the term in junk_terms against that whole term in z. It's checking whether the letters of a term in junk_terms occur anywhere in a sentence in z. For example, let's change the term "privacy policy" in junk_terms to "privac":
junk_terms = ['privac', 'cookie policy', 'copyright']
I would expect it not to filter out any of the sentences in z. However, if you run it you'll see that it still filters out the sentence with "privacy policy" in it, because that sentence contains the letters "privac". Is there a way to write this code so that it compares whole words rather than letters?
re is probably what you're looking for. The result is all of the unfiltered strings. This way, you also catch strings in which a junk expression is followed by a dot or a comma.
import re
import itertools
# All of the strings
z = ['a privacy policy', 'there are many standard challenges that face every business']
junk_terms = ['privacy policy', 'cookie policy', 'copyright']
# Build the regex, making sure we don't capture parts.
regex = re.compile("|".join(r"\b{}\b".format(re.escape(term)) for term in junk_terms))
# Filter out anything that we found junk in.
result = list(itertools.filterfalse(regex.search, z))
Explanation regarding the re: \b means a word boundary, i.e. it matches at the position between a word character and a non-word character, and | means OR. Basically, \bfoo\b|\bbar\b will match any string containing foo as a word or bar as a word, and since we use filterfalse(), those strings are dropped.
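To make the word-boundary behaviour concrete, here is a minimal sketch using the data from the question. The helper name build_junk_regex is hypothetical and only exists for this illustration; it assumes every term starts and ends with a word character, since \b behaves differently next to punctuation.

```python
import re

def build_junk_regex(terms):
    """Compile one pattern that matches any term as a whole word or phrase."""
    # re.escape keeps characters such as '.' or '+' inside a term literal.
    return re.compile("|".join(r"\b{}\b".format(re.escape(t)) for t in terms))

z = ['a privacy policy', 'there are many standard challenges that face every business']

# 'privac' is only a fragment of the word 'privacy', so \b prevents a match.
partial = build_junk_regex(['privac', 'cookie policy', 'copyright'])
# 'privacy policy' appears as a whole phrase in the first sentence.
whole = build_junk_regex(['privacy policy', 'cookie policy', 'copyright'])

kept_partial = [s for s in z if not partial.search(s)]
kept_whole = [s for s in z if not whole.search(s)]

print(kept_partial)
print(kept_whole)
```

With the fragment 'privac' both sentences are kept, while the full phrase 'privacy policy' removes the first one.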
Update:
For Python 2 the correct function is ifilterfalse() instead of filterfalse().
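If the same script has to run on both Python 2 and Python 3, one common approach is to try the Python 3 name first and fall back to the Python 2 one. The following is only a sketch; the sample sentences are made up for illustration and are not from the question.

```python
import re

try:
    from itertools import filterfalse  # Python 3 name
except ImportError:
    from itertools import ifilterfalse as filterfalse  # Python 2 name

# Hypothetical sample data; the first sentence has a comma right after the term.
sentences = ['read our cookie policy, please', 'nothing to see here']
junk_terms = ['privacy policy', 'cookie policy', 'copyright']

regex = re.compile("|".join(r"\b{}\b".format(re.escape(t)) for t in junk_terms))

# filterfalse keeps only the items for which regex.search finds no match.
result = list(filterfalse(regex.search, sentences))
print(result)
```

The comma after "cookie policy" still counts as a word boundary, so that sentence is filtered out.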
I think your code works the way it is intended. You can also write it with a list comprehension:
print([sentence for sentence in z if not any(term in sentence for term in junk_terms)])
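For context on why 'privac' still matches in the question: on strings, the in operator is a substring test, not a word test. A short sketch of the difference, using the sentence from the question (the variable names are only for illustration):

```python
sentence = 'a privacy policy'

# On a str, `in` checks for a substring anywhere in the text.
substring_hit = 'privac' in sentence

# On a list of words, `in` checks for an exact element instead.
word_hit = 'privac' in sentence.split()

# Splitting breaks multi-word terms apart, so a phrase never equals one element.
phrase_hit = 'privacy policy' in sentence.split()

print(substring_hit, word_hit, phrase_hit)
```

Splitting on whitespace handles single words but not multi-word terms, which is one reason a word-boundary regex may be the more convenient tool here.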