
How do I check if words in a list are contained in sentences in another list?

I'm web scraping and trying to filter out sentences that contain certain terms. Suppose I have this list of sentences:

z = ['a privacy policy', 'there are many standard challenges that face every business']

And I want to filter out the sentences that contain any of the terms in this list:

junk_terms = ['privacy policy', 'cookie policy', 'copyright']

So I do:

for sentence in z:
    # Keep the sentence only if none of the junk terms occurs in it.
    if all(term not in sentence for term in junk_terms):
        print(sentence)

It prints out: there are many standard challenges that face every business

So far so good. However, I noticed that it's not matching each term in junk_terms as a whole term in z. It's checking whether the letters of a term occur anywhere in a sentence. For example, let's change the term "privacy policy" in junk_terms to "privac":

junk_terms = ['privac', 'cookie policy', 'copyright']

I would expect it not to filter out any of the sentences in z. However, if you run it you'll see that it still filters out the sentence with "privacy policy" in it, because that sentence contains the letters "privac". Is there a way to write this code so that it compares whole words rather than letters?

re is probably what you're looking for. The result is the list of strings that were not filtered out. This way, you also catch strings in which a junk expression is followed by a dot or a comma.

import re
import itertools
# All of the strings
z = ['a privacy policy', 'there are many standard challenges that face every business']
junk_terms = ['privacy policy', 'cookie policy', 'copyright']

# Build the regex, making sure we don't capture parts.
regex = re.compile("|".join(r"\b{}\b".format(term) for term in junk_terms))

# Filter out anything that we found junk in.
result = list(itertools.filterfalse(regex.search, z))

Explanation of the regex: \b means a word boundary, i.e. the position between a word character and a non-word character (or the start or end of the string), and | means OR. Basically, \bfoo\b|\bbar\b will match any string containing foo as a word or bar as a word, and since we use filterfalse(), those strings are dropped.
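As a quick check of the word-boundary behaviour described above, here is a minimal sketch. The sample sentences and the use of re.escape are additions for illustration and are not part of the original answer; re.escape is only there so that a term containing regex metacharacters would be matched literally.

```python
import re

junk_terms = ['privac', 'cookie policy', 'copyright']

# Illustrative sentences (assumed for this sketch, not from the question's data).
sentences = [
    'a privacy policy',                # "privac" appears only as part of a word
    'see our cookie policy.',          # junk term followed by a dot
    'copyright, all rights reserved',  # junk term followed by a comma
    'there are many standard challenges that face every business',
]

# \b on both sides means each term must match as a whole word or phrase.
pattern = re.compile(
    "|".join(r"\b{}\b".format(re.escape(term)) for term in junk_terms)
)

# Keep only the sentences in which no junk term matches.
kept = [s for s in sentences if not pattern.search(s)]
print(kept)
```

With "privac" as the junk term, the sentence 'a privacy policy' should be kept, because "privac" is followed by another word character and so no boundary matches there.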

Update:

For Python 2, the correct function is ifilterfalse() instead of filterfalse().
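If the same script has to run under both Python 2 and Python 3, one possible approach (an assumption of this sketch, not something the answer prescribes) is to try the Python 3 name first and fall back to the Python 2 name:

```python
import re

try:
    # Python 3 name
    from itertools import filterfalse
except ImportError:
    # Python 2 name, aliased so the rest of the code is unchanged
    from itertools import ifilterfalse as filterfalse

z = ['a privacy policy', 'there are many standard challenges that face every business']
junk_terms = ['privacy policy', 'cookie policy', 'copyright']

# Same pattern as in the answer: each term wrapped in word boundaries.
regex = re.compile("|".join(r"\b{}\b".format(term) for term in junk_terms))

# filterfalse keeps the items for which regex.search returns None.
result = list(filterfalse(regex.search, z))
print(result)
```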

I think your code works the way it is intended. You can also write it with a list comprehension:

print([sentence for sentence in z if not any(term in sentence for term in junk_terms)])
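The list comprehension uses "not any(term in sentence ...)" where the question's loop uses "all(term not in sentence ...)". The two conditions are logically equivalent, which the sketch below checks on the question's data. The helper names keep_with_all and keep_with_not_any are hypothetical, chosen only for this illustration. Both versions still perform a substring test, so neither changes the "privac" behaviour described in the question.

```python
z = ['a privacy policy', 'there are many standard challenges that face every business']
junk_terms = ['privacy policy', 'cookie policy', 'copyright']


def keep_with_all(sentences, terms):
    # The condition from the question's loop.
    return [s for s in sentences if all(t not in s for t in terms)]


def keep_with_not_any(sentences, terms):
    # The condition from the list comprehension.
    return [s for s in sentences if not any(t in s for t in terms)]


same = keep_with_all(z, junk_terms) == keep_with_not_any(z, junk_terms)
print(same)

# Substring matching: a partial term such as 'privac' still filters the sentence.
partial = keep_with_not_any(z, ['privac'])
print(partial)
```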


