
How do I check if words in a list are contained in sentences in another list?

I'm web scraping and trying to filter out sentences with certain terms in them. Suppose I have this list of sentences:

z = ['a privacy policy', 'there are many standard challenges that face every business']

And I want to filter out the sentences in it that contain any words in this list:

junk_terms = ['privacy policy', 'cookie policy', 'copyright']

So I do:

for sentence in z:
    # Keep the sentence only if none of the junk terms occurs in it.
    if all(term not in sentence for term in junk_terms):
        print(sentence)

It prints out: there are many standard challenges that face every business

So far so good. However, I noticed that it's not matching each term in junk_terms as a whole term in z. It's checking whether the letters of a term in junk_terms occur anywhere in z. For example, let's change the term "privacy policy" in junk_terms to "privac":

junk_terms = ['privac', 'cookie policy', 'copyright']

I would expect it to not filter out any of the sentences in z. However, if you run it you'll see that it still filters out the sentence with "privacy policy" in it because it contains the letters "privac". Is there a way to write this code so that it's not comparing the letters but rather the whole word?

The re module is probably what you're looking for. The result is the list of strings that contain no junk term. Because the pattern uses word boundaries, you also catch junk terms that are followed by a period or a comma.

import re
import itertools

# All of the strings
z = ['a privacy policy', 'there are many standard challenges that face every business']
junk_terms = ['privacy policy', 'cookie policy', 'copyright']

# Build the regex. \b keeps us from matching parts of words, and
# re.escape() protects terms that contain regex metacharacters.
regex = re.compile("|".join(r"\b{}\b".format(re.escape(term)) for term in junk_terms))

# Keep only the strings in which no junk term was found.
result = list(itertools.filterfalse(regex.search, z))

Explanation regarding the regex: \b is a word boundary, which matches the position between a word character and a non-word character (or the start or end of the string), and | means OR. So \bfoo\b|\bbar\b matches any string containing foo as a word or bar as a word, and since we pass the search to filterfalse(), those strings are dropped.
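To make the word-boundary behaviour concrete, here is a small sketch using the foo/bar pattern from the explanation. The sample strings are made up for illustration and are not from the question.

```python
import re

# The pattern from the explanation: foo as a whole word, or bar as a whole word.
pattern = re.compile(r"\bfoo\b|\bbar\b")

# Hypothetical sample strings chosen to show where \b does and does not match.
samples = ["foo", "a foo, then", "food", "rebar", "bar."]

# Punctuation counts as a boundary; a neighbouring letter does not.
matches = [s for s in samples if pattern.search(s)]
print(matches)
```

"food" and "rebar" are left out because the letters next to foo and bar are word characters, so there is no boundary at that position.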

Update:

For Python 2, the correct function is itertools.ifilterfalse() instead of itertools.filterfalse().
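If the same script has to run on both versions, one possible approach (a sketch, not the only way) is to fall back to the Python 2 name when the Python 3 import fails. The data is the same as in the question.

```python
import re

try:
    from itertools import filterfalse  # Python 3 name
except ImportError:
    # Python 2 name, aliased so the rest of the code stays the same.
    from itertools import ifilterfalse as filterfalse

z = ['a privacy policy', 'there are many standard challenges that face every business']
junk_terms = ['privacy policy', 'cookie policy', 'copyright']

# Whole-word pattern; re.escape() guards against regex metacharacters in terms.
regex = re.compile("|".join(r"\b{}\b".format(re.escape(term)) for term in junk_terms))

# Keep only the sentences in which the pattern finds nothing.
result = list(filterfalse(regex.search, z))
print(result)
```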

I think your code works the way it is intended. You can also write it with a list comprehension:

print([sentence for sentence in z if not any(term in sentence for term in junk_terms)])
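Note that the in operator used here is a substring test, so 'privac' still matches 'a privacy policy'. As a rough sketch of the difference, the following compares the substring test with a whole-word test. The helper has_whole_term is a hypothetical name introduced only for this illustration.

```python
import re

z = ['a privacy policy', 'there are many standard challenges that face every business']
junk_terms = ['privac', 'cookie policy', 'copyright']


def has_whole_term(sentence, term):
    # Hypothetical helper: True only if term appears bounded by word boundaries.
    return re.search(r"\b{}\b".format(re.escape(term)), sentence) is not None


# Substring test: 'privac' is found inside 'privacy', so that sentence is dropped.
substring_kept = [s for s in z if not any(t in s for t in junk_terms)]

# Whole-word test: 'privac' is not a word on its own, so both sentences are kept.
whole_word_kept = [s for s in z if not any(has_whole_term(s, t) for t in junk_terms)]

print(substring_kept)
print(whole_word_kept)
```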
