Python comparing two lists and filtering items

I would like to do some word filtering: extracting only the items in the 'keyword' list that also exist in 'whitelist'.

Here is my code so far:

whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []
 
for word in whitelist:
    for i in range(len(keyword)):
        if word in keyword[i]:
            keyword_filter.append(word)
        else: pass

I want to remove every word except for 'Cat', 'Dog', and 'Cow' (which are in the 'whitelist') so that the result ('keyword_filter' list) will look like this:

['Cat, Cow', 'Dog', '', 'Cat']

However, I get a result like this:

['Cat', 'Cat', 'Dog', 'Cow']

I would sincerely appreciate it if you could give me some advice.

You need to split each string in the list and check whether each resulting word is in the whitelist. Then rejoin the words that pass the filter:

whitelist = {'Cat', 'Dog', 'Cow'}
filtered = []
for words in keyword:
    filtered.append(', '.join(w for w in words.split(', ') if w in whitelist))

print(filtered)
# ['Cat, Cow', 'Dog', '', 'Cat']

It is better to make whitelist a set, since that speeds up the membership test for each word.
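
As a rough sketch of that suggestion, the conversion to a set can happen once inside a small helper, so callers can still pass a plain list. The function name filter_keywords is made up for this illustration and does not appear in the question.

```python
def filter_keywords(keyword, whitelist):
    """Keep only whitelisted words in each comma-separated string."""
    # Build the set once; membership tests on a set are O(1) on average.
    allowed = set(whitelist)
    return [
        ', '.join(w for w in words.split(', ') if w in allowed)
        for words in keyword
    ]


keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
print(filter_keywords(keyword, ['Cat', 'Dog', 'Cow']))
# ['Cat, Cow', 'Dog', '', 'Cat']
```

A string with no whitelisted words becomes '' because joining an empty sequence yields an empty string.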

You could also use re.findall to pull the whitelisted words out of each string, then rejoin the matches:

import re

# Match any whitelisted word; the separators are added back by join
pattern = re.compile(r'Cat|Dog|Cow')
filtered = [', '.join(pattern.findall(words)) for words in keyword]

Try this:

whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []

for word in keyword:
    whitelistedWords = []
    for w in word.split(', '):
        if w in whitelist:
            whitelistedWords.append(w)
    keyword_filter.append(', '.join(whitelistedWords))

print(keyword_filter)

Simple list comprehension:

whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = [', '.join(w for w in k.split(', ') if w in whitelist) for k in keyword]

print(keyword_filter)

The output:

['Cat, Cow', 'Dog', '', 'Cat']

Since you want to preserve the order of your keyword list, you'll want to have that as the outermost loop.

for phrase in keyword:

Now you need to split up the phrase into its actual words and determine if those words are in the whitelist. Then you need to put the words back together. You can do this in one line.

    filtered = ", ".join(word for word in phrase.split(", ") if word in whitelist)

Breakdown: phrase.split(", ") gives you a list of the strings that were separated by ", " in the original string, i.e. the words you care about. word for word in ... if word in whitelist is a generator expression. It yields each word in ..., in this case phrase.split(", "), that meets the condition word in whitelist. Finally, ", ".join(...) gives you a string made up of every element it receives, connected by ", ".

Lastly, you need to put the newly filtered string into your list of filtered strings.

    keyword_filter.append(filtered)
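
Assembled into one runnable piece, the steps above might look like the following. The intermediate values for the first phrase are printed so each stage is visible; the names parts and kept are only for illustration.

```python
whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']

# Intermediate values for the first phrase
parts = keyword[0].split(", ")
print(parts)             # ['Cat', 'Cow', 'Horse']
kept = [word for word in parts if word in whitelist]
print(kept)              # ['Cat', 'Cow']
print(", ".join(kept))   # Cat, Cow

# The full loop: keyword is the outer loop, so its order is preserved
keyword_filter = []
for phrase in keyword:
    filtered = ", ".join(word for word in phrase.split(", ") if word in whitelist)
    keyword_filter.append(filtered)

print(keyword_filter)    # ['Cat, Cow', 'Dog', '', 'Cat']
```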

As a sidenote, I agree with others that you should use a set for your collection of whitelisted words. Set lookups take constant time on average, whereas a list is scanned element by element. However, for a minuscule list of words like this example you won't notice a performance difference.
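
If you want to see the difference for yourself, a quick timeit comparison along these lines should show it. The 10,000-entry word list is a made-up stand-in for a large whitelist, and the actual timings will vary by machine.

```python
import timeit

# Hypothetical large whitelist, purely to make the difference measurable
words = ['word%d' % i for i in range(10000)]
as_list = words
as_set = set(words)

# Look up the last entry, which is the worst case for the list scan
target = 'word9999'
list_time = timeit.timeit(lambda: target in as_list, number=1000)
set_time = timeit.timeit(lambda: target in as_set, number=1000)

print('list: %.6f s, set: %.6f s' % (list_time, set_time))
```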

You could use regex:

import re

whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []

for words in keyword:
    match = re.findall('(' + '|'.join(whitelist) + r')[,\s]*', words)
    keyword_filter.append(', '.join(match))
print(keyword_filter)
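
One caveat when building the pattern straight from the whitelist: without word boundaries, Cat also matches inside a longer word such as Caterpillar, and a whitelist entry containing regex metacharacters would be read as pattern syntax. A possible variant, offered as a sketch and not as part of the answer above, escapes each entry with re.escape and anchors it with \b. The 'Caterpillar' input is invented for the demonstration.

```python
import re

whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Caterpillar, Cow', 'Bird, Whale, Dog']  # made-up input with a near-miss word

# Plain alternation: matches whitelist entries anywhere, even inside longer words
loose = re.compile('|'.join(whitelist))
# Escaped entries wrapped in word boundaries: matches whole words only
strict = re.compile(r'\b(?:' + '|'.join(map(re.escape, whitelist)) + r')\b')

print([', '.join(loose.findall(words)) for words in keyword])
# ['Cat, Cow', 'Dog']
print([', '.join(strict.findall(words)) for words in keyword])
# ['Cow', 'Dog']
```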
