
How to match words from two lists against a string of words without substring matching in Python?

I have two lists of keywords:

slangNames = ["Vikes", "Demmies", "D", "MS Contin"]
riskNames = ["enough", "pop", "final", "stress", "trade"]

I also have a dictionary called overallDict that contains tweets. The key-value pairs are {ID: tweet text}. For example:

{1:"Vikes is not enough for me", 2:"Demmies is okay", 3:"pop a D"}

I am trying to isolate only those tweets that contain at least one keyword from slangNames AND at least one keyword from riskNames. So from the above example, my code should return keys 1 and 3, i.e.,

{1:"Vikes is not enough for me", 3:"pop a D"}. 

But my code is picking up substrings instead of complete words, so anything containing the letter 'D' gets picked up. How do I match these as whole words and not as substrings? Please help. Thanks!

My code so far is as below:

output = []
for key in overallDict:
    # `x in some_string` is a substring test, which is why 'D' matches inside other words
    if any(x in overallDict[key] for x in strippedRisks) and (any(x in overallDict[key] for x in strippedSlangs)):
        output.append(key)

Store slangNames and riskNames as sets, split each tweet into words, and check that at least one word appears in each set:

slangNames = {"Vikes", "Demmies", "D", "MS", "Contin"}
riskNames = {"enough", "pop", "final", "stress", "trade"}
d = {1: "Vikes is not enough for me", 2: "Demmies is okay", 3: "pop a D"}

for k, v in d.items():
    spl = v.split()  # split once, on whitespace
    # membership in a set compares whole words, not substrings
    if any(word in slangNames for word in spl) and any(word in riskNames for word in spl):
        print(k, v)

Output:

1 Vikes is not enough for me
3 pop a D
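
The question asks for a dictionary rather than printed lines, so the same check can be wrapped in a small function that builds the filtered dict. This is only a sketch: the name filter_tweets and its parameters are hypothetical and do not appear in the original post.

```python
def filter_tweets(tweets, slang_names, risk_names):
    """Return the tweets that contain at least one whole word from each keyword set."""
    slang_names, risk_names = set(slang_names), set(risk_names)
    result = {}
    for key, text in tweets.items():
        words = text.split()  # whole words only, so "D" does not match "Demmies"
        if any(w in slang_names for w in words) and any(w in risk_names for w in words):
            result[key] = text
    return result


overallDict = {1: "Vikes is not enough for me", 2: "Demmies is okay", 3: "pop a D"}
matches = filter_tweets(
    overallDict,
    {"Vikes", "Demmies", "D", "MS", "Contin"},
    {"enough", "pop", "final", "stress", "trade"},
)
print(matches)  # {1: 'Vikes is not enough for me', 3: 'pop a D'}
```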

Or use not set.isdisjoint:

slangNames = {"Vikes", "Demmies", "D", "MS", "Contin"}
riskNames = {"enough", "pop", "final", "stress", "trade"}
d = {1: "Vikes is not enough for me", 2: "Demmies is okay", 3: "pop a D"}

for k, v in d.items():
    spl = v.split()
    # isdisjoint returns True when the set shares no element with spl
    if not slangNames.isdisjoint(spl) and not riskNames.isdisjoint(spl):
        print(k, v)

Using any should be efficient, since it short-circuits on the first match. Two sets are disjoint if their intersection is empty, so if not slangNames.isdisjoint(spl) is True, at least one word in the tweet is also in slangNames.
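
To make the isdisjoint check concrete, here is a small sketch with illustrative variable names that are not from the original answer. isdisjoint accepts any iterable, so the list returned by split() can be passed directly, and the result agrees with testing whether the intersection is non-empty.

```python
slang_names = {"Vikes", "Demmies", "D"}

# "D" appears as a whole word, so the set and the word list are not disjoint.
has_whole_word = not slang_names.isdisjoint("pop a D".split())

# "Daddy" merely contains the letter D; no whole word matches.
has_substring_only = not slang_names.isdisjoint("Daddy is okay".split())

# Equivalent check written as an explicit intersection.
same_via_intersection = bool(slang_names & set("pop a D".split()))

print(has_whole_word, has_substring_only, same_via_intersection)  # True False True
```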

If MS Contin is actually a single name, you need to catch it separately, because split() breaks it into two words:

import re

slangNames = {"Vikes", "Demmies", "D"}
r = re.compile(r"\bMS Contin\b")  # \b anchors the phrase at word boundaries
riskNames = {"enough", "pop", "final", "stress", "trade"}
d = {1: "Vikes is not enough for me", 2: "Demmies is okay", 3: "pop a D"}

for k, v in d.items():
    spl = v.split()
    # single-word slang via the set, the two-word name via the regex
    if (not slangNames.isdisjoint(spl) or r.search(v)) and not riskNames.isdisjoint(spl):
        print(k, v)
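
Note that split() only separates on whitespace, so a keyword followed by punctuation (such as "D." or "final,") would not be found in the sets. One possible alternative, which is not part of the original answer, is to compile every keyword, including multi-word names, into a single word-boundary regex. The helper name build_pattern and the extra sample tweets below are assumptions made for illustration.

```python
import re


def build_pattern(keywords):
    """Compile keywords into one regex that matches any of them as whole words."""
    # Longest first, so a multi-word name is tried before shorter alternatives.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b")


slang_pattern = build_pattern(["Vikes", "Demmies", "D", "MS Contin"])
risk_pattern = build_pattern(["enough", "pop", "final", "stress", "trade"])

tweets = {
    1: "Vikes is not enough for me",
    2: "Demmies is okay",
    3: "pop a D.",                        # trailing punctuation
    4: "MS Contin after a final, Daddy",  # multi-word name; "Daddy" must not match "D"
}
matches = {k: v for k, v in tweets.items()
           if slang_pattern.search(v) and risk_pattern.search(v)}
print(sorted(matches))  # [1, 3, 4]
```

The match is case-sensitive and exact, so "stressed" would not count as "stress"; whether that is desirable depends on the data.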

