
How to check whether an element from a given list is in a text using Python?

I have to check whether each element of a given list occurs in a text. I can do this when the element is a single word, but when it contains multiple words, as below, I cannot get it to work.

text="what is the price of wheat and White Pepper?"

words=['wheat','White Pepper','rice','pepper']

Expected output=['wheat','White Pepper']

I tried the approaches below, but I am not getting the expected output. Can anyone help me?

>>> output=[word for word in words if word in text]
>>> print output
['wheat', 'White Pepper', 'rice']

Here it picks up "rice" because it is a substring of the word "price".

If I use nltk or any other tokenizer, it splits "White Pepper" into "White" and "Pepper", so the two-word element is never matched:

>>> from nltk import word_tokenize
>>> n_words=word_tokenize(text)
>>> print n_words
['what', 'is', 'the', 'price', 'of', 'wheat', 'and', 'White', 'Pepper', '?']
>>> output=[word for word in words if word in n_words]
>>> print output
['wheat']

You could use regular expressions with word boundaries (\b), so that a word only matches when it is not part of a longer word:

import re

text="what is the price of wheat and White Pepper?"

words=['wheat','White Pepper','rice','pepper']

output=[word for word in words if re.search(r"\b{}\b".format(word),text)]

print(output)

Result:

['wheat', 'White Pepper']
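As a minimal sketch of the same idea wrapped in a function: the name words_in_text and the ignore_case parameter are illustrative assumptions, not part of the answer above. re.escape is added so that words containing regex metacharacters are treated literally. The sketch still assumes each word starts and ends with a letter or digit, since \b only behaves as expected next to word characters.

```python
import re

def words_in_text(words, text, ignore_case=False):
    """Return the words that occur in text as whole words (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    found = []
    for word in words:
        # \b on both sides rejects matches inside longer words ("rice" in "price")
        pattern = r"\b{}\b".format(re.escape(word))
        if re.search(pattern, text, flags):
            found.append(word)
    return found

text = "what is the price of wheat and White Pepper?"
words = ['wheat', 'White Pepper', 'rice', 'pepper']

print(words_in_text(words, text))                    # ['wheat', 'White Pepper']
print(words_in_text(words, text, ignore_case=True))  # ['wheat', 'White Pepper', 'pepper']
```

With ignore_case=True, 'pepper' is reported as well because it matches "Pepper" in the text. Whether that is wanted depends on the use case.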

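You can optimize the search by pre-building a single regex (courtesy of Jon Clements):

output = re.findall(r'\b(?:{})\b'.format('|'.join(sorted(words, key=len, reverse=True))), text)

Sorting by length, longest first, makes sure the longest strings are tried first, so a longer phrase is matched before a shorter word that is its prefix. The alternatives are wrapped in a non-capturing group so that the word boundaries apply to every word, including the first and last ones in the pattern. Regex escaping is not needed here since the words contain only spaces and alphanumeric characters.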
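If the same word list is checked against many texts, the combined pattern can be compiled once and reused. The following is only a sketch under that assumption: build_matcher is a hypothetical helper name, and re.escape is added in case the words ever contain regex metacharacters.

```python
import re

def build_matcher(words):
    """Compile one pattern matching any of the words as a whole word (hypothetical helper)."""
    # Longest first, so a longer phrase wins over a shorter word that is its prefix
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # The group makes \b apply to every alternative
    return re.compile(r"\b(?:{})\b".format(alternation))

matcher = build_matcher(['wheat', 'White Pepper', 'rice', 'pepper'])

print(matcher.findall("what is the price of wheat and White Pepper?"))
# ['wheat', 'White Pepper']
print(matcher.findall("rice and pepper, please"))
# ['rice', 'pepper']
```

Note that findall returns matches in the order they appear in the text, and a word that occurs twice is returned twice.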

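I would do something like this:

def findWord(candidates, text):
    # Collect the candidates whose first occurrence starts at a word start
    words = []
    for candidate in candidates:
        index = text.find(candidate)  # -1 if the candidate is absent
        if index != -1:
            # Skip matches that begin mid-word, e.g. "rice" inside "price"
            if index != 0 and text[index - 1] != " ":
                continue
            words.append(candidate)
    return words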

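The string method find returns -1 if the substring is not present; otherwise it returns the index where the first occurrence starts. For 'White Pepper' it returns 31. The function then rejects a match that is not preceded by a space, which is how "rice" inside "price" is filtered out. Note that it looks only at the first occurrence of each element and checks only the character before it, not the one after.

This returns ['wheat', 'White Pepper'] for the test case you provided.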
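A variant of the same find-based approach that looks at every occurrence and checks the character on both sides could be sketched as follows. The name find_whole_words is hypothetical, and treating any non-alphanumeric character (space, punctuation) as a word separator is an assumption of this sketch.

```python
def find_whole_words(candidates, text):
    """Return the candidates that appear in text as whole words (hypothetical helper)."""
    found = []
    for candidate in candidates:
        start = text.find(candidate)
        while start != -1:
            end = start + len(candidate)
            # A match counts only if it is not glued to a letter or digit on either side
            before_ok = start == 0 or not text[start - 1].isalnum()
            after_ok = end == len(text) or not text[end].isalnum()
            if before_ok and after_ok:
                found.append(candidate)
                break
            # Otherwise keep looking for a later occurrence
            start = text.find(candidate, start + 1)
    return found

text = "what is the price of wheat and White Pepper?"
words = ['wheat', 'White Pepper', 'rice', 'pepper']
print(find_whole_words(words, text))                  # ['wheat', 'White Pepper']
print(find_whole_words(['rice'], "the price of rice"))  # ['rice']
```

The second call shows the difference from checking only the first occurrence: "rice" is first found inside "price", rejected, and then found again as a separate word.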
