Determine if a list of words is in a sentence?

Is there a way (Pattern, plain Python, NLTK, etc.) to detect whether a sentence has a list of words in it?

For example:

The cat ran into the hat, box, and house. | The list would be hat, box, and house

This could be handled with plain string processing, but we may have more general lists.

For example:

The cat likes to run outside, run inside, or jump up the stairs. |

List=run outside, run inside, or jump up the stairs.

The list could be in the middle of a paragraph or at the end of a sentence, which further complicates things.

I've been working with Pattern for Python for a while and I'm not seeing a way to go about this. I was curious whether there is a way to do it with Pattern or NLTK (Natural Language Toolkit).

From what I understand of your question, you want to check whether all the words in your list are present in a sentence.

In general, to search for all the elements of a list in a sentence, you can use the all() function. It returns True if every item in the iterable passed to it is true.

listOfWords = ['word1', 'word2', 'word3', 'two words']
sentence = "word1 as word2 a fword3 af two words"

# `word in sentence` is a substring test, so 'word3' matches inside 'fword3'
if all(word in sentence for word in listOfWords):
    print("All words in sentence")
else:
    print("Missing")

Output:

All words in sentence

I think this might serve your purpose. If not, then you can clarify.
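One thing to be aware of with the snippet above: `word in sentence` is a substring test, so 'word3' is reported as present because it occurs inside 'fword3'. If whole-word matching is what you want, a minimal sketch using regular-expression word boundaries might look like this. The helper name `contains_all` is made up for illustration and is not part of any library.

```python
import re

def contains_all(sentence, phrases):
    """Return True if every phrase occurs in sentence as whole words."""
    for phrase in phrases:
        # \b anchors the match at word boundaries; re.escape protects any
        # punctuation that might appear inside the phrase.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if not re.search(pattern, sentence):
            return False
    return True

listOfWords = ['word1', 'word2', 'word3', 'two words']
sentence = "word1 as word2 a fword3 af two words"

print(all(word in sentence for word in listOfWords))  # substring test
print(contains_all(sentence, listOfWords))            # whole-word test
```

With this sentence the substring test succeeds while the whole-word test fails, because 'word3' only appears as part of 'fword3'.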

What about using sent_tokenize from nltk.tokenize?

from nltk.tokenize import sent_tokenize

sent_tokenize("Hello SF Python. This is NLTK.")
# ["Hello SF Python.", "This is NLTK."]

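Then you can loop over that list of sentences (called my_list here) in this way:

for sentence in my_list:
  # test whether this sentence contains all the words you want,
  # using the all() function
  if all(word in sentence for word in listOfWords):
    print(sentence)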

More info here

all(word in sentence for word in listOfWords)
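To show how the two pieces fit together, here is a small self-contained sketch. It assumes a naive regex splitter as a stand-in for sent_tokenize so that it runs without NLTK or its tokenizer data; with NLTK installed you would likely call sent_tokenize instead. The names `split_sentences` and `sentences_with_all` are hypothetical helpers, not NLTK functions.

```python
import re

def split_sentences(text):
    # Naive stand-in for nltk's sent_tokenize: split after '.', '!' or '?'
    # when followed by whitespace. Assumption: good enough for simple text,
    # but it will mis-split abbreviations such as "e.g. this".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def sentences_with_all(text, words):
    """Return the sentences of text that contain every entry in words."""
    return [s for s in split_sentences(text)
            if all(word in s for word in words)]

text = "The cat ran into the hat, box, and house. The dog slept."
print(sentences_with_all(text, ["hat", "box", "house"]))
```

Only the first sentence is returned here, since the second one contains none of the three words.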

Using a trie, you can do this in time roughly proportional to the number of words in the sentence, once the trie has been built from the list of words. Building the trie takes O(n), where n is the number of words in the list.

Algorithm

  • Split the sentence into a list of words separated by spaces.
  • For each word, check whether it is a key in the trie, i.e. whether that word exists in the list.
    • If it exists, add that word to the result, to keep track of how many words from the list appear in the sentence.
    • Keep track of the words that have a subtrie, i.e. words that are a prefix of a longer entry in the list of words.
      • For each of these tracked words, check whether extending it with the current word gives a key or a subtrie in the trie.
    • If the extension is a subtrie, add it to the extended_words list and check whether concatenating it with the following words gives an exact match.

Code

import pygtrie

listOfWords = ['word1', 'word2', 'word3', 'two words']

# Use a space as the key separator so that a multi-word entry such as
# 'two words' is stored as a path of two nodes.
SEPARATOR = ' '
trie = pygtrie.StringTrie(separator=SEPARATOR)
for word in listOfWords:
  trie[word] = True

sentence = "word1 as word2 a fword3 af two words"
sentence_words = sentence.split()
words_found = {}
extended_words = set()  # partial matches that are prefixes of longer entries

for possible_word in sentence_words:
  # has_node returns a bit field combining HAS_VALUE and HAS_SUBTRIE
  has_possible_word = trie.has_node(possible_word)

  if has_possible_word & trie.HAS_VALUE:
    words_found[possible_word] = True

  # Try to extend every pending partial match with the current word
  pending = set(extended_words)
  for extended_word in pending:
    extended_words.remove(extended_word)

    possible_extended_word = extended_word + SEPARATOR + possible_word
    has_possible_extended_word = trie.has_node(possible_extended_word)

    if has_possible_extended_word & trie.HAS_VALUE:
      words_found[possible_extended_word] = True

    if has_possible_extended_word & trie.HAS_SUBTRIE:
      # add(), not update(): update() would insert the string's characters
      extended_words.add(possible_extended_word)

  if has_possible_word & trie.HAS_SUBTRIE:
    extended_words.add(possible_word)

print(words_found)
print(len(words_found) == len(listOfWords))

This is useful if your list of words is huge and you do not want to iterate over it every time, or if you have a large number of queries over the same list of words.

The code is here
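If you would rather not depend on pygtrie, the same idea can be sketched with nested dictionaries. This is only an illustration of building the trie once and reusing it for many sentences, not the pygtrie implementation. The names `build_trie`, `find_phrases` and `END` are made up for this example, and it assumes words are separated by whitespace with no punctuation attached.

```python
END = object()  # marker key: a complete phrase ends at this node

def build_trie(phrases):
    """Build a nested-dict trie keyed by the words of each phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = phrase
    return root

def find_phrases(trie, sentence):
    """Return the set of phrases from the trie that occur in sentence."""
    words = sentence.split()
    found = set()
    for start in range(len(words)):
        node = trie
        # Walk the trie for as long as consecutive words keep matching.
        for word in words[start:]:
            node = node.get(word)
            if node is None:
                break
            if END in node:
                found.add(node[END])
    return found

listOfWords = ['word1', 'word2', 'word3', 'two words']
trie = build_trie(listOfWords)  # built once, reused for every query

for sentence in ["word1 as word2 a fword3 af two words",
                 "word3 word2 two words word1"]:
    found = find_phrases(trie, sentence)
    print(sorted(found), len(found) == len(listOfWords))
```

The first sentence yields three of the four entries ('word3' only appears inside 'fword3'), while the second contains all four.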
