
Python string matching - Find if a certain number of words in a list of words exist in a sentence in another list

I have a string and a list defined as below:

my_string = 'she said he replied'
my_list = ['This is a cool sentence', 'This is another sentence','she said hello he replied goodbye', 'she replied', 'Some more sentences in here', 'et cetera et cetera...']

I am trying to check whether at least 3 words in my_string exist in any of the strings in my_list. The approach I'm taking is to split my_string and use all to do the matching. However, this only works if all the items in my_string exist in a sentence from my_list:

if all(word in item for item in my_list for word in my_string.split()):
    print('we happy')

1- How can I make it so the condition is satisfied if at least 3 items of my_string are present in the sentence list?

2- Is it possible to match only the first and last word in my_string, in the same order? I.e. "she" and "replied" are present in 'she replied' at index 3 of my_list, so return True.

The words in common between two strings can be computed using a set intersection. The len of the resulting set gives you the number of words the strings have in common.

First build a set of all words in the strings in my_list, using a set union:

all_words = set.union(*[set(item.split()) for item in my_list])

Then check if the intersection has length >= 3:

search_words = set(my_string.split())
if len(search_words & all_words) >= 3:
    print('we happy')
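Note that the union above pools the words of every sentence, so the three matches may come from different sentences. If they all have to occur within one sentence, a minimal sketch could look like the following. The helper name has_min_common_words is hypothetical, and words are split on whitespace only:

```python
def has_min_common_words(query, sentences, minimum=3):
    """Return True if any single sentence shares at least `minimum` words with query."""
    query_words = set(query.split())
    # Intersect the query words with each sentence's words separately.
    return any(len(query_words & set(s.split())) >= minimum for s in sentences)


my_string = 'she said he replied'
my_list = ['This is a cool sentence', 'This is another sentence',
           'she said hello he replied goodbye', 'she replied',
           'Some more sentences in here', 'et cetera et cetera...']

print(has_min_common_words(my_string, my_list))  # True
```

With the sample data this prints True, because 'she said hello he replied goodbye' on its own contains four of the search words.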

Regarding part 1, I think this should work, and I would recommend using a regex rather than str.split for finding words. You could also use nltk.word_tokenize if your sentences have complex words and punctuation. Both are slower than str.split, but they are useful if you need them.

Here are a couple of decent posts highlighting the differences (wordpunct_tokenize is basically a word regex in disguise):

nltk wordpunct_tokenize vs word_tokenize

Python re.split() vs nltk word_tokenize and sent_tokenize

import re

num_matches = 3

def get_words(text):
    # Raw string so that \w is not treated as a string escape
    return re.findall(r'\w+', text)

my_string = 'she said he replied'
my_list = ['This is a cool sentence', 'This is another sentence','she said hello he replied goodbye', 'she replied', 'Some more sentences in here', 'et cetera et cetera...']

my_string_word_set = set(get_words(my_string))
my_list_words_set = [set(get_words(x)) for x in my_list]

result = [len(my_string_word_set.intersection(x)) >= num_matches for x in my_list_words_set]
print(result)

Results in:

[False, False, True, False, False, False]
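As a small illustration of why a regex can be preferable to str.split when punctuation is present, compare the two on the last sentence of my_list. This is only a sketch of the difference, not a benchmark:

```python
import re

sentence = 'et cetera et cetera...'

# Whitespace splitting keeps trailing punctuation attached to the word.
split_words = sentence.split()
# \w+ extracts runs of word characters, so the dots are dropped.
regex_words = re.findall(r'\w+', sentence)

print(split_words)  # ['et', 'cetera', 'et', 'cetera...']
print(regex_words)  # ['et', 'cetera', 'et', 'cetera']
```

With str.split, 'cetera...' and 'cetera' count as two different words, which would throw off a set intersection.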

For part 2, something like this should work, though it's not a super clean solution. If you want the two words to be next to each other rather than merely in order, check that the indexes are 1 apart instead.

words = get_words(my_string)
first_and_last = [words[0], words[-1]]
my_list_dicts = []
for sentence in my_list:
    word_dict = {}
    sentence_words = get_words(sentence)
    for i, word in enumerate(sentence_words):
        word_dict[word] = i
    my_list_dicts.append(word_dict)

result2 = []
for word_dict in my_list_dicts:
    if all(k in word_dict for k in first_and_last) and word_dict[first_and_last[0]] < word_dict[first_and_last[1]]:
        result2.append(True)
    else:
        result2.append(False)

print(result2)

Result:

[False, False, True, True, False, False]
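The "next to each other" variant mentioned above could be sketched as follows. The function name first_last_adjacent is hypothetical, and the sketch assumes the query contains at least one word. It compares consecutive word pairs instead of storing one index per word, so it also copes with words that repeat within a sentence:

```python
import re


def get_words(text):
    return re.findall(r'\w+', text)


def first_last_adjacent(query, sentence):
    """True if the first and last words of query appear side by side, in that order."""
    words = get_words(query)
    first, last = words[0], words[-1]
    sentence_words = get_words(sentence)
    # Pair each word with its immediate successor and look for (first, last).
    return any(a == first and b == last
               for a, b in zip(sentence_words, sentence_words[1:]))


my_string = 'she said he replied'
my_list = ['This is a cool sentence', 'This is another sentence',
           'she said hello he replied goodbye', 'she replied',
           'Some more sentences in here', 'et cetera et cetera...']

result3 = [first_last_adjacent(my_string, s) for s in my_list]
print(result3)  # [False, False, False, True, False, False]
```

Only 'she replied' matches here, since in the third sentence other words sit between "she" and "replied".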

Use the inherent coding that True is 1 and False is 0. Sum the values of the in results:

if sum(word in item for item in my_list for word in my_string.split()) >= 3:
    print('we happy')

For your given input, this prints we happy.

Re: mamun's point, we also want to make sure that whole words match. You'll need to split each string in my_list to get the list of available words. kaya3 already posted what I would tell you to do.
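To make the whole-word concern concrete, here is a small sketch using one sentence from the question. The in operator on a string is a substring test, so "he" is found inside "another":

```python
my_string = 'she said he replied'
sentence = 'This is another sentence'

# Substring test: 'he' matches because it occurs inside 'another'.
substring_hits = [w for w in my_string.split() if w in sentence]

# Whole-word test: compare against the set of the sentence's split words.
sentence_words = set(sentence.split())
whole_word_hits = [w for w in my_string.split() if w in sentence_words]

print(substring_hits)   # ['he']
print(whole_word_hits)  # []
```

Because the sum above counts such substring hits, splitting each sentence first is the safer way to count matches.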

You can also use flashtext for this:

from flashtext import KeywordProcessor

kw_list = my_string.split()
kp = KeywordProcessor()
kp.add_keywords_from_list(kw_list)  # register the keywords you are looking for

def func_(x):
    kw = kp.extract_keywords(x)  # returns every keyword found in the string
    return len(set(kw))  # number of unique keywords found in the string

print(list(map(func_, my_list)))
# Output: [0, 0, 4, 2, 0, 0]


