How to match exact strings from a list to a larger string, taking whitespace into account?

Question

I have a large list of strings and I want to check whether each string occurs in a larger string. The list contains single-word strings as well as multi-word strings. To do so I have written the following code:

example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache"

emptylist = []
for i in example_text:
    res = [ele for ele in example_list if(ele in i)]
    emptylist.append(res)

However, the problem is that 'pain' is also added to emptylist, which it should not be: I only want an entry from example_list to be added if it exactly matches a word or phrase in the text. I also tried using sets:

word_set = set(example_list)
phrase_set = set(example_text.split())
word_set.intersection(phrase_set)

This, however, chops up 'morning sickness' into 'morning' and 'sickness'. Does anyone know the correct way to tackle this problem?

Answer 1

Using PyParsing:

import pyparsing as pp

example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache morning sickness"

list_of_matches = []

for word in example_list:
  rule = pp.OneOrMore(pp.Keyword(word))
  for t, s, e in rule.scanString(example_text):
    if t:
      list_of_matches.append(t[0])

print(list_of_matches)

Which yields:

['headache', 'sickness', 'morning sickness']

Answer 2

Nice examples have already been provided in this post by other members.

I made the example text a little more challenging, with 'pain' occurring more than once. I also wanted a little more information about where each match starts. I ended up with the code below, which I ran on the following sentence.

"The patient has not only kneepain but headache and arm pain, stomach pain and sickness"

import re
from collections import defaultdict

example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has not only kneepain but headache and arm pain, stomach pain and sickness"

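TruthFalseDict = defaultdict(list)
for i in example_list:
    # re.escape keeps any regex metacharacters in the search term literal
    MatchedTruths = re.finditer(r'\b%s\b' % re.escape(i), example_text)
    # finditer returns an iterator (always truthy), so just loop over it
    for j in MatchedTruths:
        TruthFalseDict[i].append(j.start())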

print(dict(TruthFalseDict))

The above gives me the following output.

{'pain': [55, 69], 'headache': [38], 'sickness': [78]}
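
As a possible variation on the loop above (a rough sketch, not the only way to do it), all terms can be combined into one alternation and scanned in a single pass. Sorting the terms longest-first means a phrase such as 'morning sickness' is reported as the phrase rather than as the shorter 'sickness'. The function name find_term_positions is made up for this illustration.

```python
import re
from collections import defaultdict

def find_term_positions(terms, text):
    """Map each matched term to the list of start offsets where it occurs."""
    # Longest terms first, so 'morning sickness' is tried before 'sickness'
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps regex metacharacters in the terms literal
    alternation = '|'.join(re.escape(t) for t in ordered)
    pattern = re.compile(r'\b(?:' + alternation + r')\b')
    positions = defaultdict(list)
    for match in pattern.finditer(text):
        positions[match.group(0)].append(match.start())
    return dict(positions)

example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
print(find_term_positions(example_list, "headache, morning sickness and arm pain"))
```

Note that this differs from the per-term loop in one respect: because matches do not overlap, a 'sickness' that is part of a matched 'morning sickness' is not counted separately.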

Answer 3

You should be able to use a regex with word boundaries:

>>> import re
>>> [word for word in example_list if re.search(r'\b{}\b'.format(word), example_text)]
['headache']

This will not match 'pain' in 'kneepain', since 'pain' does not begin at a word boundary there. It will, however, correctly match list entries that contain whitespace, such as 'morning sickness'.
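
To make the whitespace case concrete, here is a small self-contained sketch of the same word-boundary idea. It assumes the list entries are plain words or phrases and wraps each one in re.escape so that characters like '.' or '+' would be treated literally. The helper name find_exact_terms and the second sample sentence are invented for this example.

```python
import re

def find_exact_terms(terms, text):
    """Return the terms that occur in text as whole words or phrases."""
    found = []
    for term in terms:
        # \b on both sides rejects matches inside longer words ('kneepain')
        pattern = r'\b' + re.escape(term) + r'\b'
        if re.search(pattern, text):
            found.append(term)
    return found

example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']

print(find_exact_terms(example_list, "The patient has kneepain as wel as a headache"))
# ['headache']
print(find_exact_terms(example_list, "She reports morning sickness"))
# ['sickness', 'morning sickness']
```

In the second sentence both 'sickness' and 'morning sickness' are returned, because each term is checked independently. If only the longest phrase is wanted, the terms would need to be filtered afterwards.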

Answer 1
1 2020-01-08 16:46:32

Answer 2
1 accepted 2020-01-08 18:29:16

Answer 3
0 2020-01-08 15:55:18
