Identifying strings while excluding substrings in Python

Question

I have a dictionary of words with their frequencies as follows.

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}

I have a string (with punctuation marks removed) as follows.

recipes_book = "For todays lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"

In the above string I need to output only "biscuit pudding", "yummy tim tam" and "milk" by referring to the dictionary, but NOT "sugar", because it appears only as part of "rawsugar" in the string.

However, the code I am currently using outputs "sugar" as well.

import re

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"
searcher = re.compile(r'{}'.format("|".join(mydictionary.keys())), flags=re.I | re.S)

for match in searcher.findall(recipes_book):
    print(match)

How can I avoid matching substrings like that and only consider full tokens such as 'milk'? Please help me.

Answer 1

Use the word boundary \b around each term. For example:

>>> import re
>>> recipes_book = "For todays lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"

>>> re.findall(r'(?is)(\bchocolates\b|\bbiscuit pudding\b|\bsugar\b|\byummy tim tam\b|\bmilk\b)',recipes_book)
['biscuit pudding', 'yummy tim tam', 'milk']
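
To see why this works, here is a minimal sketch on a shortened string (the variable names are only for illustration). \b matches only at a transition between a word character and a non-word character, so 'sugar' inside 'rawsugar' is rejected:

```python
import re

text = "milk and rawsugar"

# Without boundaries, 'sugar' is found inside 'rawsugar'.
unbounded = re.findall(r'sugar', text)
print(unbounded)   # ['sugar']

# With \b on both sides, 'sugar' must stand alone as a token.
bounded = re.findall(r'\bsugar\b', text)
print(bounded)     # []

# A standalone word still matches.
whole = re.findall(r'\bmilk\b', text)
print(whole)       # ['milk']
```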

Answer 2

You can update your code with a regex word boundary:

import re

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"
searcher = re.compile(r'{}'.format("|".join(map(lambda x: r'\b{}\b'.format(x), mydictionary.keys()))), flags=re.I | re.S)

for match in searcher.findall(recipes_book):
    print(match)

Output:

biscuit pudding
yummy tim tam
milk
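
A possible variation, sketched here under a few assumptions of my own rather than taken from the answer: escape each key with re.escape, wrap the whole alternation in one \b(?:...)\b group, try longer keys first, and map each hit back to its frequency. The names keys, pattern and found are illustrative.

```python
import re

mydictionary = {'yummy tim tam': 3, 'milk': 2, 'chocolates': 5, 'biscuit pudding': 3, 'sugar': 2}
recipes_book = "For todays lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"

# Longest keys first, so a longer phrase wins over a shorter key sharing its prefix.
keys = sorted(mydictionary, key=len, reverse=True)

# re.escape protects keys that contain regex metacharacters;
# the non-capturing group lets one pair of \b cover every alternative.
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b'
searcher = re.compile(pattern, flags=re.I)

# Look up the frequency of each match (lower() because matching ignores case
# and the dictionary keys here are all lowercase).
found = {m.lower(): mydictionary[m.lower()] for m in searcher.findall(recipes_book)}
print(found)  # {'biscuit pudding': 3, 'yummy tim tam': 3, 'milk': 2}
```

Matches come back in the order they appear in the text, not in dictionary order.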

Answer 3

One more way, using re.escape to escape each key before adding the word boundaries:

import re

mydictionary = {'yummy tim tam':3, 'milk':2, 'chocolates':5, 'biscuit pudding':3, 'sugar':2}
recipes_book = "For today's lesson we will show you how to make biscuit pudding using yummy tim tam milk and rawsugar"

val_list = []

for i in mydictionary.keys():
    tmp_list = []
    regex_tmp = r'\b'+re.escape(str(i))+r'\b'
    tmp_list = re.findall(regex_tmp,recipes_book)
    val_list.extend(tmp_list)

print(val_list)

Output:

"C:\Program Files (x86)\Python27\python.exe" C:/Users/punddin/PycharmProjects/demo/demo.py
['yummy tim tam', 'biscuit pudding', 'milk']
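
One caveat worth noting: \b only works when a key starts and ends with a word character. The sketch below uses hypothetical keys and text (not from the question) to show a key ending in '+', where lookarounds for "no word character on either side" behave more predictably. The helper whole_token is an illustrative name, not a library function.

```python
import re

# Hypothetical data: 'c++' ends in non-word characters.
keys = ['c++', 'milk']
text = "we use c++ and milk but not rawmilk"

def whole_token(key):
    # (?<!\w): no word character directly before the key
    # (?!\w):  no word character directly after the key
    return r'(?<!\w)' + re.escape(key) + r'(?!\w)'

# \b fails after 'c++' because '+' and the following space are both non-word characters.
with_b = [k for k in keys if re.search(r'\b' + re.escape(k) + r'\b', text)]
with_lookaround = [k for k in keys if re.search(whole_token(k), text)]

print(with_b)           # ['milk']
print(with_lookaround)  # ['c++', 'milk']
```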

Answer 1
1 2017-10-03 10:26:32

Answer 2
0 Accepted 2017-10-03 10:30:34

Answer 3
0 2017-10-03 10:40:37