
Fastest way to compare large strings in Python

I have a dictionary of words with their frequencies as follows.

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

I have a string as follows.

recipes_book = ("For today's lesson we will show you how to make biscuit pudding using "
                "yummy tim tam and fresh milk.")

In the above string I have "biscuit pudding", "yummy tim tam" and "fresh milk" from the dictionary.

I am currently tokenizing the string to identify the words in the dictionary as follows.

words = recipes_book.split()
for word in words:
    if word in mydictionary:
        print("Match Found!")

However, it only works for one-word dictionary keys. Hence, I am interested in the fastest way (because my real recipes are very large texts) to identify the dictionary keys with more than one word. Please help me.

Build up your regex and compile it.

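import re

mydictionary = {'yummy tim tam':3, 'fresh milk':2, 'chocolates':5, 'biscuit pudding':3}

# Escape each key so it is matched literally, then join the keys into one alternation.
searcher = re.compile("|".join(re.escape(key) for key in mydictionary), flags=re.I)

for match in searcher.findall(recipes_book):
    # re.I can return text in a different case; the keys here are all lower case.
    mydictionary[match.lower()] += 1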

Output after this:

{'yummy tim tam': 4, 'biscuit pudding': 4, 'chocolates': 5, 'fresh milk': 3}
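If you want to count every occurrence without touching the original frequencies, a minimal sketch along the same lines might look like this. The helper name count_phrases is hypothetical, and the word-boundary anchors assume every key starts and ends with a letter or digit. Longer keys are tried first, so a key is never shadowed by a shorter key it begins with.

```python
import re
from collections import Counter

def count_phrases(text, phrases):
    """Count non-overlapping, whole-word occurrences of each phrase in text."""
    # Longest phrases first, so a longer key wins over a shorter key it starts with.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    # Lower-case each hit so it lines up with the (lower-case) dictionary keys.
    return Counter(m.group(0).lower() for m in pattern.finditer(text))

mydictionary = {'yummy tim tam': 3, 'fresh milk': 2, 'chocolates': 5, 'biscuit pudding': 3}
recipes_book = ("For today's lesson we will show you how to make biscuit pudding "
                "using yummy tim tam and fresh milk.")

found = count_phrases(recipes_book, mydictionary)
print(found)
```

The text is scanned once no matter how many keys there are, which is the main reason to prefer a single compiled pattern for very large recipes.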

According to some tests, the "in" operator is faster than the "re" module:

What's a faster operation, re.match/search or str.find?
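If you would rather measure this on your own data than rely on someone else's numbers, a rough sketch with timeit could look like the following. The sample text and the repeat count are made up for illustration, and the timings will vary by machine and by Python version.

```python
import re
import timeit

# Made-up sample: a long filler text with the phrase at the very end.
text = "lorem ipsum " * 10000 + "biscuit pudding"
needle = "biscuit pudding"
pattern = re.compile(re.escape(needle))

# Time a plain substring test against a precompiled regex search.
t_in = timeit.timeit(lambda: needle in text, number=200)
t_re = timeit.timeit(lambda: pattern.search(text) is not None, number=200)

print("in: %.4fs  re: %.4fs" % (t_in, t_re))
```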

There is no problem with spaces here. Supposing mydictionary is static (predefined), I think you should probably go the other way around and look for each key in the text:

for key in mydictionary:
    if key in recipes_book:
        print("Match Found!")
        mydictionary[key] += 1

In Python 3 you can loop directly over the dict, as above. In Python 2, mydictionary.iterkeys() gives you an iterator over the keys, which is good practice.
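Note that the "in" test only tells you whether a key occurs, not how often. If you need the number of occurrences, a small variation using str.count might do; count_substrings is a hypothetical helper name, and lower-casing both sides is an assumption about what should count as a match.

```python
def count_substrings(text, phrases):
    """Return {phrase: count} for every phrase that occurs in text."""
    text = text.lower()
    counts = {}
    for phrase in phrases:
        # str.count counts non-overlapping occurrences of the substring.
        n = text.count(phrase.lower())
        if n:
            counts[phrase] = n
    return counts

sample = "Fresh milk and more fresh milk, plus milkshake"
counts = count_substrings(sample, ["fresh milk", "milk", "tea"])
print(counts)
```

This is plain substring matching, so "milk" is also counted inside "milkshake". If that matters, you need a word-boundary check on top of it.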

Try it the other way around: search the large chunk of str data for each piece of text you want to find.

import re
for item in mydictionary:
    # Escape the key so that regex metacharacters in it are matched literally.
    match = re.search(re.escape(item), recipes_book, flags=re.I)
    if match:
        start, end = match.span()
        print("Match found for %s between %d and %d character span" % (match.group(0), start, end))
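re.search stops at the first hit for each key. If you need every occurrence and its position, a sketch with re.finditer could look like this; find_spans is a hypothetical helper name.

```python
import re

def find_spans(text, phrases):
    """Yield (phrase, start, end) for every case-insensitive occurrence."""
    for phrase in phrases:
        for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
            yield phrase, m.start(), m.end()

spans = list(find_spans("Fresh milk, then more fresh milk.", ["fresh milk"]))
for phrase, start, end in spans:
    print("Match found for %s between %d and %d character span" % (phrase, start, end))
```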
