
Optimized regular expression search in a text corpus in Python

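Here is my requirement in Python:

I load a dictionary in the literal sense, say from /usr/share/dict/words (not to be confused with the dict type), and then use it to search for valid words. Currently I am doing the following: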

import re

dict_list = open('dictionary', 'r').read().split()

def search_dictionary(key):
    p = re.compile(key)
    # Comment: Is 'key' a prefix for a valid word in dictionary?
    # If yes, return True, else return False
    tmp_list = [x for x in dict_list if bool(p.match(x))]
    if not tmp_list:
        ...
    else:
        ...

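Note that search_dictionary may be called many times, and it is the current bottleneck. Is there a more efficient way to do this string search, for example by precompiling the dictionary? Think of a dictionary-attack use case. I am a beginner.

Edit: I have updated the code with comments. As suggested in the comments, I may be doing more work than needed.

Your algorithm runs in O(n) time with a large constant. That seems wrong when a simple binary search can do the job in O(lg n). If your regular expression contains no special characters, why not do this: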

import bisect

with open('dictionary') as f:
    dictionary = f.read().split()

# .sort is slow, so it is better to keep the words
# sorted on disk, and/or to run many searches
# per invocation
dictionary.sort()

def bisect_search(key):
    # Index of the first word that is >= key
    i = bisect.bisect_left(dictionary, key)
    if i != len(dictionary):
        return dictionary[i].startswith(key)

    return False

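Sort the array, then find the lexicographically smallest word such that word >= key, and check whether the given key is a prefix of that word.

Speed test against Padraic's approach (look up the first letter in a dict, then search linearly):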
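To make the lookup concrete, here is a minimal sketch on a toy word list. The list contents and the name has_prefix are made up for illustration; a real run would load and sort the dictionary file instead.

```python
import bisect

# Hypothetical toy word list, sorted once up front.
words = sorted(["apple", "thud", "thumb", "thunder", "zebra"])

def has_prefix(sorted_words, key):
    """Return True if any word in sorted_words starts with key."""
    # First position whose word compares >= key.
    i = bisect.bisect_left(sorted_words, key)
    return i < len(sorted_words) and sorted_words[i].startswith(key)

print(has_prefix(words, "thu"))  # True: lands on "thud"
print(has_prefix(words, "thx"))  # False: lands on "zebra"
print(has_prefix(words, "zz"))   # False: past the end of the list
```

Every word that starts with key sorts at or after key itself, so if any such word exists, the first word >= key must be one of them. That is why a single startswith check is enough.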

In [1]: %timeit bisect_search('thu')
1000000 loops, best of 3: 1.07 µs per loop

In [2]: %timeit search_dictionary('thu')
1000 loops, best of 3: 595 µs per loop

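The dictionary contains 146,880 words, 5,760 of which start with t.

If you want to stop at the first match, use any, which short-circuits as soon as a match is found. Your code always iterates over every word in the dictionary, even when the very first word matches, and builds a list needlessly: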

import re

dict_list = open('dictionary').read().split()

def search_dictionary(key):
    p = re.compile(key)
    if any(p.match(x) for x in dict_list):
        ...

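Ideally you should also create the dictionary only once, rather than once per function call. Define it at the start of your code and pass it as an argument where needed.

If you are looking for a prefix, str.startswith may be faster: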
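A minimal sketch of that load-once, pass-as-argument layout might look like the following. The helper name load_words and the words parameter are assumptions for illustration, not part of the original code.

```python
import re

def load_words(path):
    """Read a whitespace-separated word list from path (call this once)."""
    with open(path) as f:
        return f.read().split()

def search_dictionary(key, words):
    """Return True if the regex key matches at the start of any word."""
    p = re.compile(key)  # compiled once per call, reused for every word
    return any(p.match(w) for w in words)

# Usage (the path is an assumption; adjust it for your system):
# words = load_words('/usr/share/dict/words')
# search_dictionary('thu', words)
```

Because the word list is a parameter, the file is read only where load_words is called, and the search function is easy to exercise with a small in-memory list.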

if any(x.startswith(key) for x in dict_list): 

To optimize the startswith call:

check = str.startswith
if any(check(x, key) for x in dict_list):

Or, if the key can appear anywhere in the word, simply use in:

if any(key in x for x in dict_list): 

Using the optimized str.startswith approach seems to be more efficient on CPython 2.7:

In [15]: s ="efficient"

In [16]: timeit p.match(s)
1000000 loops, best of 3: 359 ns per loop

In [17]: check = str.startswith

In [18]: timeit check(s,"eff")
1000000 loops, best of 3: 212 ns per loop

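The difference is roughly the same for non-matches.

If you build an actual dict from the dictionary, where the keys are the letters a-z and the values are lists of the words that start with that letter, you can look up the first letter of each key passed to the function and search only the words that start with the same letter.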
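To repeat this comparison outside IPython, a rough sketch with the standard timeit module could look like this. The pattern "eff" used to build p is an assumption, since the session above does not show how p was created, and absolute numbers will differ by machine and Python version.

```python
import re
import timeit

s = "efficient"
p = re.compile("eff")   # assumed pattern; not shown in the session above
check = str.startswith

n = 100000
# The lambda adds the same call overhead to both measurements.
t_regex = timeit.timeit(lambda: p.match(s), number=n)
t_starts = timeit.timeit(lambda: check(s, "eff"), number=n)

print("p.match:        %.0f ns per call" % (t_regex / n * 1e9))
print("str.startswith: %.0f ns per call" % (t_starts / n * 1e9))
```

Treat the result as a relative comparison only; the per-call overhead of the lambda inflates both figures compared with the %timeit numbers.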

from collections import defaultdict
word_dict = defaultdict(list)

with open("/usr/share/dict/words") as f:
    for line in f:
        line = line.rstrip().lower()
        if line:  # skip blank lines, which would make line[0] fail
            word_dict[line[0]].append(line)

You can see sample output using the key "z":

word_dict["z"]
['z', "z's", 'zachariah', "zachariah's", 'zachary', "zachary's", 'zachery', "zachery's", 'zagreb', "zagreb's", 'zaire', "zaire's", 'zairian', 'zambezi', "zambezi's", 'zambia', "zambia's", 'zambian', "zambian's", 'zambians', 'zamboni', 'zamenhof', "zamenhof's", 'zamora', 'zane', "zane's", 'zanuck', "zanuck's", 'zanzibar', "zanzibar's", 'zapata', 'zaporozhye', 'zapotec', 'zappa', "zappa's", 'zara', "zara's", 'zebedee', 'zechariah', 'zedekiah', "zedekiah's", 'zedong', "zedong's", 'zeffirelli', "zeffirelli's", 'zeke', "zeke's", 'zelig', 'zelma', "zelma's", 'zen', "zen's", 'zenger', "zenger's", 'zeno', "zeno's", 'zens', 'zephaniah', 'zephyrus', 'zeppelin', 'zest', "zest's", 'zeus', "zeus's", 'zhengzhou', 'zhivago', "zhivago's", 'zhukov', 'zibo', "zibo's", 'ziegfeld', 'ziegler', "ziegler's", 'ziggy', "ziggy's", 'zimbabwe', "zimbabwe's", 'zimbabwean', "zimbabwean's", 'zimbabweans', 'zimmerman', "zimmerman's", 'zinfandel', "zinfandel's", 'zion', "zion's", 'zionism', "zionism's", 'zionisms', 'zionist', "zionist's", 'zionists', 'zions', 'ziploc', 'zn', "zn's", 'zoe', "zoe's", 'zola', "zola's", 'zollverein', 'zoloft', 'zomba', "zomba's", 'zorn', 'zoroaster', "zoroaster's", 'zoroastrian', "zoroastrian's", 'zoroastrianism', "zoroastrianism's", 'zoroastrianisms', 'zorro', "zorro's", 'zosma', "zosma's", 'zr', "zr's", 'zsigmondy', 'zubenelgenubi', "zubenelgenubi's", 'zubeneschamali', "zubeneschamali's", 'zukor', "zukor's", 'zulu', "zulu's", 'zulus', 'zuni', 'zwingli', "zwingli's", 'zworykin', 'zyrtec', "zyrtec's", 'zyuganov', "zyuganov's", 'zürich', "zürich's", 'z', 'zanier', 'zanies', 'zaniest', 'zaniness', "zaniness's", 'zany', "zany's", 'zap', "zap's", 'zapped', 'zapping', 'zaps', 'zeal', "zeal's", 'zealot', "zealot's", 'zealots', 'zealous', 'zealously', 'zealousness', "zealousness's", 'zebra', "zebra's", 'zebras', 'zebu', "zebu's", 'zebus', 'zed', "zed's", 'zeds', 'zenith', "zenith's", 'zeniths', 'zephyr', "zephyr's", 'zephyrs', 'zeppelin', "zeppelin's", 'zeppelins', 'zero', 
"zero's", 'zeroed', 'zeroes', 'zeroing', 'zeros', 'zest', "zest's", 'zestful', 'zestfully', 'zests', 'zeta', 'zigzag', "zigzag's", 'zigzagged', 'zigzagging', 'zigzags', 'zilch', "zilch's", 'zillion', "zillion's", 'zillions', 'zinc', "zinc's", 'zinced', 'zincing', 'zincked', 'zincking', 'zincs', 'zing', "zing's", 'zinged', 'zinger', "zinger's", 'zingers', 'zinging', 'zings', 'zinnia', "zinnia's", 'zinnias', 'zip', "zip's", 'zipped', 'zipper', "zipper's", 'zippered', 'zippering', 'zippers', 'zippier', 'zippiest', 'zipping', 'zippy', 'zips', 'zircon', "zircon's", 'zirconium', "zirconium's", 'zircons', 'zit', "zit's", 'zither', "zither's", 'zithers', 'zits', 'zodiac', "zodiac's", 'zodiacal', 'zodiacs', 'zombi', "zombi's", 'zombie', "zombie's", 'zombies', 'zombis', 'zonal', 'zone', "zone's", 'zoned', 'zones', 'zoning', 'zonked', 'zoo', "zoo's", 'zoological', 'zoologist', "zoologist's", 'zoologists', 'zoology', "zoology's", 'zoom', "zoom's", 'zoomed', 'zooming', 'zooms', 'zoos', 'zucchini', "zucchini's", 'zucchinis', 'zwieback', "zwieback's", 'zygote', "zygote's", 'zygotes']

So use word_dict[key[0]] to get the list of candidate words:

def search_dictionary(key):
    check = str.startswith
    vals = word_dict[key[0]]
    if any(check(x, key) for x in vals):
        ...

I am not sure whether case matters to you, so you may want to remove the lower() call as needed.
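If you do keep the lower() call, the search key needs the same normalization, otherwise an upper-case key will never match. A minimal sketch, assuming case-insensitive matching is wanted; build_index and has_prefix are hypothetical names, and returning False for an empty key is a choice made here, not something the original specifies.

```python
from collections import defaultdict

def build_index(words):
    """Group lower-cased words by their first character."""
    index = defaultdict(list)
    for w in words:
        w = w.strip().lower()
        if w:  # skip empty entries
            index[w[0]].append(w)
    return index

def has_prefix(index, key):
    """Case-insensitive prefix test against the index."""
    key = key.lower()
    if not key:
        return False  # assumption: an empty key matches nothing
    # .get avoids creating empty buckets for unseen letters
    return any(w.startswith(key) for w in index.get(key[0], ()))

index = build_index(["Zagreb", "zebra", "Apple"])
print(has_prefix(index, "ZAG"))  # True
print(has_prefix(index, "zo"))   # False
```

Using index.get instead of index[...] keeps the defaultdict from growing a new empty list every time an unseen first letter is queried.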
