
Fuzzy String Searching with Whoosh in Python

I've built up a large database of banks in MongoDB. I can easily take this information and create indexes with it in Whoosh. For example, I'd like to be able to match the bank names 'Eagle Bank & Trust Co of Missouri' and 'Eagle Bank and Trust Company of Missouri'. The following code works for simple fuzzy searches, but cannot match the two names above:

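import os

from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(name=TEXT(stored=True))
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")  # create_in() expects the directory to exist
ix = create_in("indexdir", schema)
writer = ix.writer()

test_items = [u"Eagle Bank and Trust Company of Missouri"]

# Add each test item to the index
for item in test_items:
    writer.add_document(name=item)
writer.commit()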

from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

with ix.searcher() as s:
    qp = QueryParser("name", schema=ix.schema, termclass=FuzzyTerm)
    q = qp.parse(u"Eagle Bank & Trust Co of Missouri")
    results = s.search(q)
    print(results)

gives me:

<Top 0 Results for And([FuzzyTerm('name', u'eagle', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'bank', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'trust', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'co', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'missouri', boost=1.000000, minsimilarity=0.500000, prefixlength=1)]) runtime=0.00166392326355>

Is it possible to achieve what I want with Whoosh? If not, what other Python-based solutions do I have?

You could match Co with Company using fuzzy search in Whoosh, but you shouldn't, because the edit distance between Co and Company is large. Co is about as similar to Company as Be is to Beast, or ny to Company. You can imagine how large and how noisy the search results would be.

However, if you want to match Compan, compani or Companee to Company, you can do it with a custom subclass of FuzzyTerm whose default maxdist is 2 or more:

maxdist – The maximum edit distance from the given text.

from whoosh.query import FuzzyTerm

class MyFuzzyTerm(FuzzyTerm):
    # Same as FuzzyTerm, but with a default maxdist of 2
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

Then:

 qp = QueryParser("name", schema=ix.schema, termclass=MyFuzzyTerm)

You could match Co with Company by setting maxdist to 5, but as I said, this gives poor search results. I suggest keeping maxdist between 1 and 3.
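To check the distances behind these numbers, here is a minimal pure-Python sketch of Levenshtein edit distance. The `levenshtein` helper is my own illustration, not Whoosh's implementation, and Whoosh's internal matching may differ in detail (for example, `prefixlength` also requires the first characters to match exactly). The terms are lowercased because that is how they appear in the query output above.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


# Pairs taken from the answer above: (query term, indexed term)
pairs = [
    ("co", "company"),
    ("be", "beast"),
    ("ny", "company"),
    ("compan", "company"),
    ("compani", "company"),
    ("companee", "company"),
]

for query, indexed in pairs:
    print("%-9s -> %-8s distance %d" % (query, indexed, levenshtein(query, indexed)))
```

The misspellings of company sit at distance 1 or 2, while co needs 5 edits. A maxdist large enough to bridge that gap would also admit many unrelated short words.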

If you want to match linguistic variations of a word, you are better off using whoosh.query.Variations.

Note: older Whoosh versions have minsimilarity instead of maxdist.

For future reference: there must be a better way to do this somehow, but here's my shot.

# -*- coding: utf-8 -*-
import whoosh
from whoosh.index import create_in
from whoosh.fields import *
from whoosh.query import *
from whoosh.qparser import QueryParser

schema = Schema(name=TEXT(stored=True))
idx = create_in("C:\\idx_name\\", schema, "idx_name")

writer = idx.writer()

writer.add_document(name=u"This is craaazy shit")
writer.add_document(name=u"This is craaazy beer")
writer.add_document(name=u"Raphaël rocks")
writer.add_document(name=u"Rockies are mountains")

writer.commit()

s = idx.searcher()
print("Fields: %s" % list(s.lexicon("name")))
qp = QueryParser("name", schema=schema, termclass=FuzzyTerm)

# Widen the allowed edit distance until something matches
for i in range(1, 40):
    res = s.search(FuzzyTerm("name", "just rocks", maxdist=i, prefixlength=0))
    if len(res) > 0:
        for r in res:
            print("Potential match ( %s ): [  %s  ]" % (i, r["name"]))
        break
    else:
        print("Pass: %s" % i)

s.close()

Perhaps some of this might help (a string matching library open sourced by the SeatGeek folks):

https://github.com/seatgeek/fuzzywuzzy
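To get a feel for whole-string similarity scoring of this kind without installing anything, here is a rough sketch using the standard library's difflib as a stand-in. It is not fuzzywuzzy itself, so fuzzywuzzy's own API and scores may differ. The `similarity` helper and the second candidate name are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


query = "Eagle Bank & Trust Co of Missouri"
candidates = [
    "Eagle Bank and Trust Company of Missouri",
    "First National Bank of Kansas",  # hypothetical non-matching name
]

# Pick the stored name that scores highest against the query
best = max(candidates, key=lambda name: similarity(query, name))
print("%s (score %.2f)" % (best, similarity(query, best)))
```

Scoring the full name this way tolerates "&" vs "and" and "Co" vs "Company" together, at the cost of comparing the query against every candidate instead of using an index.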

For anyone stumbling across this question more recently, it looks like they've added fuzzy support natively, though it'd take a bit of work to satisfy the particular use case outlined here: https://whoosh.readthedocs.io/en/latest/parsing.html

You could use the function below to fuzzy search a set of words against a phrase (it checks each word of the phrase for a plain substring match in the text):

def FuzzySearch(text, phrase):
    """Check if word in phrase is contained in text"""
    for word in phrase.split(" "):
        if word in text:
            print("Match! Found " + word + " in text")
