Fuzzy String Searching with Whoosh in Python
I've built up a large database of banks in MongoDB. I can easily take this information and create indexes with it in Whoosh. For example, I'd like to be able to match the bank names 'Eagle Bank & Trust Co of Missouri' and 'Eagle Bank and Trust Company of Missouri'. The following code works for simple fuzzy searches, but cannot achieve a match on the above:
from whoosh.index import create_in
from whoosh.fields import *

schema = Schema(name=TEXT(stored=True))
ix = create_in("indexdir", schema)
writer = ix.writer()
test_items = [u"Eagle Bank and Trust Company of Missouri"]
# Index every test name as its own document.
for item in test_items:
    writer.add_document(name=item)
writer.commit()

from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

with ix.searcher() as s:
    # Parse every query word as a FuzzyTerm instead of an exact Term.
    qp = QueryParser("name", schema=ix.schema, termclass=FuzzyTerm)
    q = qp.parse(u"Eagle Bank & Trust Co of Missouri")
    results = s.search(q)
    print(results)
gives me:
<Top 0 Results for And([FuzzyTerm('name', u'eagle', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'bank', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'trust', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'co', boost=1.000000, minsimilarity=0.500000, prefixlength=1), FuzzyTerm('name', u'missouri', boost=1.000000, minsimilarity=0.500000, prefixlength=1)]) runtime=0.00166392326355>
Is it possible to achieve what I want with Whoosh? If not, what other Python-based solutions do I have?
You could match Co with Company using fuzzy search in Whoosh, but you shouldn't, because the difference between Co and Company is large. Co is as similar to Company as Be is to Beast, or as ny is to Company, so you can imagine how large and how poor the search results would be.
However, if you want to match Compan, compani or Companee to Company, you can do it with a custom subclass of FuzzyTerm whose default maxdist is 2 or more:
maxdist – The maximum edit distance from the given text.
class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
Then:
qp = QueryParser("name", schema=ix.schema, termclass=MyFuzzyTerm)
You could match Co with Company by setting maxdist to 5, but as I said, this gives bad search results. I suggest keeping maxdist between 1 and 3.
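To see where these numbers come from, here is a minimal sketch of a plain Levenshtein edit distance in pure Python. The helper name edit_distance is made up for illustration and is not part of Whoosh; Whoosh's own fuzzy matching may differ in details such as the prefixlength handling. The words are lowercased on the assumption that the field's analyzer lowercases tokens.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn `a` into `b`.
    (Illustrative helper, not a Whoosh function.)"""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or keep it)
            ))
        previous = current
    return previous[-1]


target = "company"
for word in ("co", "compan", "compani", "companee"):
    print(word, edit_distance(word, target))
# co 5
# compan 1
# compani 1
# companee 2
```

The typo-like variants sit within distance 2 of company, while co needs 5 edits, which is why a maxdist large enough to catch it would also let many unrelated short words through.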
If you are looking to match linguistic variations of a word, you are better off using whoosh.query.Variations.
Note: older Whoosh versions have minsimilarity instead of maxdist.
For future reference: there must be a better way to do this, but here's my shot.
# -*- coding: utf-8 -*-
import whoosh
from whoosh.index import create_in
from whoosh.fields import *
from whoosh.query import *
from whoosh.qparser import QueryParser

schema = Schema(name=TEXT(stored=True))
idx = create_in("C:\\idx_name\\", schema, "idx_name")
writer = idx.writer()
writer.add_document(name=u"This is craaazy shit")
writer.add_document(name=u"This is craaazy beer")
writer.add_document(name=u"Raphaël rocks")
writer.add_document(name=u"Rockies are mountains")
writer.commit()

s = idx.searcher()
print("Fields: %s" % list(s.lexicon("name")))
qp = QueryParser("name", schema=schema, termclass=FuzzyTerm)
# Widen the allowed edit distance step by step and stop at the first hits.
for i in range(1, 40):
    res = s.search(FuzzyTerm("name", "just rocks", maxdist=i, prefixlength=0))
    if len(res) > 0:
        for r in res:
            print("Potential match ( %s ): [ %s ]" % (i, r["name"]))
        break
    else:
        print("Pass: %s" % i)
s.close()
Perhaps some of this stuff might help (string matching open sourced by the SeatGeek guys):
https://github.com/seatgeek/fuzzywuzzy
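For a feel of this kind of string scoring without installing anything, here is a rough sketch using difflib from the standard library; it does not use fuzzywuzzy itself. The ABBREVIATIONS table and the normalize and similarity helpers are assumptions made up for this example, so extend the table to whatever your bank data actually contains.

```python
import difflib
import re

# Hypothetical abbreviation table for illustration; not from any library.
ABBREVIATIONS = {"&": "and", "co": "company"}


def normalize(name):
    """Lowercase, split into words (keeping '&'), and expand abbreviations."""
    tokens = re.findall(r"[a-z0-9]+|&", name.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def similarity(a, b):
    """Similarity ratio in [0, 1] from difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()


query = "Eagle Bank & Trust Co of Missouri"
stored = "Eagle Bank and Trust Company of Missouri"

raw_score = similarity(query.lower(), stored.lower())
normalized_score = similarity(normalize(query), normalize(stored))
print("raw: %.3f, normalized: %.3f" % (raw_score, normalized_score))
```

Whole-string scoring already rates the two names as close, and expanding known abbreviations before comparing makes them identical. The same normalization could be applied to names before indexing and querying them in Whoosh, so that fuzzy matching only has to absorb real typos.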
For anyone stumbling across this question more recently, it looks like Whoosh has added fuzzy support natively, though it would take a bit of work to satisfy the particular use case outlined here: https://whoosh.readthedocs.io/en/latest/parsing.html
You could use the function below to fuzzy search a set of words against a phrase:
def FuzzySearch(text, phrase):
    """Print each word of phrase that occurs as a substring of text."""
    for word in phrase.split(" "):
        if word in text:
            print("Match! Found " + word + " in text")
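Note that the function above only finds exact substrings, so a misspelled word will not match. If typos should be tolerated, one possible variant is sketched below with difflib.get_close_matches from the standard library. The function name fuzzy_search and the 0.8 cutoff are assumptions chosen for this example, not values from the answer above.

```python
import difflib


def fuzzy_search(text, phrase, cutoff=0.8):
    """Map each word of `phrase` to its closest word in `text`.

    Words with no candidate at or above `cutoff` (a similarity ratio
    between 0 and 1) are left out of the result.
    """
    text_words = text.lower().split()
    matches = {}
    for word in phrase.lower().split():
        close = difflib.get_close_matches(word, text_words, n=1, cutoff=cutoff)
        if close:
            matches[word] = close[0]
    return matches


found = fuzzy_search("Eagle Bank and Trust Company of Missouri",
                     "Compani Misouri xyz")
print(found)
```

Returning a dict instead of printing makes the result easier to reuse. Lowering the cutoff accepts looser matches at the cost of more false positives.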