Python: Whoosh search for a non-exact query
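Is it possible to use Whoosh to search for documents that do not match the query exactly but are very close to it? For example, the document is missing just one of the words in the query.
I wrote some simple code that works as long as every document contains all the words in the query: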
import os.path
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser

# Create the index directory on first run
if not os.path.exists("index"):
    os.mkdir("index")

schema = Schema(title=TEXT(stored=True))
ix = create_in("index", schema)
ix = open_dir("index")

# Index three sample titles
writer = ix.writer()
writer.add_document(title=u'TV Ultra HD')
writer.add_document(title=u'TV HD')
writer.add_document(title=u'TV 4K Ultra HD')
writer.commit()

# Search the title field; by default every query word must be present
with ix.searcher() as searcher:
    parser = QueryParser('title', ix.schema)
    myquery = parser.parse(u'TV HD')
    results = searcher.search(myquery)
    for result in results:
        print(result)
Unfortunately, if I change the query to one of the following, I no longer find all 3 documents (or find none at all):
myquery = parser.parse(u'TV Ultra HD') # 2 Hits
myquery = parser.parse(u'TV 4K Ultra HD') # 1 Hit
myquery = parser.parse(u'TV HD 2022') # 0 Hit
Is it possible to build a parser so that any of these queries still returns all 3 documents, even though the title field differs slightly?
I guess you can use the * operator of the query parser:
...
with ix.searcher() as searcher:
    parser = QueryParser('title', ix.schema)
    myquery = parser.parse(u'TV Ultra*') # 3
    myquery = parser.parse(u'TV 4K Ultra*') # 3
    myquery = parser.parse(u'TV HD*') # 3
    results = searcher.search(myquery)
    for i in results:
        print(i)
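After some thought, I ended up with a plain enumeration of all word combinations.
I added a variable tolerance: the maximum number of words that can be removed from the original request. I also added a separate method, getResults(words, tolerance).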
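To make the enumeration concrete, here is a small sketch that lists the sub-queries produced when a given number of words is left out. It does not touch Whoosh at all. The function name variants and its dropped parameter are hypothetical and only serve the illustration.

```python
from itertools import combinations


def variants(words, dropped):
    """Return the sub-queries obtained by leaving out exactly `dropped`
    words, keeping the remaining words in their original order."""
    keep = len(words) - dropped
    if keep <= 0:
        return []
    return [' '.join(combo) for combo in combinations(words, keep)]


words = 'TV HD 2022'.split(' ')
tolerance = 1
for dropped in range(tolerance + 1):
    print(dropped, variants(words, dropped))
# 0 ['TV HD 2022']
# 1 ['TV HD', 'TV 2022', 'HD 2022']
```

The number of sub-queries grows combinatorially with the query length and the tolerance, and each sub-query is a separate search.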
The final code is:
import os.path
from itertools import combinations
from typing import Optional

from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser
from whoosh.searching import Results


def getResults(words: list, tolerance: int) -> Optional[Results]:
    # Uses the module-level `parser` and `searcher` created further down.
    # Tries every combination of the query words with `tolerance` words
    # left out and returns the first non-empty result set.
    keep = len(words) - tolerance
    if keep <= 0:
        return None
    for variant in combinations(words, keep):
        myquery = parser.parse(' '.join(variant))
        results = searcher.search(myquery)
        if results:
            return results
    return None


if not os.path.exists("index"):
    os.mkdir("index")

schema = Schema(title=TEXT(stored=True, spelling=True))
ix = create_in("index", schema)
ix = open_dir("index")

writer = ix.writer()
writer.add_document(title=u'TV Ultra HD')
writer.add_document(title=u'TV 4K Ultra HD')
writer.add_document(title=u'TV HD 2022')
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser('title', ix.schema)
    words = u'TV HD 2022'.split(' ')
    tolerance = 1  # New variable
    results = getResults(words, tolerance)
    # getResults returns None when no variant matches
    for result in results or []:
        print(result)
The result is 3 hits:
<Hit {'title': 'TV Ultra HD'}>
<Hit {'title': 'TV HD 2022'}>
<Hit {'title': 'TV 4K Ultra HD'}>
But I think this is the wrong approach, because it seems to me that Whoosh can do this more concisely.