
Python: Whoosh search for a non-exact query

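Is it possible to use Whoosh to search for documents that do not match the query exactly but are very close to it? For example, a document that lacks just one of the words in the query.

I wrote some simple code that works as long as the query matches all of the documents: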

import os.path
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser


if not os.path.exists("index"):
    os.mkdir("index")

schema = Schema(title=TEXT(stored=True))
ix = create_in("index", schema)
ix = open_dir("index")

writer = ix.writer()
writer.add_document(title=u'TV Ultra HD')
writer.add_document(title=u'TV HD')
writer.add_document(title=u'TV 4K Ultra HD')
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser('title', ix.schema)
    myquery = parser.parse(u'TV HD')
    results = searcher.search(myquery)
    
    for result in results:
        print(result)

Unfortunately, if I change the query to one of the following, I no longer get all 3 documents (or I get none at all):

myquery = parser.parse(u'TV Ultra HD')  # 2 Hits
myquery = parser.parse(u'TV 4K Ultra HD')  # 1 Hit
myquery = parser.parse(u'TV HD 2022')  # 0 Hit

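Is it possible to set up the parser so that any of these queries still returns all 3 documents, even when the title field differs slightly?

I guess you could use the * (wildcard) operator in the parsed query: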

...

with ix.searcher() as searcher:
    parser = QueryParser('title', ix.schema)
    myquery = parser.parse(u'TV Ultra*')  # 3
    myquery = parser.parse(u'TV 4K Ultra*')  # 3 
    myquery = parser.parse(u'TV HD*')  # 3
    results = searcher.search(myquery)
    
    for i in results:
        print(i)

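After some thought, I ended up with a plain enumeration of word combinations.

I added a variable tolerance, the number of words to drop from the original query, and a separate function getResults(words, tolerance).

The final code is: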

import os.path
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir
from whoosh.qparser import QueryParser
from whoosh.searching import Results
from itertools import combinations


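def getResults(words: list, tolerance: int) -> Results:
    # Relies on the module-level `parser` and `searcher` created in the
    # `with` block below, so call it only while the searcher is open.
    size = len(words) - tolerance  # each variant drops exactly `tolerance` words

    if size <= 0:
        return None

    # Try every combination of the remaining words and return the first
    # non-empty result set (None if nothing matches).
    for variant in combinations(words, size):
        myquery = parser.parse(' '.join(variant))
        results = searcher.search(myquery)

        if results:
            return results

    return None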


if not os.path.exists("index"):
    os.mkdir("index")

schema = Schema(title=TEXT(stored=True, spelling=True))
ix = create_in("index", schema)
ix = open_dir("index")

writer = ix.writer()
writer.add_document(title=u'TV Ultra HD')
writer.add_document(title=u'TV 4K Ultra HD')
writer.add_document(title=u'TV HD 2022')
writer.commit()

with ix.searcher() as searcher:
    parser = QueryParser('title', ix.schema)
    words = u'TV HD 2022'.split(' ')
    tolerance = 1  # New variable
    results = getResults(words, tolerance)
    
    # getResults() returns None when nothing matches, so fall back to an empty list
    for result in results or []:
        print(result)

The result is 3 hits:

<Hit {'title': 'TV Ultra HD'}>
<Hit {'title': 'TV HD 2022'}>
<Hit {'title': 'TV 4K Ultra HD'}>
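
To make it easier to see which query strings such an enumeration sends to the parser, here is a small pure-Python sketch. It is a variant of the idea, not the code above: the helper name query_variants and its max_dropped parameter are made up for illustration, and it treats the limit as a maximum, trying the full query first and then dropping one more word at each step.

```python
from itertools import combinations


def query_variants(words, max_dropped):
    """Yield query strings, dropping 0, 1, ... up to max_dropped words."""
    for dropped in range(max_dropped + 1):
        size = len(words) - dropped
        if size <= 0:
            # Nothing left to search for once every word has been dropped.
            break
        for variant in combinations(words, size):
            yield ' '.join(variant)


variants = list(query_variants(u'TV HD 2022'.split(), 1))
print(variants)
# ['TV HD 2022', 'TV HD', 'TV 2022', 'HD 2022']
```

Note that returning the first non-empty result set from this ordering would stop at the full query 'TV HD 2022' (which only matches one title), so it behaves differently from the code above, which always drops exactly tolerance words.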

But I think this is a poor solution, because it seems to me that this can be done more concisely in Whoosh itself.

