
Why is Whoosh search performing worse than TfidfVectorizer in sklearn?

I implemented a basic search program based on sklearn's TfidfVectorizer (almost all default options) that finds documents matching a user query.

I also tried to implement the same search using Whoosh in Python. The standalone TfidfVectorizer implementation returns many intuitive results for a query, but the Whoosh search returns only one for the same query. (Even that one disappears when I search across more fields, leaving 0 results.) What am I doing wrong here?

I have tried setting the scoring in the Whoosh searcher according to the Whoosh docs:

with myindex.searcher(weighting=scoring.TF_IDF()) as s:

With this, I assumed it would give results roughly similar to the sklearn TfidfVectorizer implementation, but it returns only one hit. How do I get similar results, i.e., make Whoosh behave like sklearn's TfidfVectorizer?

Also, when I search multiple fields using MultifieldParser(["title", "content", "tags", "categories"], ix.schema) instead of only the single field "content", the search returns no hits.

Schema:

from whoosh.analysis import StandardAnalyzer, StemmingAnalyzer
from whoosh.fields import DATETIME, ID, KEYWORD, NUMERIC, TEXT, Schema

schema = Schema(id=NUMERIC,
                title=TEXT(field_boost=2.0, stored=True, analyzer=StandardAnalyzer(minsize=1)),
                content=TEXT(stored=False, analyzer=StemmingAnalyzer(minsize=1)),
                permalink=ID(stored=True),
                tags=KEYWORD(field_boost=2.0, lowercase=True, commas=True, scorable=True, stored=True),
                categories=KEYWORD(field_boost=2.0, lowercase=True, commas=True, scorable=True, stored=True),
                pub_date=DATETIME(stored=True),
                creator=TEXT(stored=False)
                )

Indexing and searching:

import sys

from whoosh import scoring
from whoosh.qparser import MultifieldParser

# Index every row of the DataFrame (ix is the already-created index).
writer = ix.writer()
for i in range(len(df)):
    writer.add_document(id=df["ID"][i], title=df["Title"][i], content=df["Content"][i],
                        permalink=df["Permalink"][i], tags=df["Tag"][i], categories=df["Category"][i],
                        pub_date=df["PubDate"][i], creator=df["Creator"][i])
writer.commit()

# Search all four fields, scoring hits with TF-IDF.
with ix.searcher(weighting=scoring.TF_IDF()) as searcher:
    parser = MultifieldParser(["title", "content", "tags", "categories"], ix.schema)
    query_string = sys.argv[2]
    myquery = parser.parse(query_string)
    results = searcher.search(myquery, limit=10, terms=True)
    # len(results) counts all matching documents, not just the top `limit`.
    print(len(results))
    # scored_length() is the number of scored hits actually returned.
    for i in range(results.scored_length()):
        print(results[i])
        print()
    print("\n")

The code works and does fetch results. The only problem is that the results seem weaker than those from the sklearn TF-IDF implementation, and in most cases there are fewer of them (the limit argument to the Whoosh search is not the cause). I want to know how to get better results or scoring in Whoosh, and why it returns fewer results than the sklearn implementation.

Example output for the query "How to code?"

TF-IDF (sklearn):

30 Tips to Become Super Effective Software Developers
(Cosine Similarity of 0.3783876779183675 ):

Automation and Continuous Delivery are the bedrock of DevOps
(Cosine Similarity of 0.1476918570123896 ):

Practical Implementation of DevOps Step by Step
(Cosine Similarity of 0.1469115686911894 ):

10 Software Development Frustrations & What You Can Do To Avoid Them!
(Cosine Similarity of 0.13241987064219532 ):

Whoosh (when searching only the "content" field; otherwise it returns 0 hits):

<Hit {'title': 'Ultimate List of 110 Must Read Software Development Books'}>

EDIT: I just ran the code again and found that if I remove "?" from the query "How to code?" and search only in "title" and "content", it returns quite a few results, and they seem better too. However, as soon as I include "tags" and "categories" in the fields to search, the results drop to 0. Why is that?

The ? character is treated as a wildcard by Whoosh's query parser. I'm experimenting with Whoosh right now and noticed that when I search for one:

query = QueryParser("content", ix.schema).parse("one")

I get:

<Top 1 Results for Term('content', 'one') runtime=0.0006002392619848251>

While if I search for one?:

query = QueryParser("content", ix.schema).parse("one?")

I get:

<Top 0 Results for Wildcard('content', 'one?') runtime=0.0002482738345861435>

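As you can see in the second example, the query was parsed as a Wildcard instead of a Term, so one? only matches terms consisting of "one" followed by exactly one more character. Read more here: https://whoosh.readthedocs.io/en/latest/querylang.html#inexact-terms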
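
If the ? comes from natural-language input and is not meant as a wildcard, one option is to strip wildcard characters from the query string before handing it to the parser. The following is only a minimal sketch: strip_wildcards is a hypothetical helper (not part of Whoosh), and it assumes that ? and * never carry wildcard meaning in your queries.

```python
import re

# Characters that Whoosh's default query language interprets as wildcards.
_WILDCARD_CHARS = re.compile(r"[?*]")


def strip_wildcards(query_string: str) -> str:
    """Replace wildcard characters with spaces and collapse extra whitespace.

    Hypothetical helper for illustration; assumes the caller never wants
    wildcard matching.
    """
    cleaned = _WILDCARD_CHARS.sub(" ", query_string)
    return " ".join(cleaned.split())


if __name__ == "__main__":
    # The cleaned string would be passed to parser.parse(...) instead of the raw input.
    print(strip_wildcards("How to code?"))
```

The cleaned string can then be passed to parser.parse(...) in place of the raw query. Alternatively, the parser itself can be told not to recognize wildcards by calling parser.remove_plugin_class(qparser.WildcardPlugin) after constructing it, which should leave the rest of the query syntax unchanged.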
