
Searching and indexing hyphenated words with Whoosh

I am using Whoosh to index and search a large number of documents, and many of the things I need to search on are hyphenated. Whoosh seems to treat hyphens as a special character of some kind, but for the life of me I can't figure out its behavior.

Can anyone advise on how Whoosh treats hyphens while indexing and searching?

By default, Whoosh treats a hyphen, like most punctuation, as if it were a space. Assuming a default AND search, the query `dual-scale thermometer` is equivalent to `dual AND scale AND thermometer`. This will find a document containing `dual-scale digital thermometer`, but it will also find `dual purpose bathroom scale with thermometer`.
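To make the effect concrete, here is a minimal sketch that imitates the tokenization with the standard `re` module instead of calling Whoosh. It assumes the default token pattern is `\w+(\.?\w+)*` and ignores stop-word filtering; `tokenize` is a hypothetical helper, not a Whoosh function.

```python
import re

# Assumption: this mirrors the default token pattern of Whoosh's RegexTokenizer.
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")

def tokenize(text, pattern=DEFAULT_PATTERN):
    """Hypothetical helper: return the lowercased pattern matches, in order."""
    return [m.group(0).lower() for m in pattern.finditer(text)]

query_tokens = tokenize("dual-scale thermometer")
doc_a = tokenize("dual-scale digital thermometer")
doc_b = tokenize("dual purpose bathroom scale with thermometer")

# An AND query matches when every query token occurs somewhere in the document.
matches_a = set(query_tokens) <= set(doc_a)
matches_b = set(query_tokens) <= set(doc_b)

print(query_tokens)          # the hyphen has split "dual-scale" into two tokens
print(matches_a, matches_b)  # both documents satisfy the AND query
```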

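One way to avoid this is to turn the hyphenated words in your query into phrases: `"dual-scale" thermometer`, which is equivalent to `"dual scale" AND thermometer`. A phrase matches only when its words are adjacent and in order, so the bathroom scale document no longer matches.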
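If queries come from users, the quoting could be applied automatically before the string reaches the query parser. The following is a naive sketch: `quote_hyphenated` is a hypothetical helper, and it only skips hyphenated words that touch an existing quote character, so it does not understand the full query syntax (a hyphenated word in the middle of a longer quoted phrase would be quoted again).

```python
import re

# A run of word characters joined by single hyphens, e.g. "dual-scale",
# that is not directly adjacent to a quote, a hyphen, or another word character.
HYPHENATED = re.compile(r'(?<![\w"-])\w+(?:-\w+)+(?![\w"-])')

def quote_hyphenated(query):
    """Hypothetical helper: wrap each bare hyphenated word in double quotes."""
    return HYPHENATED.sub(lambda m: '"' + m.group(0) + '"', query)

print(quote_hyphenated("dual-scale thermometer"))
print(quote_hyphenated('"dual-scale" thermometer'))  # already quoted, left alone
```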

You could also force Whoosh to accept hyphens as part of a word. You do this by overriding the `RegexTokenizer` expression in the `StandardAnalyzer` with a regular expression that accepts hyphens as a valid part of a token. The index has to be built with this analyzer for the change to affect what is stored.

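    from whoosh import fields, analysis

    # [\w-]+ lets a token contain hyphens as well as word characters
    myanalyzer = analysis.StandardAnalyzer(expression=r'[\w-]+(\.?\w+)*')
    # Use the custom analyzer for this field at both index and query time
    schema = fields.Schema(myfield=fields.TEXT(analyzer=myanalyzer))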

Now a search for `dual-scale thermometer` is equivalent to `dual-scale AND thermometer` and will find `dual-scale digital thermometer` but not `dual purpose bathroom scale with thermometer`.

However, you won't be able to search for the parts of a hyphenated word independently. If your document contained `high-quality components`, a search for `quality` would not match it; only `high-quality` would, because that is now a single token. Because of this side effect, I would recommend the phrase approach unless the hyphens in your content are strictly limited to truly atomic hyphenated words.
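The trade-off can be seen by comparing the two expressions side by side. This sketch again uses plain `re` rather than Whoosh itself, assumes `\w+(\.?\w+)*` is the default pattern, and uses a hypothetical `tokenize` helper.

```python
import re

DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")    # assumed Whoosh default
HYPHEN_PATTERN = re.compile(r"[\w-]+(\.?\w+)*")  # expression from the answer

def tokenize(text, pattern):
    """Hypothetical helper: return the lowercased pattern matches, in order."""
    return [m.group(0).lower() for m in pattern.finditer(text)]

doc = "high-quality components"
default_tokens = tokenize(doc, DEFAULT_PATTERN)
hyphen_tokens = tokenize(doc, HYPHEN_PATTERN)

print(default_tokens)  # "quality" is its own token, so it can be searched alone
print(hyphen_tokens)   # "high-quality" is one token, "quality" is not indexed
print("quality" in default_tokens, "quality" in hyphen_tokens)
```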
