tf/idf boosting within field

My use case is this: for the query iphone charger, results named iphone charger coupons get a higher relevance score than results named iphone charger, possibly because of better matches in description and other fields. Boosting the name field doesn't help much unless I skew its importance drastically. What I really need is a tf/idf boost within the name field.

To quote the Elasticsearch blog:

the frequency of a term in a field is offset by the length of the field. However, the practical scoring function treats all fields in the same way. It will treat all title fields (because they are short) as more important than all body fields (because they are long).

I need to boost this "more important" value for a particular field. Can we do this with function score, or in some other way?

A one-term difference in length is not much of a difference to the scoring algorithm (and, in fact, can vanish entirely due to imprecision in the length norm). If there are hits on other fields, you have a lot of scoring elements to fight against.
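To put a rough number on that, here is a minimal sketch assuming the classic TF/IDF field-length norm of 1 / sqrt(number of terms). The function name is made up for illustration, and the sketch ignores the lossy compression Lucene applies when it stores the norm, which can shrink the gap further.

```python
import math

def field_length_norm(num_terms):
    """Classic TF/IDF length norm: shorter fields get a larger value."""
    return 1.0 / math.sqrt(num_terms)

# "iphone charger" has 2 terms, "iphone charger coupons" has 3.
norm_two_terms = field_length_norm(2)
norm_three_terms = field_length_norm(3)

# Relative advantage of the shorter name from the norm alone.
ratio = norm_two_terms / norm_three_terms
print(f"2 terms: {norm_two_terms:.3f}, 3 terms: {norm_three_terms:.3f}, ratio: {ratio:.3f}")
```

Under this assumption the shorter name gains only a modest factor, which a decent match on description can easily outweigh.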

A dis_max would probably be a reasonable approach to this. Instead of all the additive scores, coords, and such that you are trying to overcome, it will simply select the score of the best matching subquery. If you boost the query against title, you can ensure that matches there are strongly preferred.

You can then assign a "tie_breaker", so that the description subquery contributes only a fraction of its score and, in practice, decides the ordering mainly when "title" scores are tied.

{
    "dis_max" : {
        "tie_breaker" : 0.2,
        "queries" : [
            {
                "terms" : {
                    "title" : ["iphone", "charger"],
                    "boost" : 10
                }
            },
            {
                "terms" : {
                    "description" : ["iphone", "charger"]
                }
            }
        ]
    }
}
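To see why this helps, here is a small sketch of how dis_max is documented to combine subquery scores: the best score, plus tie_breaker times the scores of the other matching subqueries. The helper name and the sample scores are invented for illustration; real scores depend on your index.

```python
def dis_max_score(subquery_scores, tie_breaker=0.0):
    """Combine subquery scores the way dis_max does.

    None means the subquery did not match the document.
    """
    matching = [s for s in subquery_scores if s is not None]
    if not matching:
        return 0.0
    best = max(matching)
    # Every matching subquery other than the best adds only a fraction.
    return best + tie_breaker * (sum(matching) - best)

# Hypothetical per-document scores: [boosted title subquery, description subquery]
title_and_description = dis_max_score([10.0, 1.5], tie_breaker=0.2)
title_only = dis_max_score([10.0, None], tie_breaker=0.2)
description_only = dis_max_score([None, 3.0], tie_breaker=0.2)

print(title_and_description, title_only, description_only)
```

With a tie_breaker of 0.2, the description score can reorder documents whose title scores are equal or very close, but a strong description match alone cannot overtake a boosted title match.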

Another approach to this sort of thing, if you absolutely need to know when you have an exact match against the entire field, is to separately index an untokenized version of that field, and query that field as well. Any match against the untokenized version of the field will be an exact match against the entire field contents. This would save you from having to rely on the length norm to make that determination.
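A minimal sketch of that idea, written as Python dicts that serialize to the JSON request bodies. The field names (name, name.raw, description) and the boost values are assumptions for illustration. The keyword sub-field is how recent Elasticsearch versions index an untokenized copy; older versions used a string field with "index": "not_analyzed" instead.

```python
import json

def build_mapping():
    """Mapping with an analyzed name field plus an untokenized name.raw sub-field."""
    return {
        "properties": {
            "name": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            },
            "description": {"type": "text"},
        }
    }

def build_query(user_query):
    """dis_max query that prefers an exact whole-field match on name.raw."""
    return {
        "query": {
            "dis_max": {
                "tie_breaker": 0.2,
                "queries": [
                    # Matches only when the entire name equals the query string.
                    {"term": {"name.raw": {"value": user_query, "boost": 20}}},
                    {"match": {"name": {"query": user_query, "boost": 10}}},
                    {"match": {"description": {"query": user_query}}},
                ],
            }
        }
    }

print(json.dumps(build_mapping(), indent=2))
print(json.dumps(build_query("iphone charger"), indent=2))
```

Note that a term query against a keyword field is case-sensitive and compares the whole string, so the user's input may need the same normalization that was applied at index time.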
