Interpretation of LM formula used in LMJelinekMercerSimilarity method in Lucene

I found this return statement in the score method of LMJelinekMercerSimilarity:

protected float score(BasicStats stats, float freq, float docLen) {
    return stats.getBoost()
            * (float) Math.log(1 + ((1 - alpha) * freq / docLen)
                            / (alpha * ((LMStats) stats).getCollectionProbability()));
}

In theory, this return statement should correspond to "(1-lambda) * P(w|d) + lambda * P(w|Collection)".

But I cannot understand how they are related. Can anyone help?

Lucene's implementation differs slightly from the textbook language model.

The language model actually computes the probability of a query (in the specified context), and it uses the Jelinek-Mercer (JM) smoothing model:

(1-lambda) * P(w|d) + lambda * P(w|Collection)

But Lucene rearranges this expression algebraically. It factors lambda * P(w|Collection) out of the language model formula and obtains:

lambda * P(w|Collection) * ( ( (1-lambda) * P(w|d) ) / ( lambda * P(w|Collection) ) + 1 )
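
As a quick check on this step, expanding the product gives back the original mixture. The shorthand P_d for P(w|d) and P_C for P(w|Collection) is introduced here only for brevity, and the step assumes lambda * P(w|Collection) > 0:

```latex
\lambda P_C \left( \frac{(1-\lambda)\, P_d}{\lambda P_C} + 1 \right)
  = (1-\lambda)\, P_d + \lambda P_C
```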

The first factor, lambda * P(w|Collection), depends only on the term and the collection, not on the document. It is therefore the same for every document being ranked for a given query and does not affect the final ranking, so Lucene drops it and is left with:

( (1-lambda) * P(w|d) ) / ( lambda * P(w|Collection) ) + 1

In information retrieval it is common to work in log scale. The query likelihood is a product of per-term probabilities, and taking the log turns that product into a sum that is easier to handle; since log is monotonic, the ranking is unchanged. That is why Lucene applies the log function:

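log( ( (1-lambda) * P(w|d) ) / ( lambda * P(w|Collection) ) + 1 )

Comparing this with the return statement in the question: alpha in the code plays the role of lambda, freq / docLen is P(w|d), and getCollectionProbability() is P(w|Collection). stats.getBoost() is a separate multiplier applied on top of the formula.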
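
The relationship between the two forms can also be checked numerically. The following is a minimal standalone sketch, not Lucene code: the class name JmScoreSketch, both method names, and the sample numbers are made up for illustration, and the boost factor is left out. It compares the log of the full JM mixture, minus the log of the dropped factor, against the expression Lucene returns.

```java
public class JmScoreSketch {

    // Full Jelinek-Mercer mixture: (1 - lambda) * P(w|d) + lambda * P(w|Collection),
    // with P(w|d) estimated as freq / docLen.
    public static double smoothedProbability(double lambda, double freq,
                                             double docLen, double collectionProb) {
        return (1 - lambda) * (freq / docLen) + lambda * collectionProb;
    }

    // Per-term score in the same shape as the return statement quoted in the question
    // (boost omitted; lambda stands in for the field called alpha there).
    public static double luceneStyleScore(double lambda, double freq,
                                          double docLen, double collectionProb) {
        return Math.log(1 + ((1 - lambda) * freq / docLen) / (lambda * collectionProb));
    }

    public static void main(String[] args) {
        // Hypothetical values chosen only for the demonstration.
        double lambda = 0.7;
        double collectionProb = 0.001;
        double freq = 3;
        double docLen = 100;

        double logFull = Math.log(smoothedProbability(lambda, freq, docLen, collectionProb));
        double logDropped = Math.log(lambda * collectionProb); // document-independent part
        double score = luceneStyleScore(lambda, freq, docLen, collectionProb);

        // The two printed values should agree up to floating-point rounding.
        System.out.println("log(mixture) - log(lambda * P(w|C)) = " + (logFull - logDropped));
        System.out.println("Lucene-style score                  = " + score);
    }
}
```

One consequence of this form is that with freq = 0 the score is log(1) = 0, so a query term that does not occur in a document adds nothing to that document's score.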

Hope this explanation helps!
