How to extend Lucene with a language model?

Question

Good evening everyone. My project is to "extend Lucene with a language model". I tried adding two or three lines to my code, like this. In search.java:

LMDirichletSimilarity similarity = new LMDirichletSimilarity(2000f);  
indexSearcher.setSimilarity(similarity);

and in index.java:

LMDirichletSimilarity similarity = new LMDirichletSimilarity(2000f);
config.setSimilarity(similarity);

But I don't think it's that easy? Maybe I should write an algorithm or something? If you have any answers, please help me. Thank you ^_^

Answer 1

Lucene has a language model similarity in LMJelinekMercerSimilarity. Its scoring method is implemented as:

protected float score(BasicStats stats, float freq, float docLen) {
    return stats.getBoost()
            * (float) Math.log(1 + ((1 - alpha) * freq / docLen)
                            / (alpha * ((LMStats) stats).getCollectionProbability()));
}

This method implements the formula (1-lambda) * P(w|d) + lambda * P(w|Collection), where alpha in the code plays the role of lambda and freq / docLen is P(w|d). If you compare the method above with the language model formula, you will see a small difference between them.

That is because Lucene factors lambda * P(w|Collection) out of the language model formula, which gives: lambda * P(w|Collection) * ( ( (1-lambda) * P(w|d) ) / ( lambda * P(w|Collection) ) + 1 ). It then drops the leading factor lambda * P(w|Collection), since that factor does not depend on the document and so does not affect the ranking, and computes only ( (1-lambda) * P(w|d) ) / ( lambda * P(w|Collection) ) + 1. You can see this is close to the method above.

The remaining difference is the logarithm. The IR community works with logarithms because they turn a product of per-term probabilities into a sum, which is easier to handle and to evaluate numerically. So the final expression is: log( ( (1-lambda) * P(w|d) ) / ( lambda * P(w|Collection) ) + 1 ).

The method above is protected, so you can subclass the similarity, override that method, and implement your own scoring.
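To make the factorization above concrete, here is a minimal standalone sketch in plain Java. It deliberately does not use Lucene: the class name JmRankingDemo, its helper methods, and all the numbers (lambda, the collection probability, the term frequencies and document lengths) are hypothetical and chosen only for illustration. It computes both the full mixture and the factored log form for a single query term, so you can check that they order the documents the same way.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Standalone illustration (no Lucene dependency); all names here are hypothetical. */
final class JmRankingDemo {

    /** Full Jelinek-Mercer mixture: (1 - lambda) * P(w|d) + lambda * P(w|Collection). */
    static double mixture(double lambda, double freq, double docLen, double collectionProb) {
        return (1 - lambda) * (freq / docLen) + lambda * collectionProb;
    }

    /** Factored log form, mirroring the score method quoted above with the boost left out. */
    static double logScore(double lambda, double freq, double docLen, double collectionProb) {
        return Math.log(1 + ((1 - lambda) * freq / docLen) / (lambda * collectionProb));
    }

    /** Returns document indices sorted by descending score. */
    static Integer[] rank(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer idx) -> -scores[idx]));
        return order;
    }

    public static void main(String[] args) {
        double lambda = 0.7;           // assumed smoothing weight
        double collectionProb = 0.001; // assumed P(w|Collection)
        double[][] docs = {{3, 100}, {1, 20}, {0, 50}, {5, 400}}; // {freq, docLen} per document

        double[] full = new double[docs.length];
        double[] logs = new double[docs.length];
        for (int i = 0; i < docs.length; i++) {
            full[i] = mixture(lambda, docs[i][0], docs[i][1], collectionProb);
            logs[i] = logScore(lambda, docs[i][0], docs[i][1], collectionProb);
        }
        System.out.println("mixture ranking:  " + Arrays.toString(rank(full)));
        System.out.println("log-form ranking: " + Arrays.toString(rank(logs)));
    }
}
```

For each document, log(mixture) equals logScore plus log(lambda * P(w|Collection)), and that last term is the same for every document, which is why dropping it leaves the order unchanged. Note also that a document that does not contain the term gets a log-form score of exactly 0.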
