
How to configure tokenizers for indexing and searching with Lucene and NHibernate

This is a question about using Lucene via the NHibernate.Search namespace, which works in conjunction with Lucene.

I'm indexing a title in the index: Grey's Anatomy

Title : "Grey's Anatomy"

Using Luke, I can see that the title is being tokenized into:

Title: anatomy
Title: grey

Now, I get a result if I search for:

"grey" or "grey's"

However, if I search for "greys" then I get nothing.

I would like "greys" to return a result. I guess this could be an issue with any word that contains an apostrophe.

So, here are some questions:

  1. Am I right in thinking I could fix this either by changing something at index time (changing the tokenizer?) or at query time (the query parser?)?
  2. If there is a solution, could someone provide a small code sample?

Thanks

If you run a classic Term search with Lucene, then "greys" will most likely not show up in the results unless you do careful tokenizing work at index time. As I see it, you have two choices, or a third that combines them:

  1. Use a stemmer for both the indexed data and the query. Stemmers are fast, and implementations of Porter's stemmer are easy to find online. The drawback is that a stemmer is language-specific, so it becomes a problem when you need to search in other languages.
  2. Use fuzzy queries. With a fuzzy query you can set the edit distance you are willing to move "away" from the word being searched. The catch is that two words being "close" under an edit distance (e.g., Levenshtein) doesn't mean they are the same word, but the Grey / Grey's / Greys problem should be solved by setting an edit distance of 2.
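
To make the fuzzy option concrete, here is a minimal sketch in plain Python of what an edit-distance match does. It is not Lucene or NHibernate.Search code: the `levenshtein` and `fuzzy_match` helpers are hypothetical names written for illustration, and the indexed terms are simply the two tokens reported by Luke in the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein edit distance between a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(query, indexed_terms, max_edits=2):
    """Hypothetical helper: indexed terms within max_edits of the query term."""
    return [t for t in indexed_terms if levenshtein(query, t) <= max_edits]


# Tokens that Luke shows for "Grey's Anatomy" (taken from the question).
indexed_terms = ["grey", "anatomy"]

print(fuzzy_match("greys", indexed_terms))  # "greys" is one edit from "grey"
print(fuzzy_match("great", indexed_terms))  # two edits from "grey": a false positive
```

The second query shows the caveat from the list above: "great" is within two edits of "grey" even though it is a different word, so a larger edit distance trades precision for recall.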

I think you will be able to find a decent implementation of the Porter Stemmer right here.
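
The stemming option only works if the same analysis runs at index time and at query time. The sketch below illustrates that idea in plain Python, under loud assumptions: `analyze`, `build_index`, and `search` are hypothetical helpers, the in-memory dictionary stands in for a real Lucene index, and the suffix stripping is a toy rule, not the Porter algorithm.

```python
import re


def analyze(text):
    """Toy analyzer (an assumption, not Porter): lowercase, tokenize,
    drop a possessive 's, then strip a trailing plural 's'."""
    terms = []
    for token in re.findall(r"[a-z']+", text.lower()):
        token = token.strip("'")
        if token.endswith("'s"):
            token = token[:-2]              # "grey's" -> "grey"
        token = token.replace("'", "")
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]              # "greys" -> "grey"
        if token:
            terms.append(token)
    return terms


def build_index(titles):
    """Map each analyzed term to the set of title positions containing it."""
    index = {}
    for doc_id, title in enumerate(titles):
        for term in analyze(title):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Analyze the query with the SAME analyzer, then look up each term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)


index = build_index(["Grey's Anatomy"])
for query in ("grey", "grey's", "greys"):
    print(query, search(index, query))
```

Because both the stored title and the query pass through `analyze`, all three spellings reduce to the same term and find the title. A real stemmer handles far more cases than this two-rule stand-in, which would mangle words such as "this".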

Hope this helps!
