
Lucene Index problems with “-” character

I'm having trouble with a Lucene index which has indexed words that contain "-" characters.

It works for some words that contain "-", but not for all of them, and I can't find the reason why it's not working.

The field I'm searching in is analyzed and contains versions of the word with and without the "-" character.

I'm using the analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer

Here is an example:

If I search for "gsx-*" I get a result; the indexed field contains "SUZUKI GSX-R 1000 GSX-R1000 GSXR".

But if I search for "v-*" I get no result. The indexed field of the expected result contains: "SUZUKI DL 1000 V-STROM DL1000V-STROMVSTROM V STROM"

If I search for "v-strom" without the "*" it works, but if I just search for "v-str", for example, I don't get the result. (There should be a result, because this is for a live search on a webshop.)

So, what's the difference between the two expected results? Why does it work for "gsx-*" but not for "v-*"?

StandardAnalyzer will treat the hyphen as whitespace, I believe. So it turns your query "gsx-*" into "gsx*" and "v-*" into nothing, because it also eliminates single-letter tokens. What you see as the field contents in the search result is the stored value of the field, which is completely independent of the terms that were indexed for that field.
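You can check what the analyzer actually emits with a small sketch like this (a minimal, hypothetical test harness; it assumes Lucene 5+, where StandardAnalyzer has a no-argument constructor — in 4.x you'd pass a Version):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDemo {
    // Print the terms an analyzer produces for a given text.
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term);
            }
            stream.end();
        }
    }

    public static void main(String[] args) throws IOException {
        // The hyphen acts as a token boundary here, so "V-STROM" is not
        // indexed as a single term by the StandardAnalyzer.
        printTokens(new StandardAnalyzer(), "SUZUKI DL 1000 V-STROM");
    }
}

Comparing the printed terms with the stored field value makes the difference described above visible.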

So what you want is for "v-strom" as a whole to be an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe have a go with the WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of throwing together your own analyzer, or starting off with those two just mentioned and composing them with further TokenFilters. A very good explanation is given in the Lucene Analysis package Javadoc.
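For the roll-your-own option, here is a minimal sketch (assuming Lucene 5.x/6.x; the filter packages moved slightly in later versions): tokenize on whitespace only so hyphens survive, then lowercase:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// Keeps hyphens inside tokens: "V-STROM" is indexed as the single term "v-strom".
public class HyphenKeepingAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();     // split on whitespace only
        TokenStream filter = new LowerCaseFilter(source); // normalize case
        return new TokenStreamComponents(source, filter);
    }
}

With a term like v-strom actually in the index, a prefix query such as v-str* then has something to match.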

BTW there's no need to enter all the variants in the index, like V-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string, both in the index and while parsing the query.

ClassicAnalyzer handles '-' as a useful, non-delimiter character. As I understand it, ClassicAnalyzer handles '-' like the pre-3.1 StandardAnalyzer, because ClassicAnalyzer uses ClassicTokenizer, which treats numbers with an embedded '-' as a product code, so the whole thing is tokenized as one term.

When I was at the Regenstrief Institute I noticed this after upgrading Luke, as the LOINC standard medical terms (LOINC was initiated by RI) are identified by a number followed by a '-' and a check digit, like '1-8' or '2857-1'. My searches for LOINCs like '45963-6' failed using StandardAnalyzer in Luke 3.5.0, but succeeded with ClassicAnalyzer (and this was because we had built the index with Lucene.NET 2.9.2).
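A quick way to see this behavior (a sketch, assuming a Lucene 5.x/6.x layout where ClassicAnalyzer lives in org.apache.lucene.analysis.standard and has a no-argument constructor):

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ClassicDemo {
    public static void main(String[] args) throws IOException {
        // ClassicTokenizer treats alphanumerics with an embedded '-' next to
        // digits as product codes, so "GSX-R1000" and a LOINC like "45963-6"
        // come through as single (lowercased) terms instead of being split.
        try (ClassicAnalyzer analyzer = new ClassicAnalyzer();
             TokenStream stream = analyzer.tokenStream("f", "GSX-R1000 45963-6")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term);
            }
            stream.end();
        }
    }
}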

(Based on Lucene 4.7) StandardTokenizer splits hyphenated words in two: for example, "chat-room" becomes "chat" and "room", and the two words are indexed separately instead of as a single whole word. It is quite common for separate words to be connected with a hyphen: "sport-mad", "camera-ready", "quick-thinking", and so on. A significant number are hyphenated names, such as "Emma-Claire". When doing a whole-word search or query, users expect to find the word within those hyphens. While there are some cases where they are separate words, that's why Lucene keeps the hyphen out of the default definition.

To add support for hyphens in StandardAnalyzer, you have to make changes in StandardTokenizerImpl.java, which is a class generated by jFlex.

Refer to this link for the complete guide.

You have to add the following line in SUPPLEMENTARY.jflex-macro, which is included by the StandardTokenizerImpl.jflex file:

 MidLetterSupp = ( [\u002D]  ) 

After making the change, provide the StandardTokenizerImpl.jflex file as input to the jFlex engine and click generate. The output will be StandardTokenizerImpl.java.

Then rebuild the index using that class file.

The ClassicAnalyzer is recommended for indexing text containing product codes like 'GSX-R1000'. It will recognize this as a single term and will not split up its parts. But, for example, the text 'Europe/Berlin' will be split up by the ClassicAnalyzer into the words 'Europe' and 'Berlin'. This means that if you have text indexed by the ClassicAnalyzer containing the phrase

Europe/Berlin GSX-R1000

you can search for "europe", "berlin" or "GSX-R1000".

But be careful which analyzer you use for the search. I think the best choice for searching a Lucene index is the KeywordAnalyzer. With the KeywordAnalyzer you can also search for specific fields in a document, and you can build complex queries like:

(processid:4711) (berlin) 

This query will search for documents with the phrase 'berlin', but also for a field 'processid' containing the number 4711.

But if you search the index for the phrase "europe/berlin", you will get no result! This is because the KeywordAnalyzer does not change your search phrase, while the phrase 'Europe/Berlin' was split up into two separate words by the ClassicAnalyzer. This means you have to search for 'europe' and 'berlin' separately.

To solve this conflict, you can translate a search term entered by the user into a search query that fits your needs, using the following code:

QueryParser parser = new QueryParser("content", new ClassicAnalyzer());
Query result = parser.parse(searchTerm);
searchTerm = result.toString("content");

This code will translate the search phrase

Europe/Berlin

into

europe berlin

which will result in the expected document set.

Note: This will also work for more complex situations. The search term

Europe/Berlin GSX-R1000

will be translated into:

(europe berlin) GSX-R1000

which will search correctly for all phrases in combination using the KeywordAnalyzer.
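Put together, the flow this answer describes might look like the following sketch (the field name "content" and the class name are just placeholders for this example; the constructors assume Lucene 5+):

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class SearchTermTranslator {
    // Normalize the user's input with the same analyzer used at index time,
    // then parse the normalized string with the KeywordAnalyzer so that the
    // resulting terms are taken as-is.
    public static Query translate(String userInput) throws ParseException {
        QueryParser indexSide = new QueryParser("content", new ClassicAnalyzer());
        String normalized = indexSide.parse(userInput).toString("content");
        QueryParser searchSide = new QueryParser("content", new KeywordAnalyzer());
        return searchSide.parse(normalized);
    }
}

The resulting Query can then be handed to an IndexSearcher as usual.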
