Hibernate Search with Lucene does not index similar names correctly
I'm learning Hibernate Search 6.1.3.Final with Lucene 8.11.1 as the backend and Spring Boot 2.6.6. I'm trying to create a search for product names, barcodes and manufacturers. Currently, I'm writing an integration test to see what happens when a couple of products have similar names:
@Test
void shouldFindSimilarTobaccosByQuery() {
    var tobaccoGreen = TobaccoBuilder.builder()
            .name("TobaCcO GreEN")
            .build();
    var tobaccoRed = TobaccoBuilder.builder()
            .name("TobaCcO ReD")
            .build();
    var tobaccoGreenhouse = TobaccoBuilder.builder()
            .name("TobaCcO GreENhouse")
            .build();

    tobaccoRepository.saveAll(List.of(tobaccoGreen, tobaccoRed, tobaccoGreenhouse));

    webTestClient
            .get().uri("/tobaccos?query=green")
            .exchange()
            .expectStatus().isOk()
            .expectBodyList(Tobacco.class)
            .value(tobaccos -> assertThat(tobaccos)
                    .hasSize(2)
                    .contains(tobaccoGreen, tobaccoGreenhouse)
            );
}
As you can see in the test, I expect to obtain the two tobaccos with similar names, tobaccoGreen and tobaccoGreenhouse, by using green as the query for the search criteria. The entity is the following:
@Data
@Entity
@Indexed
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
@EqualsAndHashCode(of = "id")
@EntityListeners(AuditingEntityListener.class)
public class Tobacco {

    @Id
    @GeneratedValue
    private UUID id;

    @NotBlank
    @KeywordField
    private String barcode;

    @NotBlank
    @FullTextField(analyzer = "name")
    private String name;

    @NotBlank
    @FullTextField(analyzer = "name")
    private String manufacturer;

    @CreatedDate
    private Instant createdAt;

    @LastModifiedDate
    private Instant updatedAt;
}
I have followed the docs and configured an analyzer for names:
@Component("luceneTobaccoAnalysisConfigurer")
public class LuceneTobaccoAnalysisConfigurer implements LuceneAnalysisConfigurer {

    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.analyzer("name").custom()
                .tokenizer("standard")
                .tokenFilter("lowercase")
                .tokenFilter("asciiFolding");
    }
}
And I'm using a simple query with the fuzzy option:
@Component
@AllArgsConstructor
public class IndexSearchTobaccoRepository {

    private final EntityManager entityManager;

    public List<Tobacco> find(String query) {
        return Search.session(entityManager)
                .search(Tobacco.class)
                .where(f -> f.match()
                        .fields("barcode", "name", "manufacturer")
                        .matching(query)
                        .fuzzy()
                )
                .fetch(10)
                .hits();
    }
}
The test shows that the search is only able to find tobaccoGreen but not tobaccoGreenhouse, and I don't understand why. How can I search for similar product names (or barcodes, manufacturers)?
Before I answer your question, I'd like to point out that calling .fetch(10).hits() is suboptimal, especially when using the default sort (as you do):
return Search.session(entityManager)
        .search(Tobacco.class)
        .where(f -> f.match()
                .fields("barcode", "name", "manufacturer")
                .matching(query)
                .fuzzy()
        )
        .fetch(10)
        .hits();
If you call .fetchHits(10) directly, Lucene can skip part of the search (the part where it counts the total hit count), and on large indexes this can lead to sizeable performance gains. So, do this instead:
return Search.session(entityManager)
        .search(Tobacco.class)
        .where(f -> f.match()
                .fields("barcode", "name", "manufacturer")
                .matching(query)
                .fuzzy()
        )
        .fetchHits(10);
Now, the actual answer:
.fuzzy() isn't magic; it won't just match anything you think should match :) There's a specific definition of what it does, and that's not what you want here.
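Concretely, fuzzy matching is based on edit distance: by default it only matches terms within a Levenshtein distance of at most 2 from the query term. A standalone sketch (plain Java, no Lucene) shows why "greenhouse" is far outside that range for the query "green":

```java
public class FuzzyDistanceDemo {

    // Classic dynamic-programming Levenshtein (edit) distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // 1 edit away: a fuzzy query for "green" would match this.
        System.out.println(levenshtein("green", "greed"));      // 1
        // 5 edits away: far beyond the default maximum of 2.
        System.out.println(levenshtein("green", "greenhouse")); // 5
    }
}
```

So fuzzy matching is meant to tolerate typos, not to match prefixes or compound words.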
To get the behavior you want, you could use this instead of your current predicate:
.where(f -> f.simpleQueryString()
        .fields("barcode", "name", "manufacturer")
        .matching("green*")
)
You lose fuzziness, but you gain the ability to perform prefix queries, which would give the results you want (green* would match greenhouse).
However, prefix queries are explicit: the user must add * after "green" in order to match "all words that start with green".
Which leads us to...
If you want this "prefix matching" behavior to be automatic, without the need to add * to the query, then what you need is a different analyzer.
Your current analyzer breaks down indexed text using spaces as separators (more or less; it's a bit more complex, but that's the idea). But you apparently want it to break down "greenhouse" into "green" and "house"; that's the only way a query on the word "green" would match the word "greenhouse".
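To see the problem, here is a rough plain-Java approximation of what your "standard" tokenizer plus "lowercase" filter do (ignoring asciiFolding and the tokenizer's finer rules):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class AnalyzerSketch {

    // Rough approximation of the "standard" tokenizer + "lowercase" filter:
    // split on non-word characters, drop empties, lowercase each token.
    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\W+"))
                .filter(t -> !t.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "GreENhouse" stays one single token, so the query term "green"
        // is never equal to any indexed token.
        System.out.println(analyze("TobaCcO GreENhouse")); // [tobacco, greenhouse]
    }
}
```

A match predicate only compares whole tokens, which is why "green" finds "TobaCcO GreEN" but not "TobaCcO GreENhouse".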
To do that, you can use an analyzer similar to yours, but with an additional "edgeNGram" filter, which generates additional indexed tokens for every prefix of your existing tokens. Add another analyzer to your configurer:
@Component("luceneTobaccoAnalysisConfigurer")
public class LuceneTobaccoAnalysisConfigurer implements LuceneAnalysisConfigurer {

    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.analyzer("name").custom()
                .tokenizer("standard")
                .tokenFilter("lowercase")
                .tokenFilter("asciiFolding");

        // THIS PART IS NEW
        context.analyzer("name_prefix").custom()
                .tokenizer("standard")
                .tokenFilter("lowercase")
                .tokenFilter("asciiFolding")
                .tokenFilter("edgeNGram")
                // Handling prefixes from 2 to 7 characters.
                // Prefixes of 1 character or more than 7 will not be matched.
                // You can extend the range, but this will take more space
                // in the index for little gain.
                .param("minGramSize", "2")
                .param("maxGramSize", "7");
    }
}
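To make the effect of that filter concrete, here is a small plain-Java sketch of what edge n-grams with minGramSize=2 and maxGramSize=7 produce for one token (assuming, as in the example output further down, that the full token is also kept in the index):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramSketch {

    // Emit the full token plus every prefix whose length lies in [min, max].
    static List<String> edgeNGrams(String token, int min, int max) {
        List<String> out = new ArrayList<>();
        out.add(token); // the original token stays searchable
        for (int len = min; len <= Math.min(max, token.length()); len++) {
            out.add(token.substring(0, len));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("greenhouse", 2, 7));
        // [greenhouse, gr, gre, gree, green, greenh, greenho]
    }
}
```

Note that "green" is now among the indexed tokens, so a query for "green" matches by exact token equality.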
And change your mapping to use the name analyzer when querying, but the name_prefix analyzer when indexing:
@Data
@Entity
@Indexed
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
@EqualsAndHashCode(of = "id")
@EntityListeners(AuditingEntityListener.class)
public class Tobacco {

    @Id
    @GeneratedValue
    private UUID id;

    @NotBlank
    @KeywordField
    private String barcode;

    @NotBlank
    // CHANGE THIS
    @FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
    private String name;

    @NotBlank
    // CHANGE THIS
    @FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
    private String manufacturer;

    @CreatedDate
    private Instant createdAt;

    @LastModifiedDate
    private Instant updatedAt;
}
Now reindex your data.
Now your query "green" will also match "TobaCcO GreENhouse", because "GreENhouse" was indexed as ["greenhouse", "gr", "gre", "gree", "green", "greenh", "greenho"].
Alternative: an edgeNGram filter on distinct fields

Instead of changing the analyzer of your current fields, you could add new fields for the same Java properties, using the new analyzer with the edgeNGram filter:
@Data
@Entity
@Indexed
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
@EqualsAndHashCode(of = "id")
@EntityListeners(AuditingEntityListener.class)
public class Tobacco {

    @Id
    @GeneratedValue
    private UUID id;

    @NotBlank
    @KeywordField
    private String barcode;

    @NotBlank
    @FullTextField(analyzer = "name")
    // ADD THIS
    @FullTextField(name = "name_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
    private String name;

    @NotBlank
    @FullTextField(analyzer = "name")
    // ADD THIS
    @FullTextField(name = "manufacturer_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
    private String manufacturer;

    @CreatedDate
    private Instant createdAt;

    @LastModifiedDate
    private Instant updatedAt;
}
Then you can target these fields as well as the normal ones in your query:
@Component
@AllArgsConstructor
public class IndexSearchTobaccoRepository {

    private final EntityManager entityManager;

    public List<Tobacco> find(String query) {
        return Search.session(entityManager)
                .search(Tobacco.class)
                .where(f -> f.match()
                        .fields("barcode", "name", "manufacturer").boost(2.0f)
                        .fields("name_prefix", "manufacturer_prefix")
                        .matching(query)
                        .fuzzy()
                )
                .fetchHits(10);
    }
}
As you can see, I added a boost to the fields that don't use prefixes. This is the main advantage of this variant over the one explained above: matches on actual words (not prefixes) are deemed more important, yielding a better score and thus pulling those documents to the top of the result list if you use a relevance sort (which is the default sort).
I won't detail it here, but there's another approach if all you want is to handle compound words ("greenhouse" => "green" + "house", "superman" => "super" + "man", etc.): you can use the "dictionaryCompoundWord" filter. This is less powerful, but it will generate less noise in your index (fewer meaningless tokens) and thus could lead to better relevance sorts. Another downside is that you need to provide the filter with a dictionary that contains all the words that could possibly be "compounded". For more information, see the source and javadoc of the class org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilterFactory, or the documentation of the equivalent filter in Elasticsearch.
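To illustrate the idea (not the actual Lucene implementation, which is more sophisticated), a naive sketch of dictionary-based compound decomposition might look like this:

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundSplitSketch {

    // Naive sketch: keep the original token, and additionally emit every
    // dictionary word that occurs inside it. The real Lucene filter uses
    // proper subword segmentation, but the principle is the same.
    static List<String> decompose(String token, List<String> dictionary) {
        List<String> out = new ArrayList<>();
        out.add(token);
        for (String word : dictionary) {
            if (!token.equals(word) && token.contains(word)) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The dictionary must list every word that could be a compound part.
        List<String> dict = List.of("green", "house");
        System.out.println(decompose("greenhouse", dict)); // [greenhouse, green, house]
    }
}
```

With "green" emitted as its own token at index time, a plain match query for "green" finds "greenhouse" without any prefix tokens.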