Hibernate Search with Lucene does not index similar names correctly
I'm learning Hibernate Search 6.1.3.Final with Lucene 8.11.1 as the backend and Spring Boot 2.6.6. I'm trying to create a search for product names, barcodes and manufacturers. Currently, I'm writing an integration test to see what happens when a couple of products have similar names:
@Test
void shouldFindSimilarTobaccosByQuery() {
    var tobaccoGreen = TobaccoBuilder.builder()
            .name("TobaCcO GreEN")
            .build();
    var tobaccoRed = TobaccoBuilder.builder()
            .name("TobaCcO ReD")
            .build();
    var tobaccoGreenhouse = TobaccoBuilder.builder()
            .name("TobaCcO GreENhouse")
            .build();

    tobaccoRepository.saveAll(List.of(tobaccoGreen, tobaccoRed, tobaccoGreenhouse));

    webTestClient
            .get().uri("/tobaccos?query=green")
            .exchange()
            .expectStatus().isOk()
            .expectBodyList(Tobacco.class)
            .value(tobaccos -> assertThat(tobaccos)
                    .hasSize(2)
                    .contains(tobaccoGreen, tobaccoGreenhouse)
            );
}
As you can see in the test, I expect to obtain the two tobaccos with similar names, tobaccoGreen and tobaccoGreenhouse, by using green as the query for the search criteria. The entity is the following:
@Data
@Entity
@Indexed
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
@EqualsAndHashCode(of = "id")
@EntityListeners(AuditingEntityListener.class)
public class Tobacco {

    @Id
    @GeneratedValue
    private UUID id;

    @NotBlank
    @KeywordField
    private String barcode;

    @NotBlank
    @FullTextField(analyzer = "name")
    private String name;

    @NotBlank
    @FullTextField(analyzer = "name")
    private String manufacturer;

    @CreatedDate
    private Instant createdAt;

    @LastModifiedDate
    private Instant updatedAt;
}
I have followed the docs and configured an analyzer for names:
@Component("luceneTobaccoAnalysisConfigurer")
public class LuceneTobaccoAnalysisConfigurer implements LuceneAnalysisConfigurer {

    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.analyzer("name").custom()
                .tokenizer("standard")
                .tokenFilter("lowercase")
                .tokenFilter("asciiFolding");
    }
}
And I'm using a simple query with the fuzzy option:
@Component
@AllArgsConstructor
public class IndexSearchTobaccoRepository {

    private final EntityManager entityManager;

    public List<Tobacco> find(String query) {
        return Search.session(entityManager)
                .search(Tobacco.class)
                .where(f -> f.match()
                        .fields("barcode", "name", "manufacturer")
                        .matching(query)
                        .fuzzy()
                )
                .fetch(10)
                .hits();
    }
}
The test shows that the search is only able to find tobaccoGreen but not tobaccoGreenhouse, and I don't understand why. How can I search for similar product names (or barcodes, manufacturers)?
Before I answer your question, I'd like to point out that calling .fetch(10).hits() is suboptimal, especially when using the default sort (as you do):
return Search.session(entityManager)
        .search(Tobacco.class)
        .where(f -> f.match()
                .fields("barcode", "name", "manufacturer")
                .matching(query)
                .fuzzy()
        )
        .fetch(10)
        .hits();
If you call .fetchHits(10) directly, Lucene can skip part of the search (the part where it counts the total hit count), and on large indexes this can lead to sizeable performance gains. So, do this instead:
return Search.session(entityManager)
        .search(Tobacco.class)
        .where(f -> f.match()
                .fields("barcode", "name", "manufacturer")
                .matching(query)
                .fuzzy()
        )
        .fetchHits(10);
Now, the actual answer:
.fuzzy() isn't magic; it won't just match anything you think should match :) There's a specific definition of what it does, and that's not what you want here.
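Concretely, fuzzy matching is based on edit distance: by default it only matches terms within a Levenshtein distance of at most 2 from the query term. A standalone sketch (plain Java, no Lucene) shows why "greenhouse" is far outside that range for the query "green":

```java
public class FuzzyDistanceDemo {

    // Classic dynamic-programming Levenshtein (edit) distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // 1 edit away: a fuzzy query for "green" would match this.
        System.out.println(levenshtein("green", "greed"));      // 1
        // 5 edits away: far beyond the default maximum of 2.
        System.out.println(levenshtein("green", "greenhouse")); // 5
    }
}
```

So fuzzy matching is meant to tolerate typos, not to match prefixes or compound words.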
To get the behavior you want, you could use this instead of your current predicate:
.where(f -> f.simpleQueryString()
        .fields("barcode", "name", "manufacturer")
        .matching("green*")
)
You lose fuzziness, but you gain the ability to perform prefix queries, which would give the results you want (green* would match greenhouse).
However, prefix queries are explicit: the user must add * after "green" in order to match "all words that start with green".
Which leads us to...
If you want this "prefix matching" behavior to be automatic, without the need to add * to the query, then what you need is a different analyzer.
Your current analyzer breaks down indexed text using spaces as separators (more or less; it's a bit more complex, but that's the idea). But you apparently want it to break down "greenhouse" into "green" and "house"; that's the only way a query on the word "green" would match the word "greenhouse".
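To see the problem, here is a rough plain-Java approximation of what your "standard" tokenizer plus "lowercase" filter do (ignoring asciiFolding and the tokenizer's finer rules):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class AnalyzerSketch {

    // Rough approximation of the "standard" tokenizer + "lowercase" filter:
    // split on non-word characters, drop empties, lowercase each token.
    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\W+"))
                .filter(t -> !t.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "GreENhouse" stays one single token, so the query term "green"
        // is never equal to any indexed token.
        System.out.println(analyze("TobaCcO GreENhouse")); // [tobacco, greenhouse]
    }
}
```

A match predicate only compares whole tokens, which is why "green" finds "TobaCcO GreEN" but not "TobaCcO GreENhouse".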
To do that, you can use an analyzer similar to yours, but with an additional "edgeNGram" filter, which generates additional indexed tokens for every prefix of your existing tokens. Add another analyzer to your configurer:
@Component("luceneTobaccoAnalysisConfigurer")
public class LuceneTobaccoAnalysisConfigurer implements LuceneAnalysisConfigurer {

    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        context.analyzer("name").custom()
                .tokenizer("standard")
                .tokenFilter("lowercase")
                .tokenFilter("asciiFolding");

        // THIS PART IS NEW
        context.analyzer("name_prefix").custom()
                .tokenizer("standard")
                .tokenFilter("lowercase")
                .tokenFilter("asciiFolding")
                .tokenFilter("edgeNGram")
                // Handling prefixes from 2 to 7 characters.
                // Prefixes of 1 character or more than 7 will not be matched.
                // You can extend the range, but this will take more space
                // in the index for little gain.
                .param("minGramSize", "2")
                .param("maxGramSize", "7");
    }
}
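To make the effect of that filter concrete, here is a small plain-Java sketch of what edge n-grams with minGramSize=2 and maxGramSize=7 produce for one token (assuming, as in the example output further down, that the full token is also kept in the index):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramSketch {

    // Emit the full token plus every prefix whose length lies in [min, max].
    static List<String> edgeNGrams(String token, int min, int max) {
        List<String> out = new ArrayList<>();
        out.add(token); // the original token stays searchable
        for (int len = min; len <= Math.min(max, token.length()); len++) {
            out.add(token.substring(0, len));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("greenhouse", 2, 7));
        // [greenhouse, gr, gre, gree, green, greenh, greenho]
    }
}
```

Note that "green" is now among the indexed tokens, so a query for "green" matches by exact token equality.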
And change your mapping to use the name analyzer when querying, but the name_prefix analyzer when indexing:
@Data
@Entity
@Indexed
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
@EqualsAndHashCode(of = "id")
@EntityListeners(AuditingEntityListener.class)
public class Tobacco {

    @Id
    @GeneratedValue
    private UUID id;

    @NotBlank
    @KeywordField
    private String barcode;

    @NotBlank
    // CHANGE THIS
    @FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
    private String name;

    @NotBlank
    // CHANGE THIS
    @FullTextField(analyzer = "name_prefix", searchAnalyzer = "name")
    private String manufacturer;

    @CreatedDate
    private Instant createdAt;

    @LastModifiedDate
    private Instant updatedAt;
}
Now reindex your data.
Now your query "green" will also match "TobaCcO GreENhouse", because "GreENhouse" was indexed as ["greenhouse", "gr", "gre", "gree", "green", "greenh", "greenho"].
Alternative: an edgeNGram filter on distinct fields

Instead of changing the analyzer of your current fields, you could add new fields for the same Java properties, using the new analyzer with the edgeNGram filter:
@Data
@Entity
@Indexed
@NoArgsConstructor
@AllArgsConstructor
@Builder(toBuilder = true)
@EqualsAndHashCode(of = "id")
@EntityListeners(AuditingEntityListener.class)
public class Tobacco {

    @Id
    @GeneratedValue
    private UUID id;

    @NotBlank
    @KeywordField
    private String barcode;

    @NotBlank
    @FullTextField(analyzer = "name")
    // ADD THIS
    @FullTextField(name = "name_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
    private String name;

    @NotBlank
    @FullTextField(analyzer = "name")
    // ADD THIS
    @FullTextField(name = "manufacturer_prefix", analyzer = "name_prefix", searchAnalyzer = "name")
    private String manufacturer;

    @CreatedDate
    private Instant createdAt;

    @LastModifiedDate
    private Instant updatedAt;
}
Then you can target these fields as well as the normal ones in your query:
@Component
@AllArgsConstructor
public class IndexSearchTobaccoRepository {

    private final EntityManager entityManager;

    public List<Tobacco> find(String query) {
        return Search.session(entityManager)
                .search(Tobacco.class)
                .where(f -> f.match()
                        .fields("barcode", "name", "manufacturer").boost(2.0f)
                        .fields("name_prefix", "manufacturer_prefix")
                        .matching(query)
                        .fuzzy()
                )
                .fetchHits(10);
    }
}
As you can see, I added a boost to the fields that don't use prefixes. This is the main advantage of this variant over the one explained above: matches on actual words (not prefixes) are deemed more important, yielding a better score and thus pulling those documents to the top of the result list if you use a relevance sort (which is the default sort).
I won't detail it here, but there's another approach if all you want is to handle compound words ("greenhouse" => "green" + "house", "superman" => "super" + "man", etc.): you can use the "dictionaryCompoundWord" filter. This is less powerful, but it will generate less noise in your index (fewer meaningless tokens) and thus could lead to better relevance sorts. Another downside is that you need to provide the filter with a dictionary that contains all the words that could possibly be "compounded". For more information, see the source and javadoc of the class org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilterFactory, or the documentation of the equivalent filter in Elasticsearch.
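To illustrate the idea (not the actual Lucene implementation, which is more sophisticated), a naive sketch of dictionary-based compound decomposition might look like this:

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundSplitSketch {

    // Naive sketch: keep the original token, and additionally emit every
    // dictionary word that occurs inside it. The real Lucene filter uses
    // proper subword segmentation, but the principle is the same.
    static List<String> decompose(String token, List<String> dictionary) {
        List<String> out = new ArrayList<>();
        out.add(token);
        for (String word : dictionary) {
            if (!token.equals(word) && token.contains(word)) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The dictionary must list every word that could be a compound part.
        List<String> dict = List.of("green", "house");
        System.out.println(decompose("greenhouse", dict)); // [greenhouse, green, house]
    }
}
```

With "green" emitted as its own token at index time, a plain match query for "green" finds "greenhouse" without any prefix tokens.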