Phone number synonyms filter/tokenizer in Solr?

Question

I'm trying to make Solr find phone numbers that are stored like +79876543210 using queries such as these:

+79876543210
 79876543210
 89876543210  <-- '+7' is replaced with region specific code '8'
  9876543210  <-- '+7' entirely removed

This is just an example. Another one is landline phone numbers:

+78662123456  <-- '+78662' is a specific region code
 78662123456
 88662123456
  8662123456
      123456  <-- region code entirely removed

One way I could manage this is to use a separate field that is filled with these variants and used solely for searching. But this has issues with highlighting: it returns <em>123456</em> as the highlighted text, whereas the real value shown to the user is +78662123456. I thought it might be best to build these index entries using just Solr, but how?
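For reference, the variants that such a separate field would hold can be generated with a few prefix rules. The following is only a minimal sketch in plain Java: the class name `PhoneVariants`, the `RULES` table and the `variants` method are hypothetical names chosen for illustration, and the rules simply mirror the examples listed in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class PhoneVariants {
    // Hypothetical rule table: {literal prefix of the stored number, replacement}.
    private static final String[][] RULES = {
        {"+7", "7"},      // drop the plus sign
        {"+7", "8"},      // region-specific trunk code '8'
        {"+7", ""},       // country code removed entirely
        {"+78662", ""},   // region code removed entirely
    };

    /** Returns the stored number followed by every variant produced by a matching rule. */
    static List<String> variants(String stored) {
        Set<String> out = new LinkedHashSet<>(); // keeps insertion order, drops duplicates
        out.add(stored);
        for (String[] rule : RULES) {
            if (stored.startsWith(rule[0])) {
                out.add(rule[1] + stored.substring(rule[0].length()));
            }
        }
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        // prints [+78662123456, 78662123456, 88662123456, 8662123456, 123456]
        System.out.println(variants("+78662123456"));
    }
}
```

Indexing these strings into a second field makes every variant searchable, but the highlighter then works on that second field's content, which is where the mismatch described above comes from.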

My first thought was to use the managed synonyms filter and pass the synonyms along with each added record. But the docs explicitly state:

Changes made to managed resources via this REST API are not applied to the active Solr components until the Solr collection (or Solr core in single server mode) is reloaded.

So reloading a core after every added record is not the way to go. Another issue is keeping these synonyms up to date with the records.

Could there be another way to solve this?

Answer 1

Thanks to this comment (by MatsLindh) I've managed to assemble a simple filter based on the built-in EdgeNGramTokenFilter:

package com.step4;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

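// Emits every phone number variant produced by a matching pattern, followed by
// the original token, all stacked at the same position (like synonyms).
public class ReverseCustomFilter extends TokenFilter {
    // Each pair is tried in order; the first match of the pattern in the token is replaced.
    private static final PatternReplacementPair[] phonePatterns = {
            new PatternReplacementPair("\\+7", "7"),
            new PatternReplacementPair("\\+7", "8"),
            new PatternReplacementPair("\\+7", ""),
            new PatternReplacementPair("\\+78662", ""),
            new PatternReplacementPair("\\+78663", ""),
    };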

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private int curPatternIndex;
    private int curPosIncr;
    private State curState;

    public ReverseCustomFilter(TokenStream input) {
        super(input);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        while (true) {
            if (curPatternIndex == 0) {
                if (!input.incrementToken()) {
                    return false;
                }

                curState = captureState();
                curPosIncr += posIncrAtt.getPositionIncrement();
                curPatternIndex = 1;
            }

            if (curPatternIndex <= phonePatterns.length) {
                PatternReplacementPair replacementPair = phonePatterns[curPatternIndex - 1];
                curPatternIndex++;

                restoreState(curState);
                Matcher matcher = replacementPair.getPattern().matcher(termAtt);
                if (matcher.find()) {
                    posIncrAtt.setPositionIncrement(curPosIncr);
                    curPosIncr = 0;

                    String replaced = matcher.replaceFirst(replacementPair.getReplacement());
                    termAtt.setEmpty().append(replaced);

                    return true;
                }
            }
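            else {
                // All patterns tried: emit the original token last.
                restoreState(curState);
                // curPosIncr is already 0 if a variant was emitted for this token;
                // otherwise the original token must carry the pending increment
                // instead of being stacked at increment 0.
                posIncrAtt.setPositionIncrement(curPosIncr);
                curPosIncr = 0;

                curPatternIndex = 0;
                return true;
            }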
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        curPatternIndex = 0;
        curPosIncr = 0;
    }

    @Override
    public void end() throws IOException {
        super.end();
        posIncrAtt.setPositionIncrement(curPosIncr);
    }

    private static class PatternReplacementPair {
        private final Pattern pattern;
        private final String replacement;
        public PatternReplacementPair(String pattern, String replacement) {
            this.pattern = Pattern.compile(pattern);
            this.replacement = replacement;
        }

        public Pattern getPattern() {
            return pattern;
        }

        public String getReplacement() {
            return replacement;
        }
    }
}
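To see what the filter emits for one stored number, its loop can be traced without Lucene on the classpath. This is only an illustrative sketch: `FilterTrace` and `trace` are hypothetical names, and it models a single token that matches at least one pattern. It reuses the same pattern/replacement pairs and reports each emitted term together with its position increment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FilterTrace {
    // Same pattern/replacement pairs as phonePatterns in the filter.
    private static final String[][] PAIRS = {
        {"\\+7", "7"}, {"\\+7", "8"}, {"\\+7", ""},
        {"\\+78662", ""}, {"\\+78663", ""},
    };

    /** Returns "term/positionIncrement" entries in emission order for one input token. */
    static List<String> trace(String term, int incomingPosIncr) {
        List<String> emitted = new ArrayList<>();
        int pending = incomingPosIncr; // plays the role of curPosIncr
        for (String[] pair : PAIRS) {
            Matcher m = Pattern.compile(pair[0]).matcher(term);
            if (m.find()) {
                // The first emitted variant carries the increment, later ones are stacked.
                emitted.add(m.replaceFirst(pair[1]) + "/" + pending);
                pending = 0;
            }
        }
        // The original term comes last; after any variant it is stacked at increment 0.
        emitted.add(term + "/" + pending);
        return emitted;
    }

    public static void main(String[] args) {
        // prints [78662123456/1, 88662123456/0, 8662123456/0, 123456/0, +78662123456/0]
        System.out.println(trace("+78662123456", 1));
    }
}
```

Because every variant shares the position of the original token, the variants behave like index-time synonyms within the same field, so no separate search-only field is needed.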
