How to migrate a custom TokenFilter from Lucene.Net 3.0.3 to 4.8

Question

I have the following custom TokenFilter written for Lucene.Net 3.0.3, and I need to migrate it to Lucene.Net 4.8:

public sealed class AccentFoldingFilter : TokenFilter
{
    private ITermAttribute termAttribute;

    public AccentFoldingFilter(TokenStream input) : base(input)
    {
        termAttribute = this.input.GetAttribute<ITermAttribute>();
    }

    public override bool IncrementToken()
    {
        if (this.input.IncrementToken())
        {
            termAttribute.SetTermBuffer(termAttribute.Term.RemoveDiacritics());
            return true;
        }
        return false;
    }
}

ITermAttribute no longer exists. I think I need to use ICharTermAttribute, but I don't know how.

How can I do the same thing in 4.8?

For reference, here is the RemoveDiacritics extension method:

public static string RemoveDiacritics(this string text)
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}

Answer 1

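Although you can use the answer below, note that Lucene.NET 4.8.0 ships with ICUNormalizer2Filter, ICUNormalizer2CharFilter and ICUFoldingFilter. However, you may still prefer to keep your existing solution rather than pull in a 20MB+ dependency (ICU4N).

To do the translation, you need to add the ICharTermAttribute directly to your filter (rather than on the TokenStream). The attribute is pulled from the token stream's shared context by calling GetAttribute<ICharTermAttribute>(). Note that GetAttribute only succeeds if an earlier component in the chain (normally the tokenizer) has already registered the attribute; AddAttribute<ICharTermAttribute>() returns the existing instance or registers a new one, so it is the safer choice if you are unsure.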

public sealed class AccentFoldingFilter : TokenFilter
{
    private ICharTermAttribute termAttribute;

    public AccentFoldingFilter(TokenStream input) : base(input)
    {
        termAttribute = this.GetAttribute<ICharTermAttribute>();
    }

    public override bool IncrementToken()
    {
        // In 4.8 the protected field holding the wrapped stream is m_input (it was input in 3.0.3).
        if (this.m_input.IncrementToken())
        {
            string buffer = termAttribute.ToString().RemoveDiacritics();
            termAttribute.SetEmpty().Append(buffer);
            return true;
        }
        return false;
    }
}

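Also, the RemoveDiacritics implementation does not account for surrogate pairs, which can lead to hard-to-diagnose bugs. Here is a version that handles them: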

public static string RemoveDiacritics(this string text)
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    int inputLength = normalizedString.Length;
    char[] buffer = new char[inputLength];

    // TODO: If the strings are short (less than 256 chars),
    // consider using this (must be unsafe context)

    // char* buffer = stackalloc char[inputLength];

    int bufferLength = 0;

    for (int i = 0; i < inputLength;)
    {
        // Handle surrogate pairs
        int charCount = char.IsHighSurrogate(normalizedString, i)
            && i < inputLength - 1
            && char.IsLowSurrogate(normalizedString, i + 1) ? 2 : 1;

        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(normalizedString, i);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            buffer[bufferLength++] = normalizedString[i]; // high surrogate / BMP char
            if (charCount == 2)
            {
                buffer[bufferLength++] = normalizedString[i + 1]; // low surrogate
            }
        }
        i += charCount;
    }

    return new string(buffer, 0, bufferLength).Normalize(NormalizationForm.FormC);
}
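
To make the surrogate-pair point concrete, here is a small standalone sketch (not part of the filter itself) that puts the two approaches side by side. The class and method names are made up for this illustration, and the surrogate-aware variant is a condensed rewrite using StringBuilder rather than the char[] buffer above. The test input is "a" followed by U+1D17B (MUSICAL SYMBOL COMBINING ACCENT), a nonspacing mark that lies outside the BMP and is therefore stored as a surrogate pair. Assuming the runtime's Unicode tables classify it that way, the per-char version should leave the mark in place, because each half of the pair is reported as a surrogate, while the per-code-point version should strip it.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class DiacriticsDemo
{
    // Original approach: inspects one UTF-16 code unit at a time.
    public static string RemoveDiacriticsPerChar(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Surrogate-aware approach: inspects whole code points.
    public static string RemoveDiacriticsPerCodePoint(string text)
    {
        string s = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length;)
        {
            // A valid high/low pair occupies two chars; everything else occupies one.
            int count = char.IsSurrogatePair(s, i) ? 2 : 1;
            if (CharUnicodeInfo.GetUnicodeCategory(s, i) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(s, i, count);
            }
            i += count;
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        string bmp = "Crème brûlée";
        string supplementary = "a\U0001D17B"; // 'a' + a nonspacing mark outside the BMP

        // Both versions agree on ordinary accented text.
        Console.WriteLine(RemoveDiacriticsPerChar(bmp));
        Console.WriteLine(RemoveDiacriticsPerCodePoint(bmp));

        // They can differ once a mark is encoded as a surrogate pair.
        Console.WriteLine(RemoveDiacriticsPerChar(supplementary).Length);
        Console.WriteLine(RemoveDiacriticsPerCodePoint(supplementary).Length);
    }
}
```

If the two lengths printed at the end differ, the per-char version has silently kept a mark it was supposed to remove, which is the kind of bug that is hard to spot in an index.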

Solution 1
1 Accepted 2021-07-15 22:10:28
