How do I write a LuceneFilter to normalize text

So I have my basic code:

public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");


private static String stripDiacritics(String str) {
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}

But how do I put this into a TokenFilter? I have used NormalizeCharMap before, but that is only good for modifying string literals. I am using Lucene 4.

You need to override the incrementToken() method and update the CharTermAttribute there:

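import java.io.IOException;
import java.text.Normalizer;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class DiacriticFilter extends TokenFilter {
    private static final Pattern DIACRITICS_AND_FRIENDS
            = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    // TokenFilter has no default constructor, so the wrapped stream is passed up.
    public DiacriticFilter(TokenStream input) {
        super(input);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Read only the valid part of the term; buffer() may hold stale chars past length().
            String result = stripDiacritics(termAtt.toString());
            // Replace the term text; append() also updates the term length.
            termAtt.setEmpty().append(result);
            return true;
        } else {
            return false;
        }
    }

    // Decompose to NFD, then drop combining marks, modifier letters and modifier symbols.
    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
}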
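
The normalization step itself does not depend on Lucene, so it can be checked in isolation before it is wired into a filter. The following is a minimal sketch using only the JDK; the class name StripDiacriticsDemo and the sample words are illustrative assumptions, not part of the original code.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical stand-alone harness for the stripDiacritics logic shown above.
final class StripDiacriticsDemo {
    static final Pattern DIACRITICS_AND_FRIENDS
            = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        // NFD splits a precomposed character into base letter + combining marks.
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        // Remove the combining marks (plus modifier letters and modifier symbols).
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of file encoding.
        System.out.println(stripDiacritics("caf\u00e9"));          // prints "cafe"
        System.out.println(stripDiacritics("na\u00efve"));         // prints "naive"
        System.out.println(stripDiacritics("\u00c5ngstr\u00f6m")); // prints "Angstrom"
    }
}
```

Note that the \p{IsSk} category (modifier symbols) also covers a few ASCII characters such as ^ and `, so with this pattern those would be stripped from terms as well. Whether that is desirable depends on the tokenizer in front of the filter.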
