How do I write a LuceneFilter to normalize text?
So I have my basic code:
public static final Pattern DIACRITICS_AND_FRIENDS
    = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

private static String stripDiacritics(String str) {
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}
But how do I put this into a TokenFilter? I have used NormalizeCharMap before, but that is only good for replacing string literals. I am using Lucene 4.
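Before wiring anything into Lucene, the helper from the question can be tried on its own. The following is a minimal stand-alone sketch that uses only the JDK; the class name `DiacriticsDemo` and the sample words are assumptions made for illustration and do not come from the original post.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class: exercises the question's helper without Lucene.
final class DiacriticsDemo {
    // Same pattern as in the question: combining marks, modifier letters (Lm),
    // and modifier symbols (Sk).
    static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        // NFD splits an accented letter into base letter + combining mark(s).
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        // The combining marks are then removed, leaving the base letters.
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("caf\u00e9"));          // prints "cafe"
        System.out.println(stripDiacritics("\u00c5ngstr\u00f6m")); // prints "Angstrom"
    }
}
```

Because the pattern also matches the Sk category, stand-alone characters such as `^` and the backtick are removed as well, which may or may not be what you want for your index.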
You need to override the incrementToken() method and update the CharTermAttribute in it:
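import java.io.IOException;
import java.text.Normalizer;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class DiacriticFilter extends TokenFilter {
    private static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    // TokenFilter has no no-arg constructor; the wrapped stream must be passed up.
    public DiacriticFilter(TokenStream input) {
        super(input);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Only the first length() chars of buffer() are valid, so read the term via toString().
            String result = stripDiacritics(termAtt.toString());
            char[] newBuffer = result.toCharArray();
            // copyBuffer() also sets the term length, so no separate setLength() call is needed.
            termAtt.copyBuffer(newBuffer, 0, newBuffer.length);
            return true;
        } else {
            return false;
        }
    }

    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
}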
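One detail worth keeping in mind with a filter like this: `CharTermAttribute.buffer()` returns the attribute's internal array, which is reused from token to token and can be longer than the current term. Only the first `length()` characters are valid. The sketch below imitates that situation with a plain `char[]` and no Lucene at all, to show why building a String from the whole array can pick up leftover characters. `TermBufferDemo` and its methods are hypothetical names chosen for this illustration.

```java
// Hypothetical demo class: a plain char[] stands in for a reused term buffer.
final class TermBufferDemo {
    // Builds a 16-char buffer that first held a long token and then a short one.
    static char[] reusedBuffer() {
        char[] buffer = new char[16];
        "normalization".getChars(0, 13, buffer, 0); // earlier, longer token
        "caf\u00e9".getChars(0, 4, buffer, 0);      // current token overwrites 4 chars only
        return buffer;
    }

    // What new String(buffer) does: takes every char, valid or not.
    static String wholeBuffer(char[] buffer) {
        return new String(buffer);
    }

    // Takes only the valid prefix, as a length-aware read would.
    static String validPart(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        char[] buffer = reusedBuffer();
        int length = 4; // length of the current token
        System.out.println(wholeBuffer(buffer).length()); // prints 16
        System.out.println(validPart(buffer, length));    // prints "café"
    }
}
```

In a real filter the same effect is achieved by reading the term with `termAtt.toString()` or `new String(termAtt.buffer(), 0, termAtt.length())` rather than `new String(termAtt.buffer())`.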