How to add a phrase as a stopword while using a Lucene analyzer?
I am using the Lucene 4.6.1 library. I am trying to add the word "hip hop" to my stopword exclusion list.
If it is written as "hiphop" (one word) it gets excluded, but when it is written as "hip hop" (with a space in between) it does not.
Below is my exclusion-list logic -
public static final CharArraySet STOP_SET_STEM = new CharArraySet(LUCENE_VERSION, Arrays.asList(
        "hiphop", "hip hop"
), false);
For more detail, below is my custom analyzer logic -
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

public final class CustomWordsAnalyzer extends StopwordAnalyzerBase {
    private static final Version LUCENE_VERSION = Version.LUCENE_46;

    // Regex used to exclude non-alpha-numeric tokens
    private static final Pattern ALPHA_NUMERIC = Pattern.compile("^[a-z][a-z0-9_]+$");
    private static final Matcher MATCHER = ALPHA_NUMERIC.matcher("");

    public CustomWordsAnalyzer() {
        super(LUCENE_VERSION, ProTextWordLists.STOP_SET);
    }

    public CustomWordsAnalyzer(CharArraySet stopSet) {
        super(LUCENE_VERSION, stopSet);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tokenizer = new StandardTokenizer(LUCENE_VERSION, reader);
        TokenStream result = new StandardFilter(LUCENE_VERSION, tokenizer);
        result = new LowerCaseFilter(LUCENE_VERSION, result);
        result = new ASCIIFoldingFilter(result);
        result = new AlphaNumericMaxLengthFilter(result);
        result = new StopFilter(LUCENE_VERSION, result, ProTextWordLists.STOP_SET);
        result = new PorterStemFilter(result);
        result = new StopFilter(LUCENE_VERSION, result, ProTextWordLists.STOP_SET_STEM);
        return new TokenStreamComponents(tokenizer, result);
    }

    /**
     * Matches alpha-numeric tokens between 3 and 28 chars long.
     */
    static class AlphaNumericMaxLengthFilter extends TokenFilter {
        private final CharTermAttribute termAtt;
        private final char[] output = new char[28];

        AlphaNumericMaxLengthFilter(TokenStream in) {
            super(in);
            termAtt = addAttribute(CharTermAttribute.class);
        }

        @Override
        public final boolean incrementToken() throws IOException {
            // return the next alpha-numeric token between 3 and 28 characters long
            while (input.incrementToken()) {
                int length = termAtt.length();
                if (length >= 3 && length <= 28) {
                    char[] buf = termAtt.buffer();
                    int at = 0;
                    for (int c = 0; c < length; c++) {
                        char ch = buf[c];
                        if (ch != '\'') {
                            output[at++] = ch;
                        }
                    }
                    String term = new String(output, 0, at);
                    MATCHER.reset(term);
                    if (MATCHER.matches() && !term.startsWith("a0")) {
                        termAtt.setEmpty();
                        termAtt.append(term);
                        return true;
                    }
                }
            }
            return false;
        }
    }
}
This cannot be done with the default Lucene implementation. The only way is to create your own Analyzer and/or TokenStream that processes the data/queries the way you need (e.g., filters out phrases).
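The underlying issue is that StandardTokenizer splits on whitespace, so StopFilter only ever compares single tokens ("hip", then "hop") against the set; the entry "hip hop" can never match. One common workaround is to collapse each multi-word stop phrase into its one-word form before tokenization, so an ordinary single-token stop set catches it afterwards. The sketch below illustrates the idea in plain Java without Lucene (the helper names and the whitespace split are simplifications, not Lucene API); in a real chain the same replacement could be done ahead of the tokenizer, for example with Lucene's MappingCharFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PhraseStopwordDemo {

    // Hypothetical helper: rewrite each multi-word stop phrase into a
    // single token (e.g. "hip hop" -> "hiphop") before tokenization,
    // so a plain single-token stop set can remove it afterwards.
    static String collapsePhrases(String text, List<String> phrases) {
        String result = text.toLowerCase(Locale.ROOT);
        for (String phrase : phrases) {
            result = result.replace(phrase, phrase.replace(" ", ""));
        }
        return result;
    }

    // Simplified stand-in for tokenizer + StopFilter: split on
    // whitespace and drop any token found in the stop set.
    static List<String> tokenizeAndStop(String text, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            if (!tok.isEmpty() && !stopSet.contains(tok)) {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopSet = new HashSet<>(Arrays.asList("hiphop"));
        List<String> phrases = Arrays.asList("hip hop");

        String input = "I love hip hop and hiphop music";
        String normalized = collapsePhrases(input, phrases);
        System.out.println(tokenizeAndStop(normalized, stopSet));
        // prints [i, love, and, music]
    }
}
```

Both spellings are now removed, because "hip hop" is normalized to "hiphop" before the stop check ever runs.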