繁体   English   中英

如何在Lucene中索引/搜索+和-符号?

[英]How can I index/search + and - symbol in lucene?

我需要搜索单词“ I + D”,而我的分析仪不能使用+ (加号)和- (减号)符号。 如何搜寻?

我的个人分析器:

 /**
 * Copyright (c) 2006 Hugo Zaragoza and Jose R. P�rez-Ag�era
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of copyright holders nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
 * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL COPYRIGHT HOLDERS OR CONTRIBUTORS
 * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Set;


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Spanish Lucene analyzer
 * @author Hugo Zaragoza and Jose R. P�rez-Ag�era
 */
public class SpanishAnalyzer extends Analyzer {

    private Set stopSet;

    /**
     * Creates the Lucene Spanish Analyzer
     * @throws IOException
     */
    public SpanishAnalyzer() throws IOException {
        super();
        stopSet = StopFilter.makeStopSet(loadStopWords());
    }

    /** Constructs a {@link StandardTokenizer} filtered by a {@link
    StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, stopSet);
        result = new SpanishStemmerFilter(result);
        return result;
    }

    /**
     * Loads the spanish stop-words list
     * @throws IOException
     */
    private static String[] loadStopWords() throws IOException {

        InputStream inputStream = new FileInputStream("stopwords-spanish.txt");
        //InputStream inputStream = new FileInputStream("/home/becario/Escritorio/CVTKAxel/lib/stopwords-spanish.txt");
        Reader reader = new InputStreamReader(inputStream);
        BufferedReader br = new BufferedReader(reader);
        String line = br.readLine();
        ArrayList<String> list = new ArrayList<String>();
        while (line != null) {
            list.add(line.trim());
            line = br.readLine();
        }
        String stopWords[] = new String[list.toArray().length];
        for (int i = 0; i < list.toArray().length; i++) {
            stopWords[i] = (String) list.get(i);
        }

        return stopWords;
    }
}

“不起作用”是什么意思? 分析仪应能够正常处理这些字符。 您是指QueryParser吗? 如果是这样,您可以绕过它并手动创建一个查询,例如TermQuery

Query q = new TermQuery(new Term("field", "I+D"));

还是您指的是StandardTokenizer在非单词字符(例如'+'或'-')上拆分令牌的事实? 如果是这样,您可以简单地使用其他变量(例如WhitespaceTokenizer )或实现自己的变量。

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM