Lucene wild card search

How can I perform a wildcard search in Lucene?
I have the text: "1997_titanic"
If I search for "1997_titanic", it returns a result, but the following two searches do not work:

1) If I search with only "1997", it does not return any results.
2) Also, if the name contains a space, as in "spider man", it does not find any results.

I retrieve all movie information from a DB and store it in Lucene Documents:

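public Document createMovieDoc(Movie m){
    Document document = new Document();
    // Stored only, so the original name can be read back from a hit.
    document.add(new StoredField("moviename", m.getName()));
    // Analyzed and indexed for searching, but not stored.
    TextField field = new TextField("movienameSearch", m.getName().toLowerCase(), Store.NO);
    field.setBoost(5.0f);
    document.add(field);
    return document;
}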

And to search, I have this method:

public List searh(String txt){ 
    PhraseQuery phQuery= new PhraseQuery();
    Term term = new Term("movienameSearch", txt.toLowerCase());

    BooleanQuery b = new BooleanQuery();
    b.add(phQuery, Occur.SHOULD);

    TopFieldDocs tp= searcher.search(b, 20, ..);
    for(int i=0;i<tp.length;i++)      
    {
        int mId = tp[i].doc;
        Document d = searcher.doc(mId);
        String moviename = d.get("moviename");

        list.add(moviename);
    }
    return list;
}

I'm not sure what analyzer you are using to index. Sounds like maybe WhitespaceAnalyzer? It sounds like, when indexing, "1997_titanic" remains a single token, while "spider man" is split into the tokens "spider" and "man".

It could also be SimpleAnalyzer, which uses a LetterTokenizer. That would make it impossible to search for "1997", since that tokenizer drops all digits from the indexed representation of the text.
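
To make the difference concrete, here is a rough plain-JDK approximation of how those two tokenizers treat the two movie names. This is not Lucene code: the class and method names are made up for illustration, and the real tokenizers handle more cases (the real SimpleAnalyzer also lowercases).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-JDK approximation of two Lucene tokenizers (hypothetical names, not Lucene API).
class TokenizerSketch {

    // Roughly WhitespaceTokenizer: split on whitespace and nothing else.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Roughly LetterTokenizer: keep maximal runs of letters, drop everything else.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("1997_titanic")); // [1997_titanic]
        System.out.println(whitespaceTokens("spider man"));   // [spider, man]
        System.out.println(letterTokens("1997_titanic"));     // [titanic]
    }
}
```

With whitespace splitting, a query for "1997" has no matching token because the only indexed token is "1997_titanic". With letter-only splitting, the digits never reach the index at all.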

Your search method doesn't look right. You aren't adding any terms to your PhraseQuery, so I wouldn't expect it to find anything. You must add some terms in order for anything to be found. You create a Term in the code you've provided, but nothing is ever done with that Term. Maybe this has something to do with how you've picked your excerpts? Not sure, I'm a bit confused by that.

In order to manually construct a PhraseQuery you must add each term individually, so to search for "spider man", you would do something like:

PhraseQuery phQuery= new PhraseQuery();
phQuery.add(new Term("movienameSearch", "spider"));
phQuery.add(new Term("movienameSearch", "man"));

This requires you to know what the analyzer was doing at index time, and to tokenize the input yourself to match. The simpler solution is to just use the QueryParser:

// With whatever analyzer you like to use.
QueryParser parser = new QueryParser(Version.LUCENE_46, "defaultField", analyzer);
Query query = parser.parse("movienameSearch:\"" + txt.toLowerCase() + "\"");
// search(Query, int) returns TopDocs rather than TopFieldDocs.
TopDocs tp = searcher.search(query, 20);

This allows you to rely on the same analyzer for indexing and querying, so you don't have to know how to tokenize your phrases to match.
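
One thing to watch when building the query string by concatenation: if the user's text contains a double quote or a backslash, the phrase is cut short or fails to parse. Below is a minimal plain-JDK sketch of escaping just those two characters before wrapping the text in quotes. The class and method names are hypothetical; Lucene's own QueryParser.escape(String) covers the full set of special characters and is probably the better choice in real code.

```java
import java.util.Locale;

// Hypothetical helper: builds a quoted phrase query string for one field.
class PhraseQueryString {

    static String phrase(String field, String text) {
        String escaped = text.toLowerCase(Locale.ROOT)
                .replace("\\", "\\\\")   // escape backslashes first
                .replace("\"", "\\\"");  // then escape embedded double quotes
        return field + ":\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(phrase("movienameSearch", "Spider Man"));
    }
}
```

The returned string would then be handed to parser.parse(...) in place of the hand-concatenated one.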

As far as finding "1997" and "titanic" individually, I would recommend just using StandardAnalyzer. It breaks text into discrete tokens that can be searched very easily with a simple query like: movienameSearch:1997. One caveat worth checking against your index: as far as I know, its tokenizer follows the Unicode word-break rules, which do not break on an underscore, so "1997_titanic" may still come out as a single token unless the underscore is replaced by a space first.
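
If you would rather not depend on how a particular analyzer treats the underscore, one option is to normalize the name yourself before it reaches the analyzer, at both index and query time. A small sketch of that idea, assuming underscores in your data are only ever word separators (the class and method names are hypothetical):

```java
import java.util.Locale;

// Hypothetical pre-processing step applied to movie names before analysis.
class MovieNameNormalizer {

    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT)
                .replace('_', ' ')        // treat underscores as word separators
                .trim()
                .replaceAll("\\s+", " "); // collapse runs of whitespace
    }

    public static void main(String[] args) {
        System.out.println(normalize("1997_Titanic")); // 1997 titanic
    }
}
```

After this step, any analyzer that splits on whitespace and keeps digits will index "1997" and "titanic" as separate tokens.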
