
Searching for UUID in lucene not working

I've got a UUID field I'm adding to my document in the following format: 372d325c-e01b-432f-98bd-bc4c949f15b8. However, when I try to query for documents by the UUID it will not return them no matter how I try to escape the expression. For example:

+uuid:372d325c-e01b-432f-98bd-bc4c949f15b8
+uuid:"372d325c-e01b-432f-98bd-bc4c949f15b8"
+uuid:372d325c\-e01b\-432f\-98bd\-bc4c949f15b8
+uuid:(372d325c-e01b-432f-98bd-bc4c949f15b8)
+uuid:("372d325c-e01b-432f-98bd-bc4c949f15b8")

I even tried skipping the QueryParser altogether and using a TermQuery, like so:

new TermQuery(new Term("uuid", uuid.toString()))

Or

new TermQuery(new Term("uuid", QueryParser.escape(uuid.toString())))

None of these searches will return a document, but if I search for portions of the UUID it will return a document. For example these will return something:

+uuid:372d325c
+uuid:e01b
+uuid:432f

What should I do to index these documents so I can pull them back by their UUID? I've considered reformatting the UUID to remove the hyphens, but I haven't implemented it yet.

The only way I got this to work was to use WhitespaceAnalyzer instead of StandardAnalyzer when indexing, and then to search with a TermQuery. The writer is configured like so:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new WhitespaceAnalyzer(Version.LUCENE_36))
            .setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
writer = new IndexWriter( directory, config);

Then searching:

TopDocs docs = searcher.search(new TermQuery(new Term("uuid", uuid.toString())), 1);

WhitespaceAnalyzer prevented Lucene from splitting the UUID apart at the hyphens. Another option could be to eliminate the dashes from the UUID, but using the WhitespaceAnalyzer works just as well for my purposes.
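
For the dash-free alternative mentioned above, here is a minimal sketch using only the JDK. The class name UuidKeys and its two methods are hypothetical helpers invented for this illustration, not part of Lucene. It assumes the analyzer in use keeps an unbroken run of letters and digits as one token, which is worth verifying against your own index.

```java
import java.util.UUID;

/** Hypothetical helper (not part of Lucene): converts UUIDs to and from a hyphen-free key. */
final class UuidKeys {

    private UuidKeys() {}

    /** Returns the 32-character lowercase hex form of the UUID, with the hyphens removed. */
    static String toKey(UUID uuid) {
        return uuid.toString().replace("-", "");
    }

    /** Re-inserts the hyphens at the 8-4-4-4-12 positions and parses the result. */
    static UUID fromKey(String key) {
        if (key.length() != 32) {
            throw new IllegalArgumentException("Expected 32 hex characters: " + key);
        }
        String dashed = key.substring(0, 8) + "-" + key.substring(8, 12) + "-"
                + key.substring(12, 16) + "-" + key.substring(16, 20) + "-"
                + key.substring(20);
        return UUID.fromString(dashed);
    }

    public static void main(String[] args) {
        UUID uuid = UUID.fromString("372d325c-e01b-432f-98bd-bc4c949f15b8");
        String key = toKey(uuid);
        System.out.println(key);                       // 372d325ce01b432f98bdbc4c949f15b8
        System.out.println(fromKey(key).equals(uuid)); // true
    }
}
```

The same key form would have to be used both when adding the field to the document and when building the query term, otherwise the two will not match.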

According to the Lucene Query Syntax rules, the query

+uuid:372d325c\-e01b\-432f\-98bd\-bc4c949f15b8

should work.

I guess that if it doesn't, that is because the uuid field is not populated as it should be when the document is added to the index. Could you check exactly what is indexed for this field? You can use Luke to browse the index and look at the actual values stored for the uuid field.
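
As a rough illustration of what such an inspection might show, the sketch below compares two ways of cutting the UUID into terms. It is plain JDK code and NOT Lucene's tokenizer: the class TokenizeSketch and its methods are hypothetical stand-ins that only mimic the behaviour reported in the question, where fragments such as e01b match but the full UUID does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Rough stand-in for an analyzer. This is NOT Lucene code; it only mimics
 * breaking text at non-alphanumeric characters versus at whitespace.
 */
final class TokenizeSketch {

    private TokenizeSketch() {}

    /** Splits at every run of characters that are not ASCII letters or digits, then lowercases. */
    static List<String> splitOnNonAlphanumerics(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{Alnum}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Splits at whitespace only, so hyphens stay inside the token. */
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String uuid = "372d325c-e01b-432f-98bd-bc4c949f15b8";
        // Five separate terms: a lookup for the full UUID finds no such term.
        System.out.println(splitOnNonAlphanumerics(uuid));
        // One term holding the whole UUID: an exact term lookup can match.
        System.out.println(splitOnWhitespace(uuid));
    }
}
```

If the terms listed for the uuid field look like the first output, the field was tokenized at index time, and escaping the hyphens in the query cannot bring the full value back.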

If you plan to use a UUID field as a lookup key, you will need to ask Lucene to index the whole field as a single string, without tokenization. This is done by setting the right FieldType for your UUID field. In Lucene 4+, you can use StringField.

import java.io.IOException;
import java.util.UUID;
import org.junit.Assert;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

/**
 * Using Lucene 4.7 on Java 7.
 */
public class LuceneUUIDFieldLookupTest {

    private Directory directory;
    private Analyzer analyzer;

    @Test
    public void testUsingUUIDAsLookupKey() throws IOException, ParseException {

        directory = new RAMDirectory();
        analyzer = new StandardAnalyzer(Version.LUCENE_47);

        UUID docUUID = UUID.randomUUID();
        String docContentText1 = "Stack Overflow is a question and answer site for professional and enthusiast programmers.";

        index(docUUID, docContentText1);

        QueryParser parser = new QueryParser(Version.LUCENE_47, MyIndexedFields.DOC_TEXT_FIELD.name(), analyzer);
        Query queryForProgrammer = parser.parse("programmers");

        IndexSearcher indexSearcher = getIndexSearcher();
        TopDocs hits = indexSearcher.search(queryForProgrammer, Integer.MAX_VALUE);
        Assert.assertEquals(1, hits.scoreDocs.length);

        int internalDocId1 = hits.scoreDocs[0].doc;
        Document docRetrieved1 = indexSearcher.doc(internalDocId1);
        indexSearcher.getIndexReader().close();

        String docText1 = docRetrieved1.get(MyIndexedFields.DOC_TEXT_FIELD.name());
        Assert.assertEquals(docContentText1, docText1);

        String docContentText2 = "TechCrunch is a leading technology media property, dedicated to ... according to a new report from the Wall Street Journal confirmed by Google to TechCrunch.";
        reindex(docUUID, docContentText2);

        Query queryForTechCrunch = parser.parse("technology");
        // Open a new reader: the previous IndexSearcher only sees a point-in-time snapshot of the index.
        indexSearcher = getIndexSearcher();
        hits = indexSearcher.search(queryForTechCrunch, Integer.MAX_VALUE);
        Assert.assertEquals(1, hits.scoreDocs.length);

        int internalDocId2 = hits.scoreDocs[0].doc;
        Document docRetrieved2 = indexSearcher.doc(internalDocId2);
        indexSearcher.getIndexReader().close();

        String docText2 = docRetrieved2.get(MyIndexedFields.DOC_TEXT_FIELD.name());
        Assert.assertEquals(docContentText2, docText2);
    }

    private void reindex(UUID myUUID, String docContentText) throws IOException {
        try (IndexWriter indexWriter = new IndexWriter(directory, getIndexWriterConfig())) {
            Term term = new Term(MyIndexedFields.MY_UUID_FIELD.name(), myUUID.toString());
            indexWriter.updateDocument(term, buildDoc(myUUID, docContentText));
        }//auto-close
    }

    private void index(UUID myUUID, String docContentText) throws IOException {
        try (IndexWriter indexWriter = new IndexWriter(directory, getIndexWriterConfig())) {
            indexWriter.addDocument(buildDoc(myUUID, docContentText));
        }//auto-close
    }

    private IndexWriterConfig getIndexWriterConfig() {
        return new IndexWriterConfig(Version.LUCENE_47, analyzer).setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    }

    private Document buildDoc(UUID myUUID, String docContentText) {
        Document doc = new Document();
        doc.add(new Field(
                MyIndexedFields.MY_UUID_FIELD.name(),
                myUUID.toString(),
                StringField.TYPE_STORED));//use TYPE_STORED if you want to read it back in search result.

        doc.add(new Field(
                MyIndexedFields.DOC_TEXT_FIELD.name(),
                docContentText,
                TextField.TYPE_STORED));

        return doc;
    }

    private IndexSearcher getIndexSearcher() throws IOException {
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(ireader);
        return indexSearcher;
    }

    enum MyIndexedFields {

        MY_UUID_FIELD,
        DOC_TEXT_FIELD
    }
}
