Lucene search of two or more words not working on Android

Question

I am using Lucene 3.6.2 on Android. The code I use for indexing and searching is below.

Indexing Code:

public void indexBookContent(Book book, File externalFilesDir) throws Exception {
    IndexWriter indexWriter = null;
    NIOFSDirectory directory = null;

    directory = new NIOFSDirectory(new File(externalFilesDir.getPath() + "/IndexFile", book.getBookId()));
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(LUCENE_36, new StandardAnalyzer(LUCENE_36));
    indexWriter = new IndexWriter(directory, indexWriterConfig);

    Document document = createFieldsForContent();

    String pageContent = Html.fromHtml(decryptedPage).toString();
    ((Field) document.getFieldable("content")).setValue(pageContent);
    ((Field) document.getFieldable("content")).setValue(pageContent);
    ((Field) document.getFieldable("content")).setValue(pageContent.toLowerCase());
}

private Document createFieldsForContent() {
    Document document = new Document();

    Field contentFieldLower = new Field("content", "", YES, NOT_ANALYZED);
    document.add(contentFieldLower);
    Field contentField = new Field("content", "", YES, ANALYZED);
    document.add(contentField);
    Field contentFieldNotAnalysed = new Field("content", "", YES, NOT_ANALYZED);
    document.add(contentFieldNotAnalysed);
    Field recordIdField = new Field("recordId", "", YES, ANALYZED);
    document.add(recordIdField);
    return document;
}

public JSONArray searchBook(String bookId, String searchText, File externalFieldsDir, String filter) throws Exception {
    List<SearchResultData> searchResults = null;
    NIOFSDirectory directory = null;
    IndexReader indexReader = null;
    IndexSearcher indexSearcher = null;

    directory = new NIOFSDirectory(new File(externalFieldsDir.getPath() + "/IndexFile", bookId));
    indexReader = IndexReader.open(directory);
    indexSearcher = new IndexSearcher(indexReader);

    Query finalQuery = constructSearchQuery(searchText, filter);

    TopScoreDocCollector collector = TopScoreDocCollector.create(100, false);
    indexSearcher.search(finalQuery, collector);
    ScoreDoc[] scoreDocs = collector.topDocs().scoreDocs;
}

private Query constructSearchQuery(String searchText, String filter) throws ParseException {
    QueryParser contentQueryParser = new QueryParser(LUCENE_36, "content", new StandardAnalyzer(LUCENE_36));
    contentQueryParser.setAllowLeadingWildcard(true);
    contentQueryParser.setLowercaseExpandedTerms(false);

    String wildCardSearchText = "*" + QueryParser.escape(searchText) + "*";

    // Query Parser used.
    Query contentQuery = contentQueryParser.parse(wildCardSearchText);
    return contentQueryParser.parse(wildCardSearchText);
}

I have gone through "Lucene: Multi-word phrases as search terms", and my logic does not seem to be any different.

I suspect that the field values are getting overwritten. Also, I need Chinese language support, which works with this code apart from the problem with searches of two or more words.

Answer 1

One note, up front:

A search implementation like this immediately looks a bit strange. It appears to be an overly complicated way to do a linear search through all the available strings. I don't know exactly what you need to accomplish, but I suspect you would be better served by analyzing your text appropriately than by running a double wildcard against keyword-analyzed text, which will perform poorly and give you little flexibility in searching.

Moving on to more specific issues:

You are indexing the same content into the same field several times, each time with different analysis settings.

Field contentFieldLower = new Field("content", "", YES, NOT_ANALYZED);
document.add(contentFieldLower);
Field contentField = new Field("content", "", YES, ANALYZED);
document.add(contentField);
Field contentFieldNotAnalysed = new Field("content", "", YES, NOT_ANALYZED);
document.add(contentFieldNotAnalysed);

Instead, if you really need all these analysis methods to be available for searching, you should probably be indexing them in distinct fields. Searching these together doesn't make sense, so they shouldn't be in the same field.

Then you have this sort of pattern:

Field contentField = new Field("content", "", YES, ANALYZED);
document.add(contentField);
//Somewhat later
((Field) document.getFieldable("content")).setValue(pageContent);

Don't do this; it doesn't make sense. Just pass your content into the constructor, and add the field to your document:

Field contentField = new Field("content", pageContent, YES, ANALYZED);
document.add(contentField);

This matters especially if you do opt to keep indexing the same field name in multiple ways: getFieldable gives you no way to pick one particular Field among them, because it always returns the first one added.
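To make the consequence concrete, here is a toy model in plain Java. It does not use Lucene classes: ToyDocument and its methods are hypothetical stand-ins that only imitate the "first field added wins" lookup. It replays the three setValue calls from the question against three same-named fields.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyDocument {
    // Each field is a {name, value} pair, kept in insertion order.
    private final List<String[]> fields = new ArrayList<>();

    public void add(String name, String value) {
        fields.add(new String[] {name, value});
    }

    // Imitates the lookup described above: the first field with this name is returned.
    public String[] getFirst(String name) {
        for (String[] field : fields) {
            if (field[0].equals(name)) {
                return field;
            }
        }
        return null;
    }

    // Collects the values of every field with this name, in insertion order.
    public List<String> valuesOf(String name) {
        List<String> values = new ArrayList<>();
        for (String[] field : fields) {
            if (field[0].equals(name)) {
                values.add(field[1]);
            }
        }
        return values;
    }

    // Replays the pattern from the question: three fields, three setValue calls.
    public static List<String> replayQuestionCode(String pageContent) {
        ToyDocument document = new ToyDocument();
        document.add("content", ""); // stands in for contentFieldLower
        document.add("content", ""); // stands in for contentField
        document.add("content", ""); // stands in for contentFieldNotAnalysed
        document.getFirst("content")[1] = pageContent;
        document.getFirst("content")[1] = pageContent;
        document.getFirst("content")[1] = pageContent.toLowerCase();
        return document.valuesOf("content");
    }

    public static void main(String[] args) {
        System.out.println(replayQuestionCode("Two Terms")); // prints [two terms, , ]
    }
}
```

In this model only the first field ends up with a value (the lower-cased text), and the other two stay empty. That is consistent with the suspicion in the question that the values overwrite each other.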

And this query:

String wildCardSearchText = "*" + QueryParser.escape(searchText) + "*";

As you mentioned, this won't work well with multiple terms, because it runs afoul of QueryParser syntax. What you end up with is something like *two terms*, which will be searched as:

field:*two field:terms*

That won't generate any matches against your keyword field (presumably). The QueryParser doesn't handle this sort of query well at all, so you'll need to construct the wildcard query yourself:

WildcardQuery query = new WildcardQuery(new Term("field", "*two terms*"));
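The difference between the two approaches can be sketched in plain Java, without Lucene, under simplifying assumptions. Both helpers are hypothetical: splitClauses only imitates the parser treating whitespace as a clause separator (the real QueryParser grammar is more involved), and wildcardMatch is a simple matcher for * and ? against one whole un-analyzed term, which is roughly the job a wildcard query has against a keyword field.

```java
import java.util.ArrayList;
import java.util.List;

public class WildcardSketch {
    // Simplified model: whitespace separates clauses, each bound to the default field.
    public static List<String> splitClauses(String field, String queryText) {
        List<String> clauses = new ArrayList<>();
        for (String part : queryText.trim().split("\\s+")) {
            clauses.add(field + ":" + part);
        }
        return clauses;
    }

    // Matches the whole term: '*' is any run of characters, '?' is exactly one. Case-sensitive.
    public static boolean wildcardMatch(String pattern, String term) {
        int p = 0, t = 0, starP = -1, starT = -1;
        while (t < term.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                starP = p++;   // remember the star so we can backtrack to it
                starT = t;
            } else if (p < pattern.length()
                    && (pattern.charAt(p) == '?' || pattern.charAt(p) == term.charAt(t))) {
                p++;
                t++;
            } else if (starP >= 0) {
                p = starP + 1; // let the last star absorb one more character
                t = ++starT;
            } else {
                return false;
            }
        }
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;               // trailing stars match the empty string
        }
        return p == pattern.length();
    }

    public static void main(String[] args) {
        String keywordTerm = "a page with two terms inside"; // whole text as one term
        System.out.println(splitClauses("content", "*two terms*"));
        System.out.println(wildcardMatch("*two terms*", keywordTerm));
        System.out.println(wildcardMatch("*two", keywordTerm));
        System.out.println(wildcardMatch("terms*", keywordTerm));
    }
}
```

In this model the single pattern *two terms* matches the keyword term, while neither of the two split clauses does. The matcher is case-sensitive, so against lower-cased stored text like that in the question, the search text would need to be lower-cased as well.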

Answer 1
1 ACCEPTED 2014-06-20 16:55:24
