
Impact of document size in Lucene

I have just started reading up on Lucene. In one of the examples provided, an entire file was being added to a Document prior to adding the Document to an Index.

However, the documentation suggested that this indexing technique would not give good performance; the recommended approach is to store each line of the file in a separate Document.

I was curious to know how this helps to improve indexing performance.

Also, I wanted to validate my understanding that, to add every line of the file as a Document field, we first have to tokenize the line to obtain the tokens and then create a field from them.

Even if you don't take performance into account, these two approaches won't yield the same results. If you have a single document whose first line is "fox" and second line is "dog", and if you search for "fox" AND "dog", there will be no results with the second approach.
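To make that concrete, here is a minimal, hypothetical sketch (written against the Lucene 8.x/9.x API; the field name contents and the in-memory ByteBuffersDirectory are my choices, not from the original example) that indexes the two lines as separate documents and shows that a fox AND dog query finds nothing:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LineVsFileDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // Second approach: one Lucene Document per line of the file.
            Document line1 = new Document();
            line1.add(new TextField("contents", "fox", Field.Store.NO));
            writer.addDocument(line1);

            Document line2 = new Document();
            line2.add(new TextField("contents", "dog", Field.Store.NO));
            writer.addDocument(line2);
        }

        // fox AND dog: both terms must occur in the *same* document.
        BooleanQuery query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("contents", "fox")), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("contents", "dog")), BooleanClause.Occur.MUST)
                .build();

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Prints 0: no single per-line document contains both terms.
            System.out.println(searcher.search(query, 10).totalHits.value);
        }
    }
}
```

Index the whole file as one document instead, and the same query matches it, because both terms then occur in the same document's field.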

Regarding your second question: no, you don't need to perform any tokenization before creating documents and fields. Tokenization is performed when you call IndexWriter#addDocument(Document).
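A minimal sketch of that point (same hedged assumptions as above: Lucene 8.x/9.x API, invented field name): the raw line goes into a TextField untouched, and the analyzer configured on the writer tokenizes it during addDocument:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class NoManualTokenization {
    public static void main(String[] args) throws Exception {
        // The analyzer handed to the writer does all the tokenization.
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {
            Document doc = new Document();
            // Pass the raw, untokenized line as-is; TextField marks it for analysis.
            doc.add(new TextField("contents", "The quick brown fox jumps over the lazy dog", Field.Store.NO));
            writer.addDocument(doc); // tokenization happens here, not before
        }
    }
}
```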

If you are getting started with Lucene, I highly recommend you read the demo code. This will show you how to create and then search a Lucene index.

And if indexing speed is critical for the application you are developing, there is very good advice on the Lucene wiki.
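For illustration only, a hedged sketch of one knob those tips commonly emphasize, the writer's RAM buffer (Lucene 8.x/9.x API; the index path and the 256 MB figure are arbitrary choices of mine):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class FasterIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // A larger RAM buffer means fewer flushes to disk while indexing.
        config.setRAMBufferSizeMB(256);
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config)) {
            // Reuse this single writer for the whole run; IndexWriter is
            // thread-safe, so multiple threads may call addDocument() on it.
            writer.commit();
        }
    }
}
```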
