
Impact of document size in Lucene

I have just started reading up on Lucene. In one of the examples provided, an entire file was being added to a Document prior to adding the Document to an Index.

However, the documentation suggested that this indexing technique would not give good performance; the recommended approach is to store each line of the file in a separate Document.

I was curious to know how this helps to improve indexing performance.

Also, I wanted to validate my understanding that, to add every line of the file as a Document field, we first have to tokenize the line to obtain the tokens and then create a field from them.

Even if you don't take performance into account, these two approaches won't yield the same results. If you have a single document whose first line is "fox" and second line is "dog", and if you search for "fox" AND "dog", there will be no results with the second approach.
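To make that concrete, here is a minimal, hypothetical sketch (written against the Lucene 8.x/9.x API; the field name contents and the in-memory ByteBuffersDirectory are my choices, not from the original example) that indexes the two lines as separate documents and shows that a fox AND dog query finds nothing:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LineVsFileDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // Second approach: one Lucene Document per line of the file.
            Document line1 = new Document();
            line1.add(new TextField("contents", "fox", Field.Store.NO));
            writer.addDocument(line1);

            Document line2 = new Document();
            line2.add(new TextField("contents", "dog", Field.Store.NO));
            writer.addDocument(line2);
        }

        // fox AND dog: both terms must occur in the *same* document.
        BooleanQuery query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("contents", "fox")), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("contents", "dog")), BooleanClause.Occur.MUST)
                .build();

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Prints 0: no single per-line document contains both terms.
            System.out.println(searcher.search(query, 10).totalHits.value);
        }
    }
}
```

Index the whole file as one document instead, and the same query matches it, because both terms then occur in the same document's field.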

Regarding your second question: no, you don't need to perform any tokenization before creating documents and fields. Tokenization is performed when you call IndexWriter#addDocument(Document).
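A minimal sketch of that point (same hedged assumptions as above: Lucene 8.x/9.x API, invented field name): the raw line goes into a TextField untouched, and the analyzer configured on the writer tokenizes it during addDocument:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class NoManualTokenization {
    public static void main(String[] args) throws Exception {
        // The analyzer handed to the writer does all the tokenization.
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {
            Document doc = new Document();
            // Pass the raw, untokenized line as-is; TextField marks it for analysis.
            doc.add(new TextField("contents", "The quick brown fox jumps over the lazy dog", Field.Store.NO));
            writer.addDocument(doc); // tokenization happens here, not before
        }
    }
}
```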

If you are getting started with Lucene, I highly recommend you read the demo code. This will show you how to create and then search a Lucene index.

And if indexing speed is critical for the application you are developing, there is very good advice on the Lucene wiki.
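For illustration only, a hedged sketch of one knob those tips commonly emphasize, the writer's RAM buffer (Lucene 8.x/9.x API; the index path and the 256 MB figure are arbitrary choices of mine):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class FasterIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // A larger RAM buffer means fewer flushes to disk while indexing.
        config.setRAMBufferSizeMB(256);
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config)) {
            // Reuse this single writer for the whole run; IndexWriter is
            // thread-safe, so multiple threads may call addDocument() on it.
            writer.commit();
        }
    }
}
```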
