
How much storage and query time does Lucene save when a field is stored as a Byte instead of a String across billions of documents?

I understand the concept of an inverted index and how optimizing dictionary storage can help load the entire dictionary into main memory for faster queries.

I am trying to understand how a Lucene index works.

Suppose I have a String field that has only four distinct values across the 200 billion documents indexed in Lucene. It is a stored field.

Now suppose I change the field to a Byte or Int type to represent the four distinct values, then re-index and store all 200 billion documents.

What storage and query improvements, if any, would this data type change bring?

Please suggest a test I could run on my laptop to get a sense of the difference.

As far as I know, a document in Lucene consists of a simple list of field-value pairs. A field must have at least one value, but any field can contain multiple values. Similarly, a single string value may be converted into multiple values by the analysis process.
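To make that document model concrete, here is a minimal sketch in plain Python. It does not use Lucene's API; the `analyze` function, the field names, and the values are all made up for illustration. It shows a field appearing more than once and a single string being split into several terms by a toy analysis step.

```python
from collections import defaultdict


def analyze(text):
    """Toy stand-in for an analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


# A document modeled as a list of (field, value) pairs.
# The same field name may appear more than once (multi-valued field).
doc = [
    ("status", "ACTIVE"),             # single value, kept as one term
    ("tag", "search"),                # "tag" has two values
    ("tag", "index"),
    ("title", "Lucene in Action"),    # analyzed into several terms
]

ANALYZED_FIELDS = {"title"}  # hypothetical choice of which fields to analyze

terms_by_field = defaultdict(list)
for field, value in doc:
    if field in ANALYZED_FIELDS:
        terms_by_field[field].extend(analyze(value))
    else:
        terms_by_field[field].append(value)

for field, terms in terms_by_field.items():
    print(field, terms)
```

The point of the sketch is only that one string value can end up as several terms, while an unanalyzed value such as the four-valued field in the question stays a single term per document.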

Lucene doesn't care if the values are strings or numbers or dates. All values are just treated as opaque bytes.
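If you want a rough feel for the numbers on a laptop before building a real index, the following Python sketch compares the raw byte size of the two encodings on a sample and scales it up to 200 billion documents. Everything in it is an assumption for illustration: the four label strings are invented, the values are drawn uniformly at random, and `zlib` is only a stand-in for compression in general. Lucene compresses stored fields in blocks with its own codec (LZ4 by default) and adds per-document overhead, so this is not a prediction of actual index size.

```python
import random
import zlib

NUM_DOCS_TOTAL = 200_000_000_000   # document count from the question
SAMPLE_DOCS = 100_000              # small enough to run quickly on a laptop

# Hypothetical labels; the question does not say what the four values are.
LABELS = ["PENDING", "ACTIVE", "SUSPENDED", "CLOSED"]


def raw_size(values):
    """Total number of payload bytes, ignoring any length prefixes."""
    return sum(len(v) for v in values)


rng = random.Random(42)  # fixed seed so repeated runs are comparable
codes = [rng.randrange(len(LABELS)) for _ in range(SAMPLE_DOCS)]

as_strings = [LABELS[c].encode("utf-8") for c in codes]  # string encoding
as_bytes = [bytes([c]) for c in codes]                   # one byte per value

scale = NUM_DOCS_TOTAL / SAMPLE_DOCS
for name, values in (("string", as_strings), ("byte", as_bytes)):
    raw = raw_size(values)
    # Compress the concatenated sample as a crude proxy for block compression.
    packed = len(zlib.compress(b"".join(values)))
    print(
        f"{name:>6}: raw {raw * scale / 1e9:,.0f} GB, "
        f"zlib-compressed {packed * scale / 1e9:,.0f} GB (extrapolated)"
    )
```

The more reliable test is to index the same sample of documents into two separate Lucene index directories, one per field type, and compare the directory sizes on disk and the query latencies directly.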

For more information, please see this document.
