
Lucene indexer goes OutOfMemory on a tiny document collection

I'm trying to build an index of several text documents.

Their content is just tab-separated strings of fields:

WORD<\t>w1<\t>w2<\t>...<\t>wn

POS<\t>pos1<\t>pos2_a:pos2_b:pos2_c<\t>...<\t>posn_a:posn_b
...

For the POS field, ':'-separated tokens correspond to the same ambiguous word.

There are 5 documents with a total size of 10 MB. While indexing, Java uses about 2 GB of RAM and eventually throws an OutOfMemoryError.

String join_token = tok.nextToken();
// atomic tokens correspond to separate parses
String[] atomic_tokens = StringUtils.split(join_token, ':');
// marking each token with the parse number
for (int token_index = 0; token_index < atomic_tokens.length; ++token_index) {
  atomic_tokens[token_index] += String.format("|%d", token_index);
}
String join_token_with_payloads = StringUtils.join(atomic_tokens, " ");
TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_41, // OOM exception appears here
                                             new StringReader(join_token_with_payloads));
// all these parses belong to the same position in the document
stream = new PositionFilter(stream, 0);
stream = new DelimitedPayloadTokenFilter(stream, '|', new IntegerEncoder());
stream.addAttribute(OffsetAttribute.class);
stream.addAttribute(CharTermAttribute.class);
feature = new Field(name,
                    join_token,
                    attributeFieldType);
feature.setTokenStream(stream);
inDocument.add(feature);

What is wrong with this code from a memory point of view, and how can I do the indexing while holding as little data as possible in RAM?

If I understood the problem right (I didn't try it out), these are my suggestions:

  1. It's good practice to use camel case in the code, which is the convention for Java.
  2. You don't need to generate the positions manually: just create a field with Field.TermVector.WITH_POSITIONS_OFFSETS and the positions and offsets will end up in the index (see the first sketch after this list).
  3. Creating such huge arrays of String causes a really big memory overhead -> use a StringBuilder (second sketch below).
  4. Use LetterTokenizer to tokenize the stream, or write your own tokenizer by extending CharTokenizer (third sketch below).
  5. Btw, Lucene in Action is a great book.
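
For point 2, a minimal sketch, assuming Lucene 4.x (the question uses Version.LUCENE_41), where the 3.x Field.TermVector.WITH_POSITIONS_OFFSETS enum corresponds to a set of FieldType flags; name, joinToken and inDocument follow the question's variables (camel-cased per point 1):

import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo.IndexOptions;

// FieldType flags equivalent to the 3.x Field.TermVector.WITH_POSITIONS_OFFSETS
FieldType vectorType = new FieldType();
vectorType.setIndexed(true);
vectorType.setTokenized(true);
vectorType.setStoreTermVectors(true);
vectorType.setStoreTermVectorPositions(true);
vectorType.setStoreTermVectorOffsets(true);
vectorType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
vectorType.freeze();

// Lucene records positions and offsets itself; no manual PositionFilter needed.
Field feature = new Field(name, joinToken, vectorType);
inDocument.add(feature);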
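For point 3, a sketch of building the payload-annotated string in one pass with a single StringBuilder, instead of split() + per-element concatenation + join(); variable names follow the question's code:

// One pass over join_token: no intermediate String[] and no join().
StringBuilder sb = new StringBuilder(join_token.length() + 16);
int parse = 0;
for (int start = 0; start <= join_token.length(); ) {
    int end = join_token.indexOf(':', start);
    if (end < 0) end = join_token.length();
    if (parse > 0) sb.append(' ');
    // append the parse and mark it with its number, e.g. "pos2_a|0"
    sb.append(join_token, start, end).append('|').append(parse++);
    start = end + 1;
}
String join_token_with_payloads = sb.toString();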
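For point 4, a sketch of a custom tokenizer, assuming Lucene 4.x, where CharTokenizer lives in org.apache.lucene.analysis.util; ColonTokenizer is a hypothetical name. It splits on ':' as well as whitespace, so the split/join preprocessing is not needed at all:

import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.Version;

// Emits a token for every run of characters that is neither ':' nor whitespace.
public final class ColonTokenizer extends CharTokenizer {
    public ColonTokenizer(Version matchVersion, Reader in) {
        super(matchVersion, in);
    }

    @Override
    protected boolean isTokenChar(int c) {
        return c != ':' && !Character.isWhitespace(c);
    }
}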
