
Lucene indexer goes OutOfMemory on a tiny document collection

I'm trying to build an index of several text documents.

Their content is just tab-separated strings of fields:

WORD<\t>w1<\t>w2<\t>...<\t>wn

POS<\t>pos1<\t>pos2_a:pos2_b:pos2_c<\t>...<\t>posn_a:posn_b
...

For the POS field, ':'-separated tokens correspond to the same ambiguous word.

There are 5 documents with a total size of 10 MB. While indexing, Java uses about 2 GB of RAM and eventually throws an OutOfMemoryError.

String join_token = tok.nextToken();
// atomic tokens correspond to separate parses
String[] atomic_tokens = StringUtils.split(join_token, ':');
// marking each token with the parse number
for (int token_index = 0; token_index < atomic_tokens.length; ++token_index) {
  atomic_tokens[token_index] += String.format("|%d", token_index);
}
String join_token_with_payloads = StringUtils.join(atomic_tokens, " ");
TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_41, // OOM exception appears here
                                             new StringReader(join_token_with_payloads));
// all these parses belong to the same position in the document
stream = new PositionFilter(stream, 0);
stream = new DelimitedPayloadTokenFilter(stream, '|', new IntegerEncoder());
stream.addAttribute(OffsetAttribute.class);
stream.addAttribute(CharTermAttribute.class);
feature = new Field(name,
                    join_token,
                    attributeFieldType);
feature.setTokenStream(stream);
inDocument.add(feature);

What is wrong with this code from a memory point of view, and how can I do the indexing while holding as little data as possible in RAM?

If I understood the problem right (I didn't try it out), these are my suggestions:

  1. It's good practice to use camel case in the code, which is the convention for Java.
  2. You don't need to generate the positions manually: just create a field with Field.TermVector.WITH_POSITIONS_OFFSETS and the positions and offsets will end up in the index (see the first sketch after this list).
  3. Creating such huge arrays of String causes a really big memory overhead -> use a StringBuilder (second sketch below).
  4. Use LetterTokenizer to tokenize the stream, or write your own tokenizer by extending CharTokenizer (third sketch below).
  5. Btw, Lucene in Action is a great book.
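
For point 2, a minimal sketch, assuming Lucene 4.x (the question uses Version.LUCENE_41), where the 3.x Field.TermVector.WITH_POSITIONS_OFFSETS enum corresponds to a set of FieldType flags; name, joinToken and inDocument follow the question's variables (camel-cased per point 1):

import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo.IndexOptions;

// FieldType flags equivalent to the 3.x Field.TermVector.WITH_POSITIONS_OFFSETS
FieldType vectorType = new FieldType();
vectorType.setIndexed(true);
vectorType.setTokenized(true);
vectorType.setStoreTermVectors(true);
vectorType.setStoreTermVectorPositions(true);
vectorType.setStoreTermVectorOffsets(true);
vectorType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
vectorType.freeze();

// Lucene records positions and offsets itself; no manual PositionFilter needed.
Field feature = new Field(name, joinToken, vectorType);
inDocument.add(feature);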
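For point 3, a sketch of building the payload-annotated string in one pass with a single StringBuilder, instead of split() + per-element concatenation + join(); variable names follow the question's code:

// One pass over join_token: no intermediate String[] and no join().
StringBuilder sb = new StringBuilder(join_token.length() + 16);
int parse = 0;
for (int start = 0; start <= join_token.length(); ) {
    int end = join_token.indexOf(':', start);
    if (end < 0) end = join_token.length();
    if (parse > 0) sb.append(' ');
    // append the parse and mark it with its number, e.g. "pos2_a|0"
    sb.append(join_token, start, end).append('|').append(parse++);
    start = end + 1;
}
String join_token_with_payloads = sb.toString();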
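For point 4, a sketch of a custom tokenizer, assuming Lucene 4.x, where CharTokenizer lives in org.apache.lucene.analysis.util; ColonTokenizer is a hypothetical name. It splits on ':' as well as whitespace, so the split/join preprocessing is not needed at all:

import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.Version;

// Emits a token for every run of characters that is neither ':' nor whitespace.
public final class ColonTokenizer extends CharTokenizer {
    public ColonTokenizer(Version matchVersion, Reader in) {
        super(matchVersion, in);
    }

    @Override
    protected boolean isTokenChar(int c) {
        return c != ':' && !Character.isWhitespace(c);
    }
}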
