简体   繁体   中英

A memory-efficient large array of words

I am looking for a Java data structure for storing a large text (about a million words), such that I can get a word by index (for example, get the 531467 word).

The problem with String[] or ArrayList is that they take too much memory - about 40 bytes per word on my environment.

I thought of using a String[] where each element is a chunk of 10 words, joined by a space. This is much more memory-efficient - about 20 bytes per word; but the access is much slower.

Is there a more efficient way to solve this problem?

As Jon Skeet already mentioned, 40mb isn't too large.

But you stated that you are storing a text, so there may be many same Strings. For example stop words like "and" and "or".

You can use String.intern()[1]. This will pool your String and returns a reference to an already existing String.

intern() quite slow, so you can replace this by a HashMap that will do the same trick for you.

[1] http://download.oracle.com/javase/6/docs/api/java/lang/String.html#intern%28%29

You could look at using memory mapping the data structure , but performance might be completely horrible.

Store all the words in a single string:

class WordList {

    private final String content;
    private final int[] indices;

    public WordList(Collection<String> words) {
        StringBuilder buf = new StringBuilder();
        indices = new int[words.size()];
        int currentWordIndex = 0;
        int previousPosition = 0;
        for (String word : words) {
            buf.append(word);
            indices[currentWordIndex++] = previousPosition;
            previousPosition += word.length();
        }
        content = buf.toString();
    }

    public String wordAt(int index) {
        if (index == indices.length - 1) return content.substring(indices[index]);
        return content.substring(indices[index], indices[index + 1]);
    }

    public static void main(String... args) {
        WordList list = new WordList(Arrays.asList(args));
        for (int i = 0; i < args.length; ++i) {
            System.out.printf("Word %d: %s%n", i, list.wordAt(i));
        }
    }

}

Apart from the characters they contain, each word has an overhead of four bytes using this solution (the entry in indices ). Retrieving a word with wordAt will always allocate a new string; you could avoid this by saving the toString() of the StringBuilder rather than the builder itself, although it uses more memory on construction.

Depending on the kind of text, language, and more, you might want a solution that deals with recurring words better (like the one previously proposed ).

One option would be to store byte arrays instead with the text encoded in UTF-8:

byte[][] words = ...;

Then:

public String getWord(int index)
{
   return new String(words[index], "UTF-8");
}

This will be smaller in two ways:

  • The data for each string is directly in a byte[], rather than the String having a couple of integer members and a reference to a separate char[] object
  • If your text is mostly-ASCII, you'll benefit from UTF-8 using a single byte per character for those ASCII characters

I wouldn't really recommend this approach though... again it will be slower on access, as it needs to create a new String each time. Fundamentally, if you need a million string objects (so you don't want to pay the recreation penalty each time) then you're going to have to use the memory for a million string objects...

You could create a datastructure like this:

  • List<string> wordlist
  • Dictionary<string, int> tsildrow // for reverse lookup while building the structure
  • List<int> wordindex

wordlist will contain a list of all (unique) words, tsildrow will give the index of a word in wordlist and wordindex will tell you the index in wordlist of a specific index in your text.

You would operate it in the following fashion:

for word in text:
    if not word in tsildrow:
        wordlist.append(word)
        tsildrow.add(word, wordlist.last_index)
    wordindex.append(tsildrow[word])

this fills up your datastructure. Now, to find the word at index 531467:

print wordlist[wordindex[531467]]

you can reproduce the entire text like this:

for index in wordindex:
    print wordlist[index] + ' '

except, that you will still have a problem of punctuation etc...

if you won't be adding any more words (ie your text is stable), you can delete tsildrow to free up some memory if this is a concern of yours.

-XX:+UseCompressedStrings

Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

Seems like a interesting article: http://www.javamex.com/tutorials/memory/string_saving_memory.shtml

I hear ropes are quite good in terms of speed in storing large strings, though not sure memory wise. But you might want to check it out. http://ahmadsoft.org/ropes/ http://en.wikipedia.org/wiki/Rope_%28computer_science%29

OK, I have experimented with several of your suggestions, and here are my results (I checked (Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory()) before filling the array, and checked again after filling the array and gc()):

  • Original (array of strings): 54 bytes/word (not 40 as I mistakenly wrote)
  • My solution (array of chunks of strings, separated by spaces):
    • 2 words per chunk - 36 b/w (but unacceptable performance)
    • 10 words per chunk - 18 b/w
    • 100 words per chunk - 14 b/w
  • byte arrays - 40 b/w
  • char arrays - 36 b/w
  • HashMap, either mapping a string to itself, or mapping a string to its index - 26 b/w
    • (not sure I implemented this correctly)
  • intern - 10 b/w
  • baseline (empty array) - 4 b/w

The average word length is about 3 chars, and most chars are non-ASCII so it's probably about 6 bytes. So, it seems that intern is close to the optimum. It makes sense, since it's an array of words, and many of the words appear much more than once.

I would probably consider using a file, with either fixed sized words or some sort of index. FileInputStream with skip can be pretty efficient

If you have a mobile device you can use TIntArrayList which would use 4 bytes per int value. If you use one index per word it will one need a couple of MB. You can also use int[]

If you have a PC or server, this is trivial amount of memory. Memory cost about £6 per GB or 1 cent per MB.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM