
Storing a large number of configurations in Java

I have a datatype (let's call it Data) that contains 2 pieces of information:

int config
byte weight

This datatype is the packed form of a series of 32 booleans. I have to make changes to these 32 booleans, convert them back to this datatype, and store the result. I want to store only unique entries, eliminating any duplicates, but there exist 2^33 possible configurations for this datatype.
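
To illustrate, the conversion could look something like this, with boolean i mapped to bit i of config (the exact mapping doesn't matter for the question):

static int pack(boolean[] flags) {
    int config = 0;
    for (int i = 0; i < 32; i++) {
        if (flags[i]) {
            config |= 1 << i; // set bit i
        }
    }
    return config;
}

static boolean[] unpack(int config) {
    boolean[] flags = new boolean[32];
    for (int i = 0; i < 32; i++) {
        flags[i] = (config & (1 << i)) != 0; // test bit i
    }
    return flags;
}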

I have tried something like this:

static class searchedconfigs {
    Data[] searchedconfigs;
    int position;

    public searchedconfigs() {
        searchedconfigs = new Data[150000];
    }

    public void initiateposition() {
        position = 0;
    }

    public boolean searchfield(Data key, int entries) {
        boolean exists = false;
        for (int i = 0; i <= entries; i++) {
            // NOTE: == compares object references, not field values, so two
            // distinct Data objects holding the same config/weight never match.
            if (searchedconfigs[i] == key) {
                System.out.println("break");
                exists = true;
                break;
            }
        }
        return exists;
    }

    public void add(Data config, int position) {
        searchedconfigs[position] = config;
    }

    public int getPosition() {
        return position;
    }

    public void storePosition() {
        position++;
    }
}

The position is initialized and incremented so that each time I search the array I only look at the occupied positions. My problem is that, as you can see, the array is only of size 150000, which I need to be much bigger. However, even sizing it near Integer.MAX_VALUE (and I would need a long to express the size I actually need) causes an out-of-memory error. Furthermore, my searchfield function does not seem to correctly compare the key with the config stored at each position.

Can anyone tell me how to fix these mistakes, or suggest a different approach to storing this data?

Use a HashSet, and implement equals and hashCode in Data, like so:

import java.util.Objects;

class Data {
    int config;
    byte weight;

    @Override
    public int hashCode() {
        return Objects.hash(config, weight);
    }

    @Override
    public boolean equals(Object other) {
        if (other == this) return true;
        if (!(other instanceof Data)) return false; // instanceof already rejects null

        // The cast is required: 'other' is declared as Object, so its
        // fields cannot be accessed without it.
        Data that = (Data) other;
        return this.config == that.config && this.weight == that.weight;
    }
}

Sets of any kind do not contain duplicate elements. But since your Data class appears to be a value type (i.e. the member values matter more than object identity when comparing for equality), failing to implement these two methods will still leave duplicates in your data structure of choice, because the default equals compares references.
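
With those two methods in place, a HashSet deduplicates automatically. A quick usage sketch (DedupDemo is just an illustrative wrapper; it assumes the Data class above, reachable from the same package):

import java.util.HashSet;
import java.util.Set;

class DedupDemo {
    public static void main(String[] args) {
        Set<Data> seen = new HashSet<>();

        Data a = new Data();
        a.config = 42; a.weight = 1;

        Data b = new Data();
        b.config = 42; b.weight = 1; // same field values, different object

        System.out.println(seen.add(a)); // true  -> newly inserted
        System.out.println(seen.add(b)); // false -> rejected as a duplicate
        System.out.println(seen.size()); // 1
    }
}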

What space limitation are you actually running into? Arrays in Java are limited to Integer.MAX_VALUE (2^31 - 1) elements. Are you overrunning:

  • Maximum number of elements in an array?
  • The heap allocated to the JVM?
  • The available RAM + swap space on the machine?

If it's the number of elements, then look at an alternative data structure (see below). If you're overrunning the heap, then allocate more memory to your application (the -Xmx argument to the JVM when running your program). If you're actually running out of memory on the box, space-saving tricks will only get you so far; eventually data growth will outpace them. At that point you need to look at either horizontal scaling (distributed computing) or vertical scaling (a bigger box with more RAM).
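
For example, to run with a 4 GB maximum heap (MyApp stands in for your actual main class):

java -Xmx4g MyApp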

If you're simply overrunning an array because it can't be sized beyond max int, and space is really a concern, I'd avoid a HashSet, as it takes more space than either a plain List/array or an alternative Set implementation like TreeSet.

For a HashSet to work efficiently, it needs an oversized hash table to reduce the number of hash collisions. HashSet in Java has a default load factor of 0.75, which means that once it is more than 75% full it resizes itself to stay under the load factor. In general you're trading a larger amount of space for faster insertion/removal/lookup of elements in the set, which I believe is constant time (O(1)).
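
If you know roughly how many elements you'll store, you can pre-size the set so it never has to resize while filling; both parameters of this constructor are standard HashSet API (the helper class is just a sketch):

import java.util.HashSet;
import java.util.Set;

class Presized {
    // Capacity chosen so 'expected' insertions never trigger a rehash
    // at the default 0.75 load factor.
    static Set<Data> withCapacityFor(int expected) {
        return new HashSet<>((int) (expected / 0.75f) + 1, 0.75f);
    }
}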

A TreeSet should only require storage proportional to the number of elements (negligible overhead), at the cost of increased search and insertion time, which is logarithmic (O(log n)). A List shares a similar storage characteristic (depending on the implementation used) but has O(n) search time if it is unordered. (You can look up the insertion/deletion/search times of the various List implementations, ordered vs. unordered; they are very well documented.)
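
One caveat with TreeSet: it keeps elements sorted, so Data must either implement Comparable or be constructed with a Comparator. A sketch of the latter (SortedConfigs and BY_FIELDS are illustrative names; the field names come from the question):

import java.util.Comparator;
import java.util.Set;
import java.util.TreeSet;

class SortedConfigs {
    // A total ordering consistent with equals: compare config, then weight.
    static final Comparator<Data> BY_FIELDS =
            Comparator.comparingInt((Data d) -> d.config)
                      .thenComparingInt(d -> d.weight);

    static Set<Data> create() {
        return new TreeSet<>(BY_FIELDS);
    }
}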

I just want to note that when using a HashSet you're trading space efficiency for faster look-up time (O(1)). You have to allocate space for the hash table, which has to be bigger than the total number of elements in your collection. (Of course, there is the caveat that a horrid hash function can pile everything into a single bucket, which effectively puts you right back at the performance characteristics of an unordered list. ;)
