Data structure for set of (non-disjoint) sets

I'm looking for a data structure that roughly corresponds to (in Java terms) Map<Set<int>, double> . Essentially a set of sets of labeled marbles, where each set of marbles is associated with a scalar. I want it to be able to efficiently handle the following operations:

  • Add a given integer to every set.
  • Remove every set that contains (or does not contain) a given integer, or at least set the associated double to 0.
  • Union two of the maps, adding together the doubles for sets that appear in both.
  • Multiply all of the doubles by a given double.
  • Rarely, iterate over the entire map.

under the following conditions:

  • The integers will fall within a constrained range (between 1 and 10,000 or so); the exact range will be known at compile-time.
  • Most of the integers within the range (80-90%) will never be used, but which ones go unused will not be easy to determine until the end of the calculation.
    • The number of integers used will almost always still be over 100.
  • Many of the sets will be very similar, differing only by a few elements.
  • It may be possible to identify certain groups of integers that frequently appear only in sequential order: for example, if a set contains the integers 27 and 29 then it (almost?) certainly contains 28 as well.
    • It may be possible to identify these groups prior to running the calculation.
    • These groups would typically have 100 or so integers.

I've considered tries, but I don't see a good way to handle the "remove every set that contains a given integer" operation.

The purpose of this data structure would be to represent discrete random variables and permit addition, multiplication, and scalar multiplication operations on them. Each of these discrete random variables would ultimately have been created by applying these operations to a fixed (at compile-time) set of independent Bernoulli random variables (i.e. each takes the value 1 or 0 with some probability).

The systems being modeled are close to being representable as time-inhomogeneous Markov chains (which would of course simplify this immensely) but, unfortunately, it is essential to track the duration since various transitions.

Here's a data structure that can do all of your operations pretty efficiently.

I'm going to refer to it as a BitmapArray for this explanation.

For just the operations you've described, a sorted array with bitmaps as keys and weights (your doubles) as values will be pretty efficient.

The bitmaps are what maintain membership in your sets. Since you said the integers fall between 1 and 10,000, we can record membership for any set with a bitmap of length 10,000.
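
As a minimal sketch of that layout in Java (assuming java.util.BitSet for the bitmaps; all names here are illustrative, not a finished implementation):

    import java.util.ArrayList;
    import java.util.BitSet;

    // One entry of the BitmapArray: a bitmap recording membership, plus its weight.
    class Entry {
        final BitSet members = new BitSet(10000); // bit i is set <=> integer i is in this set
        double weight;                            // the double associated with this set
    }

    // The BitmapArray itself: entries kept sorted by their bitmaps (comparison described below).
    class BitmapArray {
        final ArrayList<Entry> entries = new ArrayList<>();
    }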

It's gonna be tough sorting an array where the keys can be as big as 2^10000, but you can be smart about implementing the comparison function in the following way (see the sketch after this list):

  • Iterate from left to right on the two bitmaps
  • XOR the bits at each index
  • Say you get a 1 at the i-th position
  • Whichever bitmap has a 1 at the i-th position is greater
  • If you never get a 1, they're equal
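
In Java, for example, that comparison could be sketched with java.util.BitSet: XOR a copy of one bitmap with the other and look at the first differing bit, treating lower indices as more significant as in the list above:

    import java.util.BitSet;

    class BitmapComparison {
        // Returns <0, 0, or >0, like a Comparator: the first differing bit decides the order.
        static int compare(BitSet a, BitSet b) {
            BitSet diff = (BitSet) a.clone();
            diff.xor(b);                    // 1s mark the positions where a and b differ
            int i = diff.nextSetBit(0);     // first difference (most significant, by this convention)
            if (i < 0) return 0;            // no differing bit: the bitmaps are equal
            return a.get(i) ? 1 : -1;       // whichever bitmap has the 1 there is greater
        }
    }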

I know this is still a slow comparison. But not too slow. Here's a benchmark fiddle I did on bitmaps with length 10,000. This is in JavaScript; if you're going to write it in Java, it's going to perform even better.

    function runTest() {
        var num = document.getElementById("txtValue").value;
        num = isNaN(num * 1) ? 0 : num * 1;

        /* For integers in the range 1-10,000 the worst case for the comparison is two
           equal integers, which forces it to iterate over the whole bit array. */
        var bitmap1 = convertToBitmap(10000, num);
        var bitmap2 = convertToBitmap(10000, num);

        // getMilliseconds() only returns the 0-999 ms component of the clock, so use Date.now() for timing.
        var before = Date.now();
        var result = firstIsGreater(bitmap1, bitmap2, 10000);
        var after = Date.now();
        alert(result + " in time: " + (after - before) + " ms");
    }

    function convertToBitmap(size, number) {
        // Collect the binary digits of the number, least significant first.
        var bits = [];
        var q = number;
        do {
            bits.push(q % 2);
            q = Math.floor(q / 2);
        } while (q > 0);

        // Build a zero-filled array of the requested size.
        var xbitArray = [];
        for (var i = 0; i < size; i++) {
            xbitArray.push(0);
        }

        // Copy the collected digits into the tail of the array.
        var j = xbitArray.length - 1;
        for (var i = bits.length - 1; i >= 0; i--) {
            xbitArray[j] = bits[i];
            j--;
        }
        return xbitArray;
    }

    function firstIsGreater(bitArray1, bitArray2, lengthOfArrays) {
        // The first index where the bits differ decides the order.
        for (var i = 0; i < lengthOfArrays; i++) {
            if (bitArray1[i] ^ bitArray2[i]) {
                return bitArray1[i] ? true : false;
            }
        }
        return false; // the bitmaps are equal
    }

    document.getElementById("btnTest").onclick = function (e) {
        runTest();
    };

Also, remember that you only have to do this sorting once, when building your BitmapArray (or while taking unions); after that it's going to be pretty efficient for the operations you'd do most often:

Note: N is the length of the BitmapArray.

Add an integer to every set: worst/best case O(N) time. Flip a 0 to a 1 in each bitmap.
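
Continuing the hypothetical Entry/ArrayList sketch from above, that flip might look like this (note that if two bitmaps become identical after the flip, their weights would presumably need to be merged, as with a union):

    // Sketch: add the integer `label` to every set by setting its bit in each bitmap.
    static void addToAll(ArrayList<Entry> entries, int label) {
        for (Entry e : entries) {
            e.members.set(label); // flips a 0 to a 1; a no-op if the bit was already set
        }
    }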

Remove every set that contains a given integer: worst case O(N) time.

  • For each bitmap, check the bit that represents the given integer; if it is 1, mark its index.
  • Compress the array by deleting all marked indices.

If you're okay with just setting the weights to 0, it'll be even more efficient. This also makes it very easy to remove every set that contains any element of a given set.
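
Both variants are easy to sketch with the same hypothetical Entry list: removeIf does the compaction, zeroing the weights just iterates, and BitSet.intersects covers the "any element of a given set" case.

    // Compacting variant: drop every entry whose bitmap has a 1 at `label`.
    static void removeAllContaining(ArrayList<Entry> entries, int label) {
        entries.removeIf(e -> e.members.get(label));
    }

    // Cheaper variant: just zero out the weights and leave the array untouched.
    static void zeroAllContaining(ArrayList<Entry> entries, int label) {
        for (Entry e : entries) {
            if (e.members.get(label)) {
                e.weight = 0.0;
            }
        }
    }

    // Zero out every set that shares any element with a given set of labels.
    static void zeroAllIntersecting(ArrayList<Entry> entries, BitSet forbidden) {
        for (Entry e : entries) {
            if (e.members.intersects(forbidden)) {
                e.weight = 0.0;
            }
        }
    }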

Union of two maps: worst case O(N1 + N2) time. Just like merging two sorted arrays, except you have to be smart about comparisons once more.
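
A sketch of that merge, using the compare() function from earlier and assuming both input lists are already sorted by their bitmaps:

    // Merge two sorted BitmapArrays; equal bitmaps get their weights added together.
    static ArrayList<Entry> union(ArrayList<Entry> a, ArrayList<Entry> b) {
        ArrayList<Entry> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = BitmapComparison.compare(a.get(i).members, b.get(j).members);
            if (cmp == 0) {
                // The same set appears in both maps: keep one bitmap, add the doubles.
                Entry merged = new Entry();
                merged.members.or(a.get(i).members);
                merged.weight = a.get(i).weight + b.get(j).weight;
                out.add(merged);
                i++;
                j++;
            } else if (cmp < 0) {
                out.add(a.get(i++));
            } else {
                out.add(b.get(j++));
            }
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }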

Multiply all of the doubles by a given double: worst/best case O(N) time. Iterate and multiply each value by the input double.

Iterate over the BitmapArray: worst/best case O(1) time per element.
