简体   繁体   中英

Best big set data structure in Java

I need to find gaps in a big Integer Set populated with a read loop through files and I want to know if exists something already done for this purpose to avoid a simple Set object with heap overflow risk.

To better explain my question I have to tell you how my ticketing java software works. Every ticket has a global progressive number stored in a daily log file with other informations. I have to write a check procedure to verify if there are number gaps inside daily log files.

The first idea was to create a read loop with all log files, read each line, get the ticket number and store it in a Integer TreeSet Object and then find gaps in this Set. The problem is that ticket number can be very high and could saturate the memory heap space and I want a good solution also if I have to switch to Long objects. The Set solution waste a lot of memory because if I find that there are no gap in the first 100 number has no sense to store them in the Set.

How can I solve? Can I use some datastructure already done for this purpose?

I'm assuming that (A) the gaps you are looking for are the exception and not the rule and (B) the log files you are processing are mostly sorted by ticket number (though some out-of-sequence entries are OK).

If so, then I'd think about rolling your own data structure for this. Here's a quick example of what I mean (with a lot left to the reader).

Basically what it does is implement Set but actually store it as a Map , with each entry representing a range of contiguous values in the set.

The add method is overridden to maintain the backing Map appropriately. Eg, if you add 5 to the set and already have a range containing 4, then it just extends that range instead of adding a new entry.

Note that the reason for the "mostly sorted" assumption is that, for totally unsorted data, this approach will still use a lot of memory: the backing map will grow large (as unsorted entries get added all over the place) before growing smaller (as additional entries fill in the gaps, allowing contiguous entries to be combined).

Here's the code:

package com.matt.tester;

import java.util.Collection;
import java.util.Comparator;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;



public class SE {


    public class RangeSet<T extends Long> implements SortedSet<T> {

        private final TreeMap<T, T> backingMap = new TreeMap<T,T>();

        @Override
        public int size() {
            // TODO Auto-generated method stub
            return 0;
        }

        @Override
        public boolean isEmpty() {
            // TODO Auto-generated method stub
            return false;
        }

        @Override
        public boolean contains(Object o) {
            if ( ! ( o instanceof Number ) ) {
                throw new IllegalArgumentException();
            }
            T n = (T) o;
            // Find the greatest backingSet entry less than n
            Map.Entry<T,T> floorEntry = backingMap.floorEntry(n);
            if ( floorEntry == null ) {
                return false;
            }
            final Long endOfRange = floorEntry.getValue();
            if ( endOfRange >= n) {
                return true;
            }
            return false;
        }

        @Override
        public Iterator<T> iterator() {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.  (You'd need a custom Iterator class, I think)");
        }

        @Override
        public Object[] toArray() {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public <T> T[] toArray(T[] a) {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public boolean add(T e) {
            if ( (Long) e < 1L ) {
                throw new IllegalArgumentException("This example only supports counting numbers, mainly because it simplifies printGaps() later on");
            }
            if ( this.contains(e) ) {
                // Do nothing.  Already in set.
            }
            final Long previousEntryKey;
            final T eMinusOne = (T) (Long) (e-1L); 
            final T nextEntryKey = (T) (Long) (e+1L); 
            if ( this.contains(eMinusOne ) ) {
                // Find the greatest backingSet entry less than e
                Map.Entry<T,T> floorEntry = backingMap.floorEntry(e);
                final T startOfPrecedingRange;
                startOfPrecedingRange = floorEntry.getKey();
                if ( this.contains(nextEntryKey) ) {
                    // This addition will join two previously separated ranges
                    T endOfRange = backingMap.get(nextEntryKey);
                    backingMap.remove(nextEntryKey);
                    // Extend the prior entry to include the whole range
                    backingMap.put(startOfPrecedingRange, endOfRange);
                    return true;
                } else {
                    // This addition will extend the range immediately preceding
                    backingMap.put(startOfPrecedingRange,  e);
                    return true;
                }
            } else if ( this.backingMap.containsKey(nextEntryKey) ) {
                // This addition will extend the range immediately following
                T endOfRange = backingMap.get(nextEntryKey);
                backingMap.remove(nextEntryKey);
                // Extend the prior entry to include the whole range
                backingMap.put(e, endOfRange);
                return true;
            } else {
                // This addition is a new range, it doesn't touch any others
                backingMap.put(e,e);
                return true;
            }
        }

        @Override
        public boolean remove(Object o) {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public boolean containsAll(Collection<?> c) {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public boolean addAll(Collection<? extends T> c) {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public boolean retainAll(Collection<?> c) {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public boolean removeAll(Collection<?> c) {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public void clear() {
            this.backingMap.clear();
        }

        @Override
        public Comparator<? super T> comparator() {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public SortedSet<T> subSet(T fromElement, T toElement) {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public SortedSet<T> headSet(T toElement) {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public SortedSet<T> tailSet(T fromElement) {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public T first() {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        @Override
        public T last() {
            throw new IllegalAccessError("Method not implemented.  Left for the reader.");
        }

        public void printGaps() {
            Long lastContiguousNumber = 0L;
            for ( Map.Entry<T, T> entry : backingMap.entrySet() ) {
                Long startOfNextRange = (Long) entry.getKey();
                Long endOfNextRange = (Long) entry.getValue();
                if ( startOfNextRange > lastContiguousNumber + 1 ) {
                    System.out.println( String.valueOf(lastContiguousNumber+1) + ".." + String.valueOf(startOfNextRange - 1) );
                }
                lastContiguousNumber = endOfNextRange;
            }
            System.out.println( String.valueOf(lastContiguousNumber+1) + "..infinity");
            System.out.println("Backing map size is " + this.backingMap.size());
            System.out.println(backingMap.toString());
        }




    }


    public static void main(String[] args) {

        SE se = new SE();

        RangeSet<Long> testRangeSet = se.new RangeSet<Long>();

        // Start by putting 1,000,000 entries into the map with a few, pre-determined, hardcoded gaps
        for ( long i = 1; i <= 1000000; i++ ) {
            // Our pre-defined gaps...
            if ( i == 58349 || ( i >= 87333 && i <= 87777 ) || i == 303998 ) {
                // Do not put these numbers in the set
            } else {
                testRangeSet.add(i);
            }
        }

        testRangeSet.printGaps();

    }
}

And the output is:

58349..58349
87333..87777
303998..303998
1000001..infinity
Backing map size is 4
{1=58348, 58350=87332, 87778=303997, 303999=1000000}

Well either you store everything in memory, and you risk overflowing the heap, or you don't store it in memory and you need to do a lot of computing.

I would suggest something in between - store the minimum needed information needed during processing. You could store the endpoints of the known non-gap sequence in a class with two Long fields. And all these sequence datatypes could be stored in a sorted list. When you find a new number, iterate through the list to see if it is adjacent to one of the endpoints. If so, change the endpoint to the new integer, and check if you can merge the adjacent sequence-objects (and hence remove one of the objects). If not, create a new sequence object in the properly sorted place.

This will end up being O(n) in memory usage and O(n) in cpu usage. But using any data structure which stores information about all numbers will simply be n in memory usage, and O(n*lookuptime) in cpu if lookuptime is not done in constant time.

I believe it's a perfect moment to get familiar with bloom-filter . It's a wonderful probabilistic data-structure which can be used for immediate proof that an element isn't in the set.

How does it work? The idea is pretty simple, the boost more complicated and the implementation can be found in Guava .

The idea

Initialize a filter which will be an array of bits of length which would allow you to store maximum value of used hash function . When adding element to the set, calculate it's hash. Determinate what bit's are 1 s and assure, that all of them are switched to 1 in the filter (array). When you want to check if an element is in the set, simply calculate it's hash and then check if all bits that are 1 s in the hash, are 1 s in the filter. If any of those bits is a 0 in the filter, the element definitely isn't in the set. If all of them are set to 1 , the element might be in the filter so you have to loop through all of the elements. The Boost

Simple probabilistic model provides the answer on how big should the filter (and the range of hash function) be to provide optimal chance for false positive which is the situation, that all bits are 1 s but the element isn't in the set.

Implementation

The Guava implementation provides the following constructor to the bloom-filter : create(Funnel funnel, int expectedInsertions, double falsePositiveProbability) . You can configure the filter on your own depending on the expectedInsertions and falsePositiveProbability .

False positive

Some people are aware of bloom-filters because of false-positive possibility. Bloom filter can be used in a way that don't rely on mightBeInFilter flag. If it might be, you should loop through all the elements and check one by one if the element is in the set or not.

Possible usage In your case, I'd create the filter for the set, then after all tickets are added simply loop through all the numbers (as you have to loop anyway) and check if they filter#mightBe int the set. If you set falsePositiveProbability to 3%, you'll achieve complexity around O(n^2-0.03m*n) where m stands for the number of gaps. Correct me if I'm wrong with the complexity estimation.

Read as many ticket numbers as you can fit into available memory.

Sort them, and write the sorted list to a temporary file. Depending on the expected number of gaps, it might save time and space to use a run-length–encoding scheme when writing the sorted numbers.

After all the ticket numbers have been sorted into temporary files, you can merge them into a single, sorted stream of ticket numbers, looking for gaps.

If this would result in too many temporary files to open at once for merging, groups of files can be merged into intermediate files, and so on, maintaining the total number below a workable limit. However, this extra copying can slow the process significantly.

The old tape-drive algorithms are still relevant.

Here is an idea: if you know in advance the range of your numbers, then

pre-calculate the sum of all the numbers that you expect to be there. 2. Then keep reading your numbers and produce the sum of all read numbers as well as the number of your numbers. 3. If the sum you come up with is the same as pre-calculated one, then there are no gaps. 4. If the sum is different and the number of your numbers is short just by one of the expected number then pre-calculated sum - actual sum will give you your missing number. 5. If the number of your numbers is short by more then one, then you will know how many numbers are missing and what their sum is.

The best part is that you will not need to store the collection of your numbers in memory.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM