
Best big set data structure in Java

I need to find gaps in a large set of integers that is populated by a read loop over files, and I want to know whether something already exists for this purpose, so that I can avoid a plain Set object and its risk of exhausting the heap.

To explain my question better, I have to describe how my Java ticketing software works. Every ticket has a global progressive number, stored in a daily log file along with other information. I have to write a check procedure that verifies whether there are gaps in the numbering inside the daily log files.

My first idea was to loop over all the log files, read each line, extract the ticket number, store it in a TreeSet of Integer objects, and then look for gaps in that Set. The problem is that ticket numbers can get very high and could fill the heap, and I want a solution that still works well if I have to switch to Long objects. The Set approach wastes a lot of memory: once I know there is no gap in the first 100 numbers, there is no point in keeping them in the Set.

How can I solve this? Is there an existing data structure designed for this purpose?

I'm assuming that (A) the gaps you are looking for are the exception and not the rule and (B) the log files you are processing are mostly sorted by ticket number (though some out-of-sequence entries are OK).

If so, then I'd think about rolling your own data structure for this. Here's a quick example of what I mean (with a lot left to the reader).

Basically, it implements Set but actually stores its contents as a Map, with each entry representing a range of contiguous values in the set: the key is the start of the range and the value is its end.

The add method is overridden to maintain the backing Map appropriately. E.g., if you add 5 to the set and already have a range containing 4, it just extends that range instead of adding a new entry.

Note that the reason for the "mostly sorted" assumption is that, for totally unsorted data, this approach will still use a lot of memory: the backing map will grow large (as unsorted entries get added all over the place) before growing smaller (as additional entries fill in the gaps, allowing contiguous entries to be combined).

Here's the code:

package com.matt.tester;

import java.util.Collection;
import java.util.Comparator;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;



public class SE {


    public class RangeSet<T extends Long> implements SortedSet<T> {

        private final TreeMap<T, T> backingMap = new TreeMap<T,T>();

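        @Override
        public int size() {
            // Sum the lengths of all ranges, capped at Integer.MAX_VALUE as the Collection contract requires
            long total = 0;
            for ( Map.Entry<T,T> entry : backingMap.entrySet() ) {
                final Long start = entry.getKey();
                final Long end = entry.getValue();
                total += end - start + 1;
            }
            return (int) Math.min(total, Integer.MAX_VALUE);
        }

        @Override
        public boolean isEmpty() {
            return backingMap.isEmpty();
        }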

        @Override
        public boolean contains(Object o) {
            if ( ! ( o instanceof Long ) ) {
                // Only Long values can ever be members of this set
                return false;
            }
            T n = (T) o;
            // Find the backing map entry with the greatest key less than or equal to n
            Map.Entry<T,T> floorEntry = backingMap.floorEntry(n);
            if ( floorEntry == null ) {
                return false;
            }
            final Long endOfRange = floorEntry.getValue();
            if ( endOfRange >= n) {
                return true;
            }
            return false;
        }

        @Override
        public Iterator<T> iterator() {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.  (You'd need a custom Iterator class, I think)");
        }

        @Override
        public Object[] toArray() {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public <E> E[] toArray(E[] a) {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public boolean add(T e) {
            if ( (Long) e < 1L ) {
                throw new IllegalArgumentException("This example only supports counting numbers, mainly because it simplifies printGaps() later on");
            }
            if ( this.contains(e) ) {
                // Already in set.  Return early; the merging logic below assumes e is not yet a member.
                return false;
            }
            final T eMinusOne = (T) (Long) (e-1L); 
            final T nextEntryKey = (T) (Long) (e+1L); 
            if ( this.contains(eMinusOne ) ) {
                // Find the greatest backingSet entry less than e
                Map.Entry<T,T> floorEntry = backingMap.floorEntry(e);
                final T startOfPrecedingRange;
                startOfPrecedingRange = floorEntry.getKey();
                if ( this.contains(nextEntryKey) ) {
                    // This addition will join two previously separated ranges
                    T endOfRange = backingMap.get(nextEntryKey);
                    backingMap.remove(nextEntryKey);
                    // Extend the prior entry to include the whole range
                    backingMap.put(startOfPrecedingRange, endOfRange);
                    return true;
                } else {
                    // This addition will extend the range immediately preceding
                    backingMap.put(startOfPrecedingRange,  e);
                    return true;
                }
            } else if ( this.backingMap.containsKey(nextEntryKey) ) {
                // This addition will extend the range immediately following
                T endOfRange = backingMap.get(nextEntryKey);
                backingMap.remove(nextEntryKey);
                // Re-key the following range so that it now starts at e
                backingMap.put(e, endOfRange);
                return true;
            } else {
                // This addition is a new range, it doesn't touch any others
                backingMap.put(e,e);
                return true;
            }
        }

        @Override
        public boolean remove(Object o) {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public boolean containsAll(Collection<?> c) {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public boolean addAll(Collection<? extends T> c) {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public boolean retainAll(Collection<?> c) {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public boolean removeAll(Collection<?> c) {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public void clear() {
            this.backingMap.clear();
        }

        @Override
        public Comparator<? super T> comparator() {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public SortedSet<T> subSet(T fromElement, T toElement) {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public SortedSet<T> headSet(T toElement) {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public SortedSet<T> tailSet(T fromElement) {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public T first() {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        @Override
        public T last() {
            throw new UnsupportedOperationException("Method not implemented.  Left for the reader.");
        }

        public void printGaps() {
            Long lastContiguousNumber = 0L;
            for ( Map.Entry<T, T> entry : backingMap.entrySet() ) {
                Long startOfNextRange = (Long) entry.getKey();
                Long endOfNextRange = (Long) entry.getValue();
                if ( startOfNextRange > lastContiguousNumber + 1 ) {
                    System.out.println( String.valueOf(lastContiguousNumber+1) + ".." + String.valueOf(startOfNextRange - 1) );
                }
                lastContiguousNumber = endOfNextRange;
            }
            System.out.println( String.valueOf(lastContiguousNumber+1) + "..infinity");
            System.out.println("Backing map size is " + this.backingMap.size());
            System.out.println(backingMap.toString());
        }




    }


    public static void main(String[] args) {

        SE se = new SE();

        RangeSet<Long> testRangeSet = se.new RangeSet<Long>();

        // Start by putting 1,000,000 entries into the map with a few, pre-determined, hardcoded gaps
        for ( long i = 1; i <= 1000000; i++ ) {
            // Our pre-defined gaps...
            if ( i == 58349 || ( i >= 87333 && i <= 87777 ) || i == 303998 ) {
                // Do not put these numbers in the set
            } else {
                testRangeSet.add(i);
            }
        }

        testRangeSet.printGaps();

    }
}

And the output is:

58349..58349
87333..87777
303998..303998
1000001..infinity
Backing map size is 4
{1=58348, 58350=87332, 87778=303997, 303999=1000000}

Well, either you store everything in memory and risk overflowing the heap, or you don't store it in memory and need to do a lot of computing.

I would suggest something in between: store only the minimum information needed during processing. You could store the endpoints of each known gap-free sequence in a class with two Long fields, and keep all these sequence objects in a sorted list. When you find a new number, iterate through the list to see whether it is adjacent to one of the endpoints. If so, change that endpoint to the new number and check whether you can now merge the adjacent sequence objects (and hence remove one of them). If not, create a new sequence object at the properly sorted position.

This will end up being O(n) in memory usage and O(n) in CPU usage. But any data structure that stores information about every number will simply use n in memory, and O(n*lookuptime) in CPU if lookups are not done in constant time.
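The sorted list of sequence objects described above could be sketched roughly as follows. This is only an illustration under assumptions: the names RangeList, Range, add and gaps are made up for this example, and ticket numbers are assumed to stay well inside the range of long so that n - 1 and n + 1 do not overflow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a sorted list of gap-free sequences, each stored as two endpoints.
class RangeList {
    static final class Range {
        long start;
        long end;
        Range(long start, long end) { this.start = start; this.end = end; }
    }

    private final List<Range> ranges = new ArrayList<>();

    void add(long n) {
        // Linear scan, as in the answer: skip every range that ends before n - 1.
        int i = 0;
        while (i < ranges.size() && ranges.get(i).end < n - 1) {
            i++;
        }
        if (i == ranges.size()) {
            ranges.add(new Range(n, n));      // n lies beyond all known ranges
            return;
        }
        Range r = ranges.get(i);
        if (r.start <= n && n <= r.end) {
            return;                           // already covered
        }
        if (r.end == n - 1) {
            r.end = n;                        // extend this range upwards
            if (i + 1 < ranges.size() && ranges.get(i + 1).start == n + 1) {
                r.end = ranges.get(i + 1).end; // n closed the gap: merge the two ranges
                ranges.remove(i + 1);
            }
        } else if (r.start == n + 1) {
            r.start = n;                      // extend this range downwards
        } else {
            ranges.add(i, new Range(n, n));   // isolated number: new range in sorted position
        }
    }

    // Each gap is returned as {firstMissing, lastMissing}.
    List<long[]> gaps() {
        List<long[]> result = new ArrayList<>();
        for (int i = 1; i < ranges.size(); i++) {
            result.add(new long[] { ranges.get(i - 1).end + 1, ranges.get(i).start - 1 });
        }
        return result;
    }
}
```

With mostly sorted input the list stays short, so the linear scan is cheap; for heavily unsorted input a binary search or a TreeMap keyed by range start would probably be a better fit.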

I believe it's a perfect moment to get familiar with the Bloom filter. It's a wonderful probabilistic data structure which can be used for immediate proof that an element isn't in the set.

How does it work? The idea is pretty simple, the boost is more complicated, and an implementation can be found in Guava.

The idea

Initialize a filter, which is an array of bits long enough to hold the maximum value of the hash function used. When adding an element to the set, calculate its hash. Determine which bits of the hash are 1s and make sure that all of them are switched to 1 in the filter (array). When you want to check whether an element is in the set, simply calculate its hash and check whether all bits that are 1s in the hash are also 1s in the filter. If any of those bits is 0 in the filter, the element definitely isn't in the set. If all of them are set to 1, the element might be in the set, so you have to loop through all of the elements.

The Boost

A simple probabilistic model tells you how big the filter (and the range of the hash function) should be to minimize the chance of a false positive, which is the situation where all bits are 1s but the element isn't in the set.

Implementation

The Guava implementation provides the following static factory method for the Bloom filter: create(Funnel funnel, int expectedInsertions, double falsePositiveProbability). You can configure the filter yourself through expectedInsertions and falsePositiveProbability.

False positive

Some people are wary of Bloom filters because of the possibility of false positives. A Bloom filter can be used in a way that doesn't rely on the mightBeInFilter flag alone: if the element might be in the set, you should loop through all the elements and check one by one whether it really is.

Possible usage

In your case, I'd create the filter for the set, then, after all tickets are added, simply loop through all the numbers (as you have to loop anyway) and check whether filter#mightBe reports them as possibly in the set (in Guava the method is called mightContain). If you set falsePositiveProbability to 3%, you'll achieve complexity around O(n^2-0.03m*n), where m stands for the number of gaps. Correct me if I'm wrong with the complexity estimation.
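To make the usage above a bit more concrete, here is a toy filter built on java.util.BitSet. It is not Guava's implementation, and every name in it (ToyBloomFilter, put, mightContain, definiteGaps) as well as the hash mixing are assumptions chosen for this sketch. Unlike the simplified description above, it derives several bit positions per element, which is how Bloom filters are usually built.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical toy Bloom filter for long values; for illustration only.
class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Simple 64-bit mixing function used as the base hash (an arbitrary choice for this sketch).
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int index(long value, int i) {
        long h1 = mix(value);
        long h2 = mix(value ^ 0x9E3779B97F4A7C15L) | 1L;
        return (int) Math.floorMod(h1 + i * h2, (long) numBits);
    }

    void put(long value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    // false means "definitely not added"; true means "possibly added".
    boolean mightContain(long value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    // Numbers in [from, to] that were certainly never added. Real gaps can be
    // missed here because of false positives, so this list is a lower bound.
    static List<Long> definiteGaps(ToyBloomFilter filter, long from, long to) {
        List<Long> gaps = new ArrayList<>();
        for (long n = from; n <= to; n++) {
            if (!filter.mightContain(n)) {
                gaps.add(n);
            }
        }
        return gaps;
    }
}
```

Every number reported by definiteGaps really is missing, but a missing number can be hidden by a false positive. For a check that must not miss any gap, the "might be in the set" answers still need an exact second pass, as the False positive section says.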

Read as many ticket numbers as you can fit into available memory.

Sort them, and write the sorted list to a temporary file. Depending on the expected number of gaps, it might save time and space to use a run-length encoding scheme when writing the sorted numbers.

After all the ticket numbers have been sorted into temporary files, you can merge them into a single, sorted stream of ticket numbers, looking for gaps.

If this would result in too many temporary files to open at once for merging, groups of files can be merged into intermediate files, and so on, keeping the number of open files below a workable limit. However, this extra copying can slow the process significantly.

The old tape-drive algorithms are still relevant.
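The merge step could look roughly like the sketch below. To keep it self-contained, each sorted temporary file is represented by an Iterator<Long>; in practice each iterator would read from one file. The names MergeGapFinder, Cursor and findGaps are invented for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: k-way merge of sorted runs, reporting gaps while merging.
class MergeGapFinder {
    // One cursor per sorted run: the run's current smallest value plus the rest of the run.
    private static final class Cursor {
        long head;
        final Iterator<Long> rest;
        Cursor(long head, Iterator<Long> rest) { this.head = head; this.rest = rest; }
    }

    // Each gap is returned as {firstMissing, lastMissing}. Only gaps between the
    // smallest and the largest number seen are reported.
    static List<long[]> findGaps(List<Iterator<Long>> sortedRuns) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.head, b.head));
        for (Iterator<Long> run : sortedRuns) {
            if (run.hasNext()) {
                heap.add(new Cursor(run.next(), run));
            }
        }
        List<long[]> gaps = new ArrayList<>();
        boolean seenAny = false;
        long previous = 0;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();           // run holding the globally smallest value
            long current = c.head;
            if (seenAny && current > previous + 1) {
                gaps.add(new long[] { previous + 1, current - 1 });
            }
            previous = current;
            seenAny = true;
            if (c.rest.hasNext()) {           // advance that run and put it back
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        return gaps;
    }
}
```

Only one number per run is held in memory at any time, so memory use depends on the number of temporary files rather than on the number of tickets.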

Here is an idea: if you know the range of your numbers in advance, then:

1. Pre-calculate the sum of all the numbers that you expect to be there.
2. Then keep reading your numbers and produce the sum of all read numbers as well as the count of your numbers.
3. If the sum you come up with is the same as the pre-calculated one, then there are no gaps.
4. If the sum is different and the count of your numbers is short of the expected count by just one, then pre-calculated sum - actual sum will give you your missing number.
5. If the count of your numbers is short by more than one, then you will know how many numbers are missing and what their sum is.

The best part is that you will not need to store the collection of your numbers in memory.
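As a rough sketch of the steps above: the code below assumes that every number read lies inside the known range, that no number appears twice, and that the sums fit into a long (for very large ranges BigInteger would be safer). The names SumGapCheck and check are made up for this example.

```java
// Hypothetical sketch of the sum-based check; see the assumptions stated above.
class SumGapCheck {
    // Returns {missingCount, sumOfMissingNumbers} for the expected range [first, last].
    static long[] check(long first, long last, Iterable<Long> tickets) {
        long expectedCount = last - first + 1;
        // Arithmetic series: count * (first + last) / 2. Halve whichever factor
        // is even before multiplying, to delay overflow.
        long expectedSum = (expectedCount % 2 == 0)
                ? (expectedCount / 2) * (first + last)
                : expectedCount * ((first + last) / 2);

        long actualCount = 0;
        long actualSum = 0;
        for (long ticket : tickets) {
            actualCount++;
            actualSum += ticket;
        }
        return new long[] { expectedCount - actualCount, expectedSum - actualSum };
    }
}
```

If the first element of the result is 0 there are no gaps; if it is 1, the second element is the missing number itself. With more than one missing number you only learn how many are missing and what they add up to, so locating them still needs one of the other approaches.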
