
Data Structures & Techniques for operating on large data volumes (1 mln. recs and more)

A WPF .NET 4.5 app that I have been developing, initially to work on small data volumes, now works on much larger data volumes in the region of 1 million records and more, and of course I started running out of memory. The data comes from an MS SQL database and needs to be loaded into a local data structure for processing; because this data is then transformed/processed/referenced by code in the CLR, continuous and uninterrupted data access is required. However, not all of the data has to be loaded into memory straight away, only when it is actually accessed. As a small example, an Inverse Distance Interpolator uses this data to produce interpolated maps, and all data needs to be passed to it for continuous grid generation.

I have re-written some parts of the app for processing data, such as loading only x rows at any given time and implementing a sliding-window approach to data processing, which works. However, doing this for the rest of the app will require some time investment, and I wonder whether there is a more robust and standard way of approaching this design problem (there has to be; I am not the first one).
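The sliding-window part looks roughly like this (a simplified sketch with illustrative names, not my actual code; in the real app the source would be a streaming reader over the SQL result set):

```csharp
using System;
using System.Collections.Generic;

static class Windowing
{
    // Only `windowSize` items are materialized at any given time; once a
    // window has been processed, its list can be collected by the GC.
    public static IEnumerable<List<T>> InWindows<T>(IEnumerable<T> source, int windowSize)
    {
        var window = new List<T>(windowSize);
        foreach (var item in source)
        {
            window.Add(item);
            if (window.Count == windowSize)
            {
                yield return window;
                window = new List<T>(windowSize); // let the processed window be collected
            }
        }
        if (window.Count > 0)
            yield return window;                  // final partial window
    }
}
```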

tl;dr: Does C# provide any data structures or techniques for accessing large amounts of data in an uninterrupted manner, so that it behaves like an IEnumerable but the data is not in memory until it is actually accessed or required, or is it completely up to me to manage memory usage? My ideal would be a structure that automatically implements a buffer-like mechanism, loading more data as that data is accessed and freeing memory from data that has been accessed and is no longer of interest. Like some DataTable with an internal buffer, maybe?

As far as iterating through a very large data set that is too large to fit in memory goes, you can use a producer-consumer model. I used something like this when I was working with a custom data set that contained billions of records, about 2 terabytes of data in total.

The idea is to have a single class that contains both producer and consumer. When you create a new instance of the class, it spins up a producer thread that fills a bounded concurrent queue, and that thread keeps the queue full. The consumer part is the API that lets you get the next record.

You start with a shared concurrent queue. I like the .NET BlockingCollection for this.

Here's an example that reads a text file and maintains a queue of 10,000 text lines.

using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public class TextFileLineBuffer
{
    private const int QueueSize = 10000;
    private BlockingCollection<string> _buffer = new BlockingCollection<string>(QueueSize);
    private CancellationTokenSource _cancelToken;
    private StreamReader _reader;

    public TextFileLineBuffer(string filename)
    {
        // File is opened here so that any exception is thrown on the calling thread. 
        _reader = new StreamReader(filename);
        _cancelToken = new CancellationTokenSource();
        // start task that reads the file
        Task.Factory.StartNew(ProcessFile, TaskCreationOptions.LongRunning);
    }

    public string GetNextLine()
    {
        if (_buffer.IsCompleted)
        {
            // The buffer is empty because the file has been read
            // and all lines returned.
            // You can either call this an error and throw an exception,
            // or you can return null.
            return null;
        }

        // If there is a record in the buffer, it is returned immediately.
        // Otherwise, Take does a non-busy wait.

        // You might want to catch the OperationCanceledException here and return null
        // rather than letting the exception escape.

        return _buffer.Take(_cancelToken.Token);
    }

    private void ProcessFile()
    {
        while (!_reader.EndOfStream && !_cancelToken.Token.IsCancellationRequested)
        {
            var line = _reader.ReadLine();
            try
            {
                // This will block if the buffer already contains QueueSize records.
                // As soon as a space becomes available, this will add the record
                // to the buffer.
                _buffer.Add(line, _cancelToken.Token);
            }
            catch (OperationCanceledException)
            {
                // Cancelled: the while condition will see the request and exit.
            }
        }
        _buffer.CompleteAdding();
    }

    public void Cancel()
    {
        _cancelToken.Cancel();
    }
}

That's the bare bones of it. You'll want to add a Dispose method that makes sure the thread is terminated and the file is closed.
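One possible shape for that Dispose method (a sketch only; it assumes the class declares IDisposable and that the constructor stores the task returned by Task.Factory.StartNew in a _producerTask field, which the code above does not yet do):

```csharp
public void Dispose()
{
    _cancelToken.Cancel();   // unblocks a producer stuck in Add()
    _producerTask.Wait();    // let ProcessFile exit before touching the reader
    _reader.Dispose();
    _buffer.Dispose();
    _cancelToken.Dispose();
}
```

Waiting on the producer task before disposing the reader avoids a race where the file is closed while ProcessFile is still reading from it.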

I've used this basic approach to good effect in many different programs. You'll have to do some analysis and testing to determine the optimum buffer size for your application. You want something large enough to keep up with the normal data flow and also handle bursts of activity, but not so large that it exceeds your memory budget.

IEnumerable modifications

If you want to support IEnumerable<T>, you have to make some minor modifications. I'll extend my example to support IEnumerable<string>.

First, you have to change the class declaration:

public class TextFileLineBuffer: IEnumerable<string>

Then, you have to implement GetEnumerator:

public IEnumerator<String> GetEnumerator()
{
    foreach (var s in _buffer.GetConsumingEnumerable())
    {
        yield return s;
    }
}

IEnumerator IEnumerable.GetEnumerator()
{
    return GetEnumerator();
}

With that, you can initialize the thing and then pass it to any code that expects an IEnumerable<string>. So it becomes:

var items = new TextFileLineBuffer(filename);
DoSomething(items);

void DoSomething(IEnumerable<string> list)
{
    foreach (var s in list)
        Console.WriteLine(s);
}

@Sergey The producer-consumer model (proposed by Jim Mischel) is probably your safest solution for complete scalability.

However, if you want to increase the room for the elephant (to use your visual metaphor, which fits very well), then compression on the fly is a viable option: decompress when used and discard after use, leaving the core data structure compressed in memory. Obviously it depends on the data and how much it lends itself to compression, but there is a hell of a lot of room in most data structures. If you have ON and OFF flags for some metadata, these can be buried in the unused bits of 16/32-bit numbers, or at least held in bits rather than bytes; use 16-bit integers for lats/longs with a constant scaling factor to convert each to a real number before use; strings can be compressed using zip-type libraries, or indexed so that only ONE copy is held and no duplicates exist in memory, and so on...
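The 16-bit lat/long idea can be sketched like this (names are illustrative): one constant scale factor maps the full short range onto [-90, +90] degrees (use 180.0 for longitudes), halving the storage of a double-based field four times over, with a worst-case rounding error of about 0.0014 degrees:

```csharp
using System;

static class LatPacker
{
    // Constant scale factor: one short "count" is ~0.0027 degrees, so the
    // worst-case rounding error is ~0.0014 degrees (~150 m on the ground).
    const double LatScale = 90.0 / short.MaxValue;

    public static short Pack(double lat)
    {
        return (short)Math.Round(lat / LatScale);
    }

    public static double Unpack(short packed)
    {
        return packed * LatScale;
    }
}
```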

Decompression on the fly (albeit custom-made) can be lightning fast.

This whole process can be very laborious, I admit, but it can definitely keep the room large enough as the elephant grows, in some instances. (Of course, it may never be good enough if the data is simply growing indefinitely.)

EDIT: Re any sources... Hi @Sergey, I wish I could!! Truly! I have used this technique for data compression, and really the whole thing was custom-designed on a whiteboard with one or two coders involved.
It's certainly not (all) rocket science, but it's good to fully scope out the nature of all the data; then you know (for example) that a certain figure will never exceed 9999, so you can choose how to store it in a minimum number of bits and then allocate the leftover bits (assuming 32-bit storage) to other values. (A real-world example is the number of fingers a person has... loosely speaking, you could set an upper limit at 8 or 10, although 12 is possible and even 20 is remotely feasible if they have extra fingers. You see what I mean.) Lats/longs are the PERFECT example of numbers that will never cross logical boundaries (unless you use wrap-around values...). That is, they are always between -90 and +90 (just guessing which type of lat/long), which is very easy to reduce/convert, as the range of values is so neat.
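For instance, a value that is known never to exceed 9999 needs only 14 bits, which leaves the top two bits of a 16-bit ushort free for two ON/OFF flags (a sketch, with hypothetical names):

```csharp
static class FlagPacker
{
    // The value occupies the low 14 bits (0..16383, so 9999 fits);
    // the two top bits carry two boolean flags.
    public static ushort Pack(int value, bool flagA, bool flagB)
    {
        ushort packed = (ushort)(value & 0x3FFF);
        if (flagA) packed |= 0x8000;
        if (flagB) packed |= 0x4000;
        return packed;
    }

    public static int Value(ushort packed)  { return packed & 0x3FFF; }
    public static bool FlagA(ushort packed) { return (packed & 0x8000) != 0; }
    public static bool FlagB(ushort packed) { return (packed & 0x4000) != 0; }
}
```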

So we did not rely 'directly' on any third-party literature, only on algorithms designed for specific types of data.

In other projects, for fast real-time DSP (processing), the smarter coders (experienced game programmers) would convert floats to 16-bit ints, with a global scaling factor calculated to give maximum precision for the particular data stream (accelerometers, LVDTs, pressure gauges, etc.) you are collecting.
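A sketch of that global-scaling idea (illustrative names; in practice the peak might be known per sensor rather than scanned from the data): scan the stream once for its peak magnitude, map every float onto the full short range, and keep the single scale factor alongside the data so values can be reconstructed as data[i] * scale:

```csharp
using System;
using System.Linq;

static class StreamQuantizer
{
    // Halves the storage of a float stream; precision is peak/32767 per count.
    public static short[] Quantize(float[] samples, out float scale)
    {
        float peak = samples.Length == 0 ? 0f : samples.Max(s => Math.Abs(s));
        scale = peak == 0f ? 1f : peak / short.MaxValue;   // units per count
        float s2 = scale;
        return Array.ConvertAll(samples, s => (short)Math.Round(s / s2));
    }
}
```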

This reduced the transmitted AND stored data without losing ANY information. Similarly, for real-time wave/signal data you could use the (Fast) Fourier Transform to turn your noisy wave into its amplitude, phase and spectrum components, literally half of the data values, without actually losing any (significant) data. (Within these algorithms, the data 'loss' is completely measurable, so you can decide whether you are in fact losing data.)

Similarly, there are algorithms like Rainflow Analysis (nothing to do with rain; more about cycles and frequency) which reduce your data a lot. Peak detection and vector analysis can be enough for some other signals, basically throwing out about 99% of the data... The list is endless, but the technique MUST be intimately suited to your data. And you may have many different types of data, each lending itself to a different 'reduction' technique. I'm sure you can google 'lossless data reduction' (although I think the term lossless was coined by music processing and is a little misleading, since digital music has already lost the upper and lower frequency ranges... I digress). Please post what you find (if, of course, you have the time/inclination to research this further).

I would be interested to discuss your metadata; perhaps a large chunk can be 'reduced' quite elegantly...
