
How much data to hold in memory when looping through a large dataset

I am trying to create a trading simulator to test strategies over long periods of time.

I am using 1-minute data points. So if I were to run a simulation for, say, 10 years, that would be approx. 3,740,000 prices (Price class shown below). A simulation could be much longer than 10 years, but I am using this as an example.

     class Price
     {
          DateTime DatePrice;
          double Open;
          double High;
          double Low;
          double Close;
     }

My simulator works; however, I can't help feeling that the way I'm doing it isn't optimal.

Currently I grab a year's worth of prices from my SQL database, so approx. 374,400 prices. I do this because I don't want to use too much memory (this might be misguided, I have no idea).

Now, when looping through time, the code will also make use of the previous, let's say, 10 prices. So at 2:30am the code will look back at the prices from 2:20am onwards; all the prices before that are now redundant. So it seems somewhat wasteful to me to hold 374,400 prices in memory.

            time       Close
            00:00      102
            00:01      99
            00:02      100
            ...
            02:20      84
            02:21      88

So I have a loop that runs from my start date to my end date, checking at each step whether I need to download additional prices from the database.

   List<Price> PriceList = Database.GetPrices(/* first year's worth of prices */);
   int i = 0; // index of the current price within PriceList

   for (DateTime dtNow = dtStart; dtNow < dtEnd; dtNow = dtNow.AddMinutes(1))
   {
         // run some calculations, which don't take long

         // then check if i == PriceList.Count - 1
         // if so, get more prices from the database and reset i,
         // bearing in mind I need to keep the previous 10 prices
   }

What is the best solution for this kind of problem? Should I be getting prices from the database on another thread or something?

Let's do some math.

class Price
{
    DateTime DatePrice;
    double Open;
    double High;
    double Low;
    double Close;
}

has a size of 8 (DateTime) + 4 * 8 (double) = 40 bytes for the members alone. Since it is a reference type, each instance also needs a method table pointer and a sync block header, which add another 16 bytes on x64. Since you also need to keep a reference to the object (8 bytes on x64) somewhere, the total comes to 64 bytes per instance.

If you want a 10-year history with 3.7 million instances, you will need about 237 MB of memory, which is not much in today's world.
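The arithmetic above can be written down as a small helper, which makes it easy to try other time horizons. This is only a sketch: the class and member names are made up for illustration, and the constants are the x64 figures quoted above rather than values measured at run time.

```csharp
using System;

public static class ClassMemoryEstimate
{
    // Figures for a 64-bit process, as reasoned above (hypothetical helper).
    public const int FieldBytes = 8 + 4 * 8;   // DateTime + four doubles = 40
    public const int ObjectOverheadBytes = 16; // method table pointer + sync block header
    public const int ReferenceBytes = 8;       // the reference held in a list or array

    public const int BytesPerInstance = FieldBytes + ObjectOverheadBytes + ReferenceBytes;

    // Estimated total bytes for the given number of Price objects.
    public static long TotalBytes(long instances)
    {
        return instances * BytesPerInstance;
    }
}
```

For example, `ClassMemoryEstimate.TotalBytes(3_700_000)` gives 236,800,000 bytes, which is the roughly 237 MB mentioned above.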

You can shave off some overhead by switching from double to float, which needs only 4 bytes per value, and by using a struct instead of a class:

struct Price
{
    DateTime DatePrice;
    float Open;
    float High;
    float Low;
    float Close;
}

You will then need only 24 bytes per price, with no big loss of precision: a float still carries about 7 significant decimal digits, stock prices do not span a large value range, and you are interested in long-term trends or patterns rather than 0.000000x fractions.

With this struct, your 10-year time horizon will cost you only about 88 MB, and it will keep the garbage collector off your data, because an array of such structs is opaque to the GC (there are no reference types inside the struct).
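To check the 24-byte figure rather than take it on trust, something like the following could be used. It is a sketch that assumes .NET 5 or later, where `System.Runtime.CompilerServices.Unsafe.SizeOf<T>()` is available without an extra package; the `PriceStore` helper is a hypothetical name, and the fields are made public only so the example is usable.

```csharp
using System;
using System.Runtime.CompilerServices;

public struct Price
{
    public DateTime DatePrice; // 8 bytes
    public float Open;         // 4 bytes each
    public float High;
    public float Low;
    public float Close;
}

public static class PriceStore
{
    // Size of one Price value as laid out by the runtime.
    public static int BytesPerPrice
    {
        get { return Unsafe.SizeOf<Price>(); }
    }

    // Estimated payload of a Price[] with the given length (array header ignored).
    public static long EstimatedBytes(int count)
    {
        return (long)count * BytesPerPrice;
    }

    // One contiguous block of values: no per-element object header, no per-element reference.
    public static Price[] Allocate(int count)
    {
        return new Price[count];
    }
}
```

With 3.7 million entries, `PriceStore.EstimatedBytes(3_700_000)` comes to 88,800,000 bytes, in line with the 88 MB quoted above.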

That simple optimization should be good enough for time horizons spanning hundreds of years, even with today's computers and memory sizes. It would even fit into an x86 address space, but I would recommend running this on x64, because I suspect you will check not just one stock but several in parallel.

If I were you, I would keep the problem of caching (which seems to be your problem) separate from the functionality.

I don't know how you currently fetch your data from the DB. I am guessing you are using some logic similar to

DataAdapter.Fill(dataset);
List<Price> PriceList = dataset.Tables[0].SomeLinqQuery();

Instead of fetching all the prices at the same time, you can use something like the code below to fetch them incrementally and convert each fetched row into a Price object:

IDataReader rdr = command.ExecuteReader(); // command is an IDbCommand
while (rdr.Read())
{
    // convert the current row into a Price object
}
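A slightly fuller sketch of the incremental read follows. It only uses the `System.Data` interfaces, so it should work with any ADO.NET provider, but the column order (DatePrice, Open, High, Low, Close), the column types and the `PriceStream` name are assumptions made for illustration, not something taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public class Price
{
    public DateTime DatePrice;
    public double Open;
    public double High;
    public double Low;
    public double Close;
}

public static class PriceStream
{
    // Yields one Price per row; rows are pulled from the reader only as the caller iterates.
    // Assumes column 0 is a DateTime and columns 1 to 4 are doubles.
    public static IEnumerable<Price> Read(IDataReader reader)
    {
        while (reader.Read())
        {
            yield return new Price
            {
                DatePrice = reader.GetDateTime(0),
                Open = reader.GetDouble(1),
                High = reader.GetDouble(2),
                Low = reader.GetDouble(3),
                Close = reader.GetDouble(4)
            };
        }
    }

    // Runs the given SQL on an open connection and streams the result.
    public static IEnumerable<Price> Query(IDbConnection connection, string sql)
    {
        using (IDbCommand command = connection.CreateCommand())
        {
            command.CommandText = sql;
            using (IDataReader reader = command.ExecuteReader())
            {
                foreach (Price price in Read(reader))
                {
                    yield return price;
                }
            }
        }
    }
}
```

Because both methods are iterators, nothing is read until the simulation loop starts enumerating, and only the rows the loop has asked for are ever converted.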

Now, to make access to the prices transparent, you might want to add a class that provides caching:

class FixedSizeCircularBuffer<T> {
    public void Add(T item) { } // drop the oldest item automatically to keep the buffer size fixed
    public T GetPrevious(int relativePosition) { } // provide access relative to the current element
}

class CachedPrices {
    FixedSizeCircularBuffer<Price> Cache;

    public CachedPrices() {
        // connect to the DB and do ExecuteReader
        // set the cache object to a good size
    }

    public Price this[int i] {
        get {
            // pseudocode
            if (i is in Cache) {
                return Cache[i];
            } else {
                reader.Read();
                // convert the row to a Price, store it in the cache, and return it
            }
        }
    }

}
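The buffer outlined above could be filled in roughly as follows. This is one possible implementation, not the only one: the constructor argument, the `Count` property and the convention that position 0 means the most recently added item are choices made here for illustration.

```csharp
using System;

public sealed class FixedSizeCircularBuffer<T>
{
    private readonly T[] _items;
    private int _next;  // slot the next Add will write to
    private int _count; // number of valid items, at most _items.Length

    public FixedSizeCircularBuffer(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _items = new T[capacity];
    }

    public int Count
    {
        get { return _count; }
    }

    // Adds an item; once the buffer is full, the oldest item is overwritten.
    public void Add(T item)
    {
        _items[_next] = item;
        _next = (_next + 1) % _items.Length;
        if (_count < _items.Length) _count++;
    }

    // relativePosition 0 is the most recently added item, 1 the one before it, and so on.
    public T GetPrevious(int relativePosition)
    {
        if (relativePosition < 0 || relativePosition >= _count)
            throw new ArgumentOutOfRangeException(nameof(relativePosition));
        int index = (_next - 1 - relativePosition + _items.Length) % _items.Length;
        return _items[index];
    }
}
```

For the look-back described in the question, a capacity of 11 would hold the current price plus the previous 10, regardless of how many years the simulation covers.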

Once you have such infrastructure, you can use it to restrict how much pricing information is loaded, and to keep your functionality separate from the caching mechanism. This gives you the flexibility to control how much memory you spare for pre-fetching prices and how much data you can process.

Needless to say, this is just a guideline; you will have to understand it and implement it for yourself.

From a time-efficiency perspective, the optimal approach is to get back an initial batch of prices, start processing those, and immediately begin retrieving the rest. The problem with checking for new data during your processing is that you have to stall your program every time you need new data.

If you really do care about memory, what you need to do is remove prices from the list once you are done with them. This allows the garbage collector to free the memory they consumed. Otherwise, with what you have now, by the time your program is finishing and has pulled back the last year of prices, it will have retrieved all of the prices and will be consuming as much memory as if it had fetched them all at once.
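Both points, fetching ahead while processing and letting go of prices that are no longer needed, can be combined with a bounded producer/consumer queue. The sketch below uses `BlockingCollection<T>` and `Task.Run` from the .NET base class library; the `Prefetcher` name and the `maxBuffered` parameter are hypothetical, and `source` stands for whatever lazily reads prices from the database.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class Prefetcher
{
    // Reads 'source' on a background task, staying at most 'maxBuffered' items ahead
    // of the consumer. Items that have been consumed are no longer referenced by the
    // queue, so the garbage collector is free to reclaim them.
    // Limitation of this sketch: if the consumer stops enumerating early, the
    // producer task stays blocked on a full queue.
    public static IEnumerable<T> Prefetch<T>(IEnumerable<T> source, int maxBuffered)
    {
        var queue = new BlockingCollection<T>(maxBuffered);

        Task producer = Task.Run(() =>
        {
            try
            {
                foreach (T item in source)
                {
                    queue.Add(item); // blocks while the queue is full
                }
            }
            finally
            {
                queue.CompleteAdding(); // lets the consumer loop finish
            }
        });

        foreach (T item in queue.GetConsumingEnumerable())
        {
            yield return item;
        }

        producer.GetAwaiter().GetResult(); // rethrows any exception from the producer
    }
}
```

The simulation loop would then iterate over `Prefetcher.Prefetch(prices, 10000)` instead of a fully loaded list, so the database read and the calculations overlap while memory use stays bounded by the buffer size.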

I believe your memory concerns are premature. The only time I ever had to worry about memory or the garbage collector in .NET was when I had a long-running process in which one step included downloading PDFs. Even though I retrieved the PDFs only as needed, after running for a while the PDFs held in memory would eventually consume gigabytes and throw an exception once they hit whatever the .NET memory limit for lists is.
