C# dictionary - how to solve limit on number of items?

I am using a Dictionary and I need to store almost 13,000,000 keys in it. Unfortunately, after adding the 11,950,000th key I got a "System out of memory" exception. Is there any solution to this problem? In the future I will need my program to run on less powerful computers than mine.

I need that many keys because I need to store pairs of sequence name and sequence length; it is for solving a bioinformatics-related problem.

Any help will be appreciated.

Buy more memory, install a 64-bit version of the OS, and recompile for 64 bits. No, I'm not kidding. If you want so many objects... in RAM... And then call it a "feature". If the new Android can require 16 GB of memory to be compiled...

I was forgetting... You could begin by reading "C# array of objects, very large, looking for a better way".

Do you realize how many 13 million objects are?

To make a comparison, a 32-bit Windows app has access to less than 2 GB of address space. So it's 2 billion bytes (give or take)... 2 billion / 13 million = something around 150 bytes/object. Now, if we consider how much a reference type occupies... It's quite easy to eat 150 bytes.
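The back-of-the-envelope figure above can be reproduced in a few lines. This is only a sketch: the 2 billion bytes and 13 million entries come from the answer, while the `MemoryBudget` class and `BytesPerEntry` helper are names made up for illustration.

```csharp
using System;

public static class MemoryBudget
{
    // Hypothetical helper: how many bytes each entry may occupy if
    // entryCount entries must share addressSpaceBytes of address space.
    public static long BytesPerEntry(long addressSpaceBytes, long entryCount)
    {
        return addressSpaceBytes / entryCount; // integer division, rounds down
    }

    public static void Main()
    {
        long addressSpace = 2000000000L; // "2 billion bytes (give or take)"
        long entries = 13000000L;        // the 13 million keys from the question

        // Prints: 153 bytes per entry
        Console.WriteLine("{0} bytes per entry", BytesPerEntry(addressSpace, entries));
    }
}
```

That budget has to cover the key, the value, and the dictionary's own bookkeeping for each entry, which is why a reference-type key such as a string uses it up quickly.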

I'll add something: I've looked in my Magic 8-Ball and it told me: show us your code. If you don't tell us what you are using for the keys and the values, how should we be able to help you? What are you using, class or struct or "primitive" types? Tell us the "size" of your TKey and TValue. Sadly our crystal ball broke yesterday :-)

C# is not a language that was designed to solve heavy-duty scientific computation problems. It is absolutely possible to use C# to build tools that do what you want, but the off-the-shelf parts like Dictionary were designed to solve more common business problems, like mapping zip codes to cities and that sort of thing.

You're going to have to go with external storage of some sort. My recommendation would be to buy a database and use it to store your data. Then use a DataSet or some similar technology to load portions of the data into memory, manipulate them, and then pour more data from the database into the DataSet, and so on.

Well, I had almost exactly the same problem.

I wanted to load about 12.5 million [string, int] pairs into a dictionary from a database (for all the programming "gods" above who don't understand why: it is enormously quicker when you are working with a 150 GB database if you can cache a proportion of one of the key tables in memory).

Annoyingly, it threw an out-of-memory exception at pretty much the same place, just under the 12 million mark, even though the process was only consuming about 1.3 GB of memory (reduced to about 800 MB after a judicious change to the DB read method so that it did not try to do it all at once), despite running on an i7 with 8 GB of memory.

The solution was actually remarkably simple: in Visual Studio (2010), right-click the project in Solution Explorer and select Properties. In the Build tab, set Platform target to x64 and rebuild.
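To confirm that a rebuilt program really runs as a 64-bit process, a quick check along these lines may help. It is a minimal sketch: `Environment.Is64BitProcess` (available from .NET 4.0) and `IntPtr.Size` are standard, while the `BitnessCheck` class and `Describe` method are names invented for this example.

```csharp
using System;

public static class BitnessCheck
{
    // Hypothetical helper: summarizes whether the current process is 64-bit.
    public static string Describe()
    {
        // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit process.
        return string.Format("64-bit process: {0}, pointer size: {1} bytes",
            Environment.Is64BitProcess, IntPtr.Size);
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

If this still reports a 32-bit process on a 64-bit OS, the Platform target setting is the first thing to recheck.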

It now rattles through the load into the Dictionary in a few seconds, and the Dictionary performance is very good.

The easy solution is just to use a simple DB. The most obvious solution in this case, IMHO, is SQLite .NET: fast, easy, and with a low memory footprint.

Really, 13,000,000 items are quite a lot. If those 13,000,000 are allocated classes, that is a very deep kick into the garbage collector's stomach!

Also, even if you find a way to use the default .NET dictionary, the performance would be really bad: there are too many keys. The number of keys approaches the number of values a 31-bit hash can use, performance will be awful in whatever system you use, and of course memory use will be too much!

If you need a data structure that can use more memory than a hash table, you probably need a custom hash table mixed with a custom binary tree data structure. Yes, it is possible to write your own combination of the two.

You certainly cannot rely on the .NET hash table for such a strange and specific problem.

Consider that a tree has a lookup complexity of O(log n) and a build complexity of O(n * log n); of course, building it will take too long. You should then build a hash table of binary trees (or vice versa) that will allow you to use both data structures while consuming less memory.

Then, think about compiling it in 32-bit mode rather than 64-bit mode: 64-bit mode uses more memory for pointers. At the same time, the contrary is possible: a 32-bit address space may not be sufficient for your problem. It has never happened to me to have a problem that can exhaust a 32-bit address space!

If both keys and values are simple value types, I would suggest you write your data structure in a C DLL and use it from C#.

You can try to write a dictionary of dictionaries. Let's say you split your data into chunks of 500,000 items across 26 dictionaries, for example; but the occupied memory would be very, very big, and I don't think your system will handle it.

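using System.Collections.Generic;

// A dictionary split across many smaller inner dictionaries, so that no
// single inner dictionary has to hold all 13,000,000 entries.
public class MySuperDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue>[] dictionaries;

    public MySuperDictionary()
    {
        this.dictionaries = new Dictionary<TKey, TValue>[373]; // must be a prime number.
        for (int i = 0; i < dictionaries.Length; ++i)
            dictionaries[i] = new Dictionary<TKey, TValue>(13000000 / dictionaries.Length);
    }

    public void Add(TKey key, TValue value)
    {
        // Clear the sign bit so the modulo result is a valid array index.
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        dictionaries[bucket].Add(key, value);
    }

    public bool Remove(TKey key)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        return dictionaries[bucket].Remove(key);
    }

    public bool TryGetValue(TKey key, out TValue result)
    {
        int bucket = (GetSecondaryHashCode(key) & 0x7FFFFFFF) % dictionaries.Length;
        return dictionaries[bucket].TryGetValue(key, out result);
    }

    public static int GetSecondaryHashCode(TKey key)
    {
        // Here you should return a hash code for the key, possibly using a
        // different hashing algorithm than the one the inner dictionaries use.
        // Placeholder (an assumption, not part of the original answer):
        // scramble the default hash code with a multiplicative mix.
        int h = EqualityComparer<TKey>.Default.GetHashCode(key);
        unchecked
        {
            h = (int)((uint)h * 2654435761u);
            return h ^ (int)((uint)h >> 16);
        }
    }
}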

I think that you need a new approach to your processing.

I must assume that you obtain the data from a file or a database; either way, that is where it should remain.

There is no way to actually increase the limit on the number of values stored in a Dictionary other than increasing system memory, but either way it is an extremely inefficient means of processing such a large amount of data.

You should rethink your algorithm so that you can process the data in more manageable portions. It will mean processing it in stages until you get your result. This may mean many hundreds of passes through the data, but it's the only way to do it.
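One way to picture processing in portions is sketched below. Everything here is an assumption for illustration: the `ChunkedProcessing` class, the `Batch` helper, and the tab-separated "name, length" line format are not from the question. In real use the lines would presumably come from something like `File.ReadLines(path)`, which streams a file line by line instead of loading it whole.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkedProcessing
{
    // Hypothetical helper: lazily splits a sequence into batches of at most
    // batchSize items, so only one batch is held in memory at a time.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize); // start a fresh batch
            }
        }
        if (batch.Count > 0) yield return batch; // final, possibly shorter batch
    }

    public static void Main()
    {
        // Stand-in data; the "name<TAB>length" format is an assumption.
        var lines = new[] { "seqA\t120", "seqB\t95", "seqC\t300" };

        long totalLength = 0;
        foreach (List<string> batch in Batch(lines, 2))
        {
            // Each stage works on one batch, then lets it go.
            foreach (string line in batch)
                totalLength += long.Parse(line.Split('\t')[1]);
        }
        Console.WriteLine(totalLength); // prints 515
    }
}
```

Whether this fits depends on the algorithm: it works when each stage only needs a running result, not random access to every earlier entry.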

I would also suggest that you look at using generics to help speed up this repetitive processing and cut down on memory usage.

Remember that there will still be a balancing act between system performance and access to externally stored data (be it an external disk store or a database).

The problem is not with the Dictionary object but with the available memory on your server. I did some investigation to try to make the Dictionary object fail, but it never failed. Below is the code for your reference:

    private static void TestDictionaryLimit()
    {
        int intCnt = 0;
        Dictionary<long, string> dItems = new Dictionary<long, string>();
        Console.WriteLine("Total number of iterations = {0}", long.MaxValue);
        Console.WriteLine("....");
        for (long lngCnt = 0; lngCnt < long.MaxValue; lngCnt++)
        {
            if (lngCnt < 11950020)
                dItems.Add(lngCnt, lngCnt.ToString());
            else
                break;
            if ((lngCnt % 100000).Equals(0))
                Console.Write(intCnt++);
        }
        Console.WriteLine("Completed..");
        Console.WriteLine("{0} number of items in dictionary", dItems.Count);
    }

The above code executes properly and stores more items than the count you mentioned.

With that many keys, you should either use a database or something like memcache while swapping out chunks of the cache in storage. I doubt you need all of the items at once, and if you do, there's no way it's going to work on a low-powered machine with little RAM.
