How to get a distinct count of a variable within the values of items in a dictionary using LINQ

I have a dictionary with about 40 million items, and I'm trying to get a distinct count based on a ulong defined in the value of each KeyValuePair in the dictionary.

The way I'm currently doing it:

int Total = (from c in Items select c.Value.Requester).Distinct().Count();

The only problem is that my app is already using about 3.9GB of RAM, and this method seems to make copies of the items it finds (which happens to be about 95% of the items in the dictionary), so RAM usage spikes by a couple more gigabytes before the GC gets around to cleaning it all up.

Is there a way to get a distinct count without making copies?

No, you can't do that. Distinct() has to copy the values because it needs to remember which values it has seen before.

If you had a list where the items were sorted by Value.Requester then you could count distinct values with a single linear scan without copying. But you don't have that.

If you know that your values lie within a specific range (e.g. 1 to 100,000,000) you could write a more memory-efficient algorithm using a bit array. You can create an array of 100,000,000 bits (an array of about 3.1 million 32-bit ints), which would only consume about 12.5 megabytes, and use this to record which values you have seen.

Here's some code that you might be able to use:

// Warning: this scans the input multiple times!
// Rewriting the code to only use a single scan is left as an exercise
// for the reader.
// Requires "using System.Linq;" and must be declared inside a static class.
public static int DistinctCount(this IEnumerable<int> values)
{
    int min = values.Min();
    int max = values.Max();
    // One bit per possible value in [min, max]; long arithmetic avoids
    // overflow when max - min does not fit in an int.
    long range = (long)max - min + 1;
    uint[] bitarray = new uint[(range + 31) / 32];
    foreach (int value in values)
    {
        long offset = (long)value - min;
        bitarray[offset / 32] |= 1u << (int)(offset % 32);
    }

    // Count the bits that were set.
    uint count = 0;
    for (int i = 0; i < bitarray.Length; ++i)
    {
        uint bits = bitarray[i];
        while (bits != 0)
        {
            count += bits & 1;
            bits >>= 1;
        }
    }
    return (int)count;
}

Use like this (assuming Requester is an int; the ulong from the question would need converting first):

int Total = (from c in Items select c.Value.Requester).DistinctCount();
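
If the range of values really is known up front, the extra scans for Min and Max can be dropped. The following is only a minimal sketch of such a single-pass variant, not the code from the answer above. The class name BitArrayCounting, the method name DistinctCountInRange and its min/max parameters are hypothetical names chosen for illustration, and the sketch assumes every value lies inside the given range (it throws if one does not).

```csharp
using System;
using System.Collections.Generic;

public static class BitArrayCounting
{
    // Counts distinct values in one pass, assuming every value lies in [min, max].
    public static int DistinctCountInRange(this IEnumerable<int> values, int min, int max)
    {
        if (max < min) throw new ArgumentException("max must be >= min");

        // One bit per possible value; long arithmetic avoids overflow of max - min.
        long range = (long)max - min + 1;
        uint[] seen = new uint[(range + 31) / 32];
        int count = 0;

        foreach (int value in values)
        {
            if (value < min || value > max)
                throw new ArgumentOutOfRangeException(nameof(values), "value outside [min, max]");

            long offset = (long)value - min;
            uint mask = 1u << (int)(offset % 32);

            // Count a value only the first time its bit gets set.
            if ((seen[offset / 32] & mask) == 0)
            {
                seen[offset / 32] |= mask;
                count++;
            }
        }
        return count;
    }
}
```

Counting while setting bits also removes the second loop over the bit array, at the cost of one extra test per input value.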

You might have to rethink how you create your dictionary. If you are building it from a file, you might want to read it in smaller chunks. To get your distinct items, you could add the items from each chunk of the dictionary file to a HashSet<>. The final size of the HashSet<> will be the number of distinct items. This approach might still be slow, because the collection has to check whether a value already exists each time you add one to the set.
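
A minimal sketch of the chunked HashSet<> idea might look like the following. How the chunks are produced is not specified in the answer, so the chunks parameter here is a hypothetical stand-in for whatever reads the file piece by piece, and the class and method names are made up for illustration.

```csharp
using System.Collections.Generic;

public static class ChunkedCounting
{
    // chunks: hypothetical stand-in for reading the source file piece by piece.
    public static int DistinctCount(IEnumerable<IEnumerable<ulong>> chunks)
    {
        var seen = new HashSet<ulong>();
        foreach (IEnumerable<ulong> chunk in chunks)
        {
            // UnionWith adds every value from the chunk that is not already in the set.
            seen.UnionWith(chunk);
        }
        return seen.Count;
    }
}
```

Note that the set still holds one copy of every distinct value, so this keeps the peak lower only because the full dictionary is not in memory at the same time.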

I would take some hints from Mark's answer: make sure your input is sorted before you read it into your application. If your data is sorted, you can count distinct items in a single pass (you basically count the number of times the value at n differs from the value at n + 1).
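
As a rough illustration of that single pass, here is a sketch that assumes the values arrive already sorted. The class and method names are hypothetical, and the code does not verify the ordering, so unsorted input would give an overcount.

```csharp
using System.Collections.Generic;

public static class SortedCounting
{
    // Assumes sortedValues is already sorted, so equal values are adjacent.
    public static int DistinctCountSorted(IEnumerable<ulong> sortedValues)
    {
        int count = 0;
        bool first = true;
        ulong previous = 0;

        foreach (ulong value in sortedValues)
        {
            // A new distinct value starts whenever the value changes.
            if (first || value != previous)
            {
                count++;
                previous = value;
                first = false;
            }
        }
        return count;
    }
}
```

Only the previous value and a counter are kept, so memory use stays constant regardless of how many items are scanned.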

As others have already pointed out, the structure you use can't do what you want without copying.

If you really need to do this with your current structure, I think you will have to introduce some redundancy: when you insert or remove items from this "big Dictionary", maintain a second, rather small one that just keeps the distinct values with a count (beware of multi-threading issues).
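
One way to picture that redundancy is the sketch below. It is not thread-safe, and the Item type, the CountingDictionary wrapper and the DistinctRequesters property are hypothetical names standing in for the poster's real types. It also assumes Requester is not changed after an item has been added.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the value type stored in the big dictionary.
public sealed class Item
{
    public ulong Requester;
}

public sealed class CountingDictionary<TKey>
{
    private readonly Dictionary<TKey, Item> items = new Dictionary<TKey, Item>();
    // Second dictionary: how many items currently refer to each Requester.
    private readonly Dictionary<ulong, int> requesterCounts = new Dictionary<ulong, int>();

    // The distinct count is always available without scanning the items.
    public int DistinctRequesters { get { return requesterCounts.Count; } }

    public void Add(TKey key, Item item)
    {
        items.Add(key, item); // throws on a duplicate key before the counts change
        int n;
        requesterCounts.TryGetValue(item.Requester, out n); // n stays 0 if absent
        requesterCounts[item.Requester] = n + 1;
    }

    public bool Remove(TKey key)
    {
        Item item;
        if (!items.TryGetValue(key, out item)) return false;
        items.Remove(key);

        int n = requesterCounts[item.Requester] - 1;
        if (n == 0) requesterCounts.Remove(item.Requester);
        else requesterCounts[item.Requester] = n;
        return true;
    }
}
```

The second dictionary grows with the number of distinct Requester values, so how small it stays depends on how many of the items share a value.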

As for an alternative:

Use a database. If need be there are in-memory DBs, but I am pretty sure that a disk-based DB would be more than up to the task (40 million an hour would be less than 20K per second). I am more of an Oracle guy, but SQLite, Postgres etc. are absolutely fine for this too. You could use SQLite as a pure in-memory DB if you want, and/or you can create a RAM disk and put the DB files there.

Although it's practically useless in most scenarios, this is technically possible with a simple O(n^2) algorithm. It makes one full pass over the input per distinct value, so on 40,000,000 mostly distinct items expect it to run far longer than a few minutes:

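public static int DistinctCount(this IEnumerable<int> values)
{
    int max = values.Max();
    // long sentinels so that int.MinValue and int.MaxValue in the input
    // are still counted correctly.
    long last = long.MinValue;
    int result = 0;

    do
    {
        // Find the smallest value strictly greater than the last one counted.
        long current = long.MaxValue;
        foreach (int value in values)
        {
            if (value < current && value > last)
            {
                current = value;
            }
        }

        result++;
        last = current;

    } while (last != max);

    return result;
}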
