How to get a distinct count of variable within value of items in a dictionary using linq

I have a dictionary with about 40 million items, and I'm trying to get a distinct count based on a ulong defined in the value of each KeyValuePair in the dictionary.

The way I'm currently doing it:

int Total = (from c in Items select c.Value.Requester).Distinct().Count();

The only problem is that my app already uses about 3.9 GB of RAM, and this method seems to make copies of the items it finds (about 95% of the items in the dictionary), so RAM usage spikes by a couple more gigabytes before the GC gets around to cleaning it all up.

Is there a way to get a distinct count without making copies?

No, you can't do that. It needs to copy the values because it needs to remember which values it has seen before.

If you had a list where the items were sorted by Value.Requester then you could count distinct values with a single linear scan without copying. But you don't have that.

If you know that your values lie within a specific range (e.g. 1 to 100,000,000), you could write a more memory-efficient algorithm using a bit array. You can create an array of 100,000,000 bits (an array of about 3.1 million 32-bit ints), which would consume only about 12.5 megabytes, and use this to store which values you have seen.

Here's some code that you might be able to use:

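// Requires System.Collections.Generic and System.Linq; as an extension
// method it must be declared inside a static class.
// Warning: this scans the input multiple times!
// Rewriting the code to only use a single scan is left as an exercise
// for the reader.
public static int DistinctCount(this IEnumerable<int> values)
{
    int min = values.Min();
    int max = values.Max();
    // One bit per possible value in [min, max]. long arithmetic avoids
    // overflow when the range is wide.
    long range = (long)max - min + 1;
    uint[] bitarray = new uint[(range + 31) / 32];
    foreach (int value in values)
    {
        long offset = (long)value - min;
        // 1u keeps the shift unsigned, so bit 31 is set correctly
        bitarray[offset / 32] |= 1u << (int)(offset % 32);
    }

    // Count the set bits: each one marks a distinct value
    uint count = 0;
    for (int i = 0; i < bitarray.Length; ++i)
    {
        uint bits = bitarray[i];
        while (bits != 0)
        {
            count += bits & 1;
            bits >>= 1;
        }
    }
    return (int)count;
}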

Use like this:

// Requester is a ulong, so it is cast here; this assumes every value fits in an int.
int Total = (from c in Items select (int)c.Value.Requester).DistinctCount();
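
If the bounds are known up front, as in the 1 to 100,000,000 example, the extra scans for Min and Max can be dropped. Here is a rough single-pass sketch of that idea. It is not part of the original answer: the method name DistinctCountInRange and the caller-supplied min and max parameters are made up for illustration, and it uses System.Collections.BitArray instead of a hand-rolled uint[]. It takes ulong values because that is the type of Requester in the question.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class BoundedDistinct
{
    // Counts distinct values in a single pass, assuming every value lies
    // in [min, max]. Both bounds are hypothetical caller-supplied inputs.
    public static int DistinctCountInRange(this IEnumerable<ulong> values, ulong min, ulong max)
    {
        if (max < min)
            throw new ArgumentException("max must be >= min");
        // BitArray lengths are ints, so the range has to fit in one.
        if (max - min >= (ulong)int.MaxValue)
            throw new ArgumentOutOfRangeException(nameof(max), "range too large for a BitArray");

        // One bit per possible value, all initially false.
        var seen = new BitArray((int)(max - min) + 1);
        int count = 0;
        foreach (ulong value in values)
        {
            if (value < min || value > max)
                throw new ArgumentOutOfRangeException(nameof(values), "value outside [min, max]");

            int index = (int)(value - min);
            if (!seen[index])
            {
                // First time this value shows up: mark it and count it.
                seen[index] = true;
                count++;
            }
        }
        return count;
    }
}
```

With System.Linq imported, a call might look like Items.Values.Select(v => v.Requester).DistinctCountInRange(1, 100000000). Counting as bits are set also removes the second loop over the bit array.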

You might have to rethink how you create your dictionary. If you are building it from a file, you might want to read in smaller chunks of it at a time. To get your distinct items, you could, from each chunk of the dictionary file, start adding items to a HashSet<> . The final size of the HashSet<> will be the number of distinct items. This approach might still be slow, as the collection needs to do work to make sure a value doesn't already exist each time you add a value to the set.
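
A minimal sketch of the chunked HashSet<> idea, assuming the data can be exposed as a sequence of batches. The class name ChunkedDistinct and the chunks parameter are hypothetical, and how each chunk is read from the file is left out.

```csharp
using System.Collections.Generic;

public static class ChunkedDistinct
{
    // chunks is a hypothetical sequence of batches, e.g. one batch per
    // block read from the source file.
    public static int CountDistinct(IEnumerable<IEnumerable<ulong>> chunks)
    {
        var seen = new HashSet<ulong>();
        foreach (IEnumerable<ulong> chunk in chunks)
        {
            // UnionWith adds only the values that are not already in the set.
            seen.UnionWith(chunk);
        }
        // The final size of the set is the number of distinct values.
        return seen.Count;
    }
}
```

Note that the set still holds one entry per distinct value, so with mostly distinct values it will itself be large. What this saves is having the full dictionary and the set in memory at the same time.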

I would take some hints from Mark's answer: make sure your input is sorted before you read it into your application. You can count distinct items in a single pass if your data is sorted (you basically count the number of times the value at n differs from the value at n + 1, plus one for the first value).
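
The single-pass count over sorted input can be sketched roughly as follows. This assumes the values really do arrive in non-decreasing order; the method name DistinctCountSorted is made up for illustration.

```csharp
using System.Collections.Generic;

public static class SortedDistinct
{
    // Assumes sortedValues is in non-decreasing order. If it is not,
    // the result will be too high.
    public static int DistinctCountSorted(this IEnumerable<ulong> sortedValues)
    {
        int count = 0;
        bool first = true;
        ulong previous = 0;
        foreach (ulong value in sortedValues)
        {
            // A new distinct value starts at the first element and
            // wherever the current element differs from the one before it.
            if (first || value != previous)
            {
                count++;
                previous = value;
                first = false;
            }
        }
        return count;
    }
}
```

Only the previous value is remembered, so the extra memory is constant no matter how many items are scanned.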

As others have already pointed out the structure you use can't do what you want without copying...

If you really need to do this with your current structure, I think you will have to introduce some redundancy, i.e. when you insert/remove items from this "big Dictionary", maintain a second, rather small one that just keeps the distinct values with a count (BEWARE of multi-threading issues)...
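
One way that redundancy could look, as a rough sketch only: the class name RequesterCounter and its members are hypothetical, and the caller is assumed to call Add and Remove every time an item enters or leaves the big dictionary.

```csharp
using System.Collections.Generic;

// Not thread-safe: callers must synchronize access themselves.
public sealed class RequesterCounter
{
    // Maps each requester value to the number of items that carry it.
    private readonly Dictionary<ulong, int> _counts = new Dictionary<ulong, int>();

    // The number of keys is the distinct count, available in O(1).
    public int DistinctCount => _counts.Count;

    // Call when an item with this requester is inserted into the big dictionary.
    public void Add(ulong requester)
    {
        _counts.TryGetValue(requester, out int n); // n is 0 if the key is absent
        _counts[requester] = n + 1;
    }

    // Call when an item with this requester is removed from the big dictionary.
    public void Remove(ulong requester)
    {
        if (!_counts.TryGetValue(requester, out int n)) return;
        if (n <= 1) _counts.Remove(requester);     // last item with this value
        else _counts[requester] = n - 1;
    }
}
```

The second dictionary holds one entry per distinct value, so it is only small when the values repeat a lot. The trade is permanent extra memory in exchange for never having to scan the big dictionary.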

As for an alternative:

Use a Database... if need be there are in-memory-DBs... but I am pretty sure that a disk-based DB would be more than up to the task (40 million an hour would be less than 20K per second)... I am more of an Oracle guy... but SQLite, Postgres etc. are absolutely fine for this too... you could use SQLite as a pure "in-memory-DB" if you want and/or you can create a RAM disk and put the DB files there.

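Although it's practically useless in most scenarios, this is technically possible without any extra storage using a simple O(n^2) algorithm. It makes one full pass over the input per distinct value, so on 40,000,000 items that are mostly distinct it will take a very long time to execute.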

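public static int DistinctCount(this IEnumerable<int> values)
{
    int max = values.Max();
    // long sentinels, so that int.MinValue and int.MaxValue in the input
    // are handled like any other value
    long last = long.MinValue;
    int result = 0;

    do
    {
        // Find the smallest value strictly greater than the last one counted
        long current = long.MaxValue;
        foreach (int value in values)
        {
            if (value < current && value > last)
            {
                current = value;
            }
        }

        result++;
        last = current;

    } while (last != max);

    return result;
}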
