Is there a heuristic algorithm for groupBy + count?
I got a List of integers and I want to count the number of times each integer appears in the list.

For example: [0,5,0,1,3,3,1,1,1]
gives (0 -> 2), (1 -> 4), (3 -> 2), (5 -> 1).

I only need the counts, not the values (the goal is to have a histogram of the counts).
A common approach would be to group by value, then count the cardinality of each group. In SQL:

SELECT count(*) FROM myTable GROUP BY theColumnContainingIntegers
Is there a faster way to do this? A heuristic or probabilistic approach is fine, since I am processing a large data set and sacrificing precision for speed is acceptable.
Something similar to the HyperLogLog algorithm (used to count the number of distinct elements in a data set) would be great, but I did not find anything like this...
Let's take your set containing 9 elements, [0,5,0,1,3,3,1,1,1], and make it big, keeping the same relative frequencies of elements:
> bigarray = [0,5,0,1,3,3,1,1,1] * 200
=> [0, 5, 0, 1, 3, 3, 1, 1, 1, 0, 5, 0, 1, 3, 3, 1, ...
Now bigarray has size 1800, so let's work with it.
Take a sample of 180 elements (180 random elements from this set).
Now compute the occurrence counts for this random subset:
{5=>19, 3=>45, 1=>76, 0=>40}
Normalized (each count divided by the smallest one):
{5=>1.0, 3=>2.3684210526315788, 1=>4.0, 0=>2.1052631578947367}
Of course, for a different random subset the results will differ:
{5=>21, 3=>38, 1=>86, 0=>35}
Normalized:
{5=>1.0, 3=>1.8095238095238095, 1=>4.095238095238095, 0=>1.6666666666666667}
Of course there is some error here; this is inevitable, and you will need to state what error is acceptable.
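In Ruby, which the snippets above use, a minimal sketch of this sample-and-normalize step could look like the following (the 180-element sample size matches the example above; Array#tally needs Ruby 2.7+):

bigarray = [0, 5, 0, 1, 3, 3, 1, 1, 1] * 200   # same bigarray as above

sample = bigarray.sample(180)   # 180 random elements
counts = sample.tally           # e.g. {5=>19, 3=>45, 1=>76, 0=>40}

# Normalize by the smallest count so the rarest sampled value maps to 1.0
min = counts.values.min.to_f
normalized = counts.transform_values { |c| c / min }
p normalized                    # e.g. {5=>1.0, 3=>2.37, 1=>4.0, 0=>2.11}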
Now run the same test on a bigarray of size 1000, with 50% 0's and 50% 1's:
> bigarray = [0,1] * 500
=> [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, ...
With a sample of 100 elements:
{0=>50, 1=>50}
Normalized:
{0=>1.0, 1=>1.0}
Second sample:
{0=>49, 1=>51}
Normalized:
{0=>1.0, 1=>1.0408163265306123}
It seems that we can easily get away with a reduced subset, and this is where sampling comes in.
Especially reservoir sampling; this may be very useful if, in your case, the data arrives 'live' or the set is too large to process all values at once.
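A minimal sketch of the classic "Algorithm R" form of reservoir sampling, which maintains a uniform random sample of k elements from a stream of unknown length (the helper name and the sample size are just illustrative; `stream` can be any Enumerable):

def reservoir_sample(stream, k)
  reservoir = []
  stream.each_with_index do |item, i|
    if i < k
      reservoir << item              # fill the reservoir with the first k items
    else
      j = rand(i + 1)                # random index in 0..i
      reservoir[j] = item if j < k   # replace with probability k / (i + 1)
    end
  end
  reservoir
end

p reservoir_sample(bigarray, 180).tally

Each element of the stream ends up in the reservoir with equal probability, and at no point are more than k items held in memory.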
Edit:
Concerning the comment: of course, if you have a large set and some element appears in it very rarely, then you may miss it in the sample and its occurrence count will equal 0.
Then you may use a kind of smoothing function (see additive smoothing): just pretend that each possible element appeared slightly more often than it really did, by adding a small constant pseudocount to every count.
For example, let's say the full set contains:
1000 elements equal to 1
100 elements equal to 2
10 elements equal to 3
1 element equal to 4
Let's say our sampled subset contains {1=>100, 2=>10, 3=>1, 4=>0}.
With smoothing parameter 0.05, we add 0.05 to each occurrence count:
{1=>100.05, 2=>10.05, 3=>1.05, 4=>0.05}
Of course, this assumes that you know which values can possibly be present in the set.
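A minimal sketch of this smoothing step, assuming the list of possible values is known up front (the variable names and alpha = 0.05 are just the values from the example above):

alpha = 0.05                               # smoothing parameter
possible_values = [1, 2, 3, 4]
observed = { 1 => 100, 2 => 10, 3 => 1 }   # value 4 never showed up in the sample

# Add the pseudocount to every possible value, including unseen ones
smoothed = possible_values.map { |v| [v, observed.fetch(v, 0) + alpha] }.to_h
p smoothed                                 # {1=>100.05, 2=>10.05, 3=>1.05, 4=>0.05}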