
Creating a random subset of a table with an average number of counts per key

I have a database with 1 billion key/val pairs and 20 million unique keys. On average, each key is associated with 50 vals.

key  val
key1 val1
key1 val2
key1 val3
key2 val2
key2 val7
.
.
.

I ran the following query to get the average and standard deviation of the number of vals per unique key:

select avg(cnt), stddev(cnt)
  from (select key, count(key) as cnt
        from original_db
        group by key) t

This gives avg(cnt) = 50 and stddev(cnt) = 137.
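For readers who want to check the same statistics on a small sample outside the database, here is a minimal Python sketch of the per-key count, average, and standard deviation. The function name `count_stats` and the toy data are hypothetical, and it assumes `stddev` means the sample standard deviation, which varies by database.

```python
from collections import Counter
from statistics import mean, stdev


def count_stats(pairs):
    """Return (average, sample standard deviation) of the number of vals per key."""
    # Equivalent of: select key, count(*) ... group by key
    counts = Counter(key for key, _ in pairs)
    per_key = list(counts.values())
    # stdev() is the sample standard deviation (assumed to match SQL stddev)
    return mean(per_key), stdev(per_key)


# Hypothetical toy data: key1 has 3 vals, key2 has 2 vals.
pairs = [
    ("key1", "val1"), ("key1", "val2"), ("key1", "val3"),
    ("key2", "val2"), ("key2", "val7"),
]
avg_cnt, std_cnt = count_stats(pairs)
```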

I would like to create a subset of keys from this table such that the avg(cnt) of the subset is 100. That is, each unique key in the subset table should be associated with about 100 values on average.

You can aggregate by key and use a window function to compute a running average of the counts:

select key
from (select key, count(*) as cnt,
             avg(count(*)) over (order by count(*) desc, key) as running_avg
      from t
      group by key
     ) t
where running_avg >= 100;

In other words, this takes all the keys that have 100+ values and then keeps taking keys with smaller counts while the cumulative average is 100 or over.

Do note that this could return no keys at all if no key has at least 100 values.
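The running-average selection can also be sketched in plain Python, which may help when checking the logic on a small sample. This is only an illustration under stated assumptions: the function name `keys_with_running_avg` and the toy counts are hypothetical, and it mirrors the query's ordering (largest counts first, ties broken by key).

```python
from collections import Counter


def keys_with_running_avg(pairs, target):
    """Return keys, largest counts first, while the running average count >= target."""
    counts = Counter(key for key, _ in pairs)
    # Mirrors: order by count(*) desc, key
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    selected, total = [], 0
    for n, (key, cnt) in enumerate(ordered, start=1):
        total += cnt
        # Counts are descending, so the running average never increases;
        # once it drops below the target it stays below, and we can stop.
        if total < target * n:
            break
        selected.append(key)
    return selected


# Hypothetical toy data: a has 200 vals, b has 100, c has 60, d has 10.
toy_counts = {"a": 200, "b": 100, "c": 60, "d": 10}
pairs = [(k, i) for k, n in toy_counts.items() for i in range(n)]
subset = keys_with_running_avg(pairs, 100)
```

On this toy data the running averages are 200, 150, 120, and 92.5, so the first three keys are kept. The subset's average is therefore at least the target rather than exactly equal to it, and the selection is deterministic rather than random.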

