Creating a random subset of a table with an average number of counts per key
I have a database with 1 billion key/val pairs and 20 million unique keys. On average, each key is associated with 50 vals.
key val
key1 val1
key1 val2
key1 val3
key2 val2
key2 val7
.
.
.
I ran the following and got the average and standard deviation of the number of vals per unique key.
select avg(cnt), stddev(cnt)
from (select count(key) as cnt, key
      from original_db
      group by key
     ) t
This gives avg(cnt) = 50 and stddev(cnt) = 137.
I would like to create a subset of keys from this table such that the avg(cnt) of the subset is 100. In other words, each unique key in the subset table should be associated with ~100 values on average.
You can aggregate by key and use a cumulative average over the per-key counts to calculate a running average:
select key
from (select key, count(*) as cnt,
             -- running average of per-key counts, largest counts first
             avg(count(*)) over (order by count(*) desc, key) as running_avg
      from t
      group by key
     ) t
where running_avg >= 100;
In other words, this takes all the keys that have 100+ values and then keeps taking keys with smaller counts while the cumulative average is 100 or over.
Do note that this could return no keys if no key has at least 100 values.
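To make the running-average logic concrete, here is a minimal Python sketch of the same idea on a tiny in-memory table. The function name `keys_with_running_avg` and the sample data are made up for illustration and are not part of the original query. Note that this selection is deterministic: it takes the highest-count keys first rather than drawing a random sample.

```python
from collections import Counter


def keys_with_running_avg(pairs, target):
    """Return keys whose cumulative average count (largest counts first) is >= target.

    Hypothetical helper that mirrors the SQL: count vals per key, order by
    count descending then key, and keep rows where the running average holds.
    """
    counts = Counter(key for key, _ in pairs)
    # Same ordering as the window clause: count descending, then key ascending.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    selected = []
    total = 0
    for n, (key, cnt) in enumerate(ordered, start=1):
        total += cnt
        # Counts are non-increasing, so the running average never goes back up;
        # this filter therefore keeps a prefix of the ordered keys.
        if total / n >= target:
            selected.append(key)
    return selected


# Illustrative data: key "a" has 5 vals, "b" has 3, "c" and "d" have 1 each.
sample_pairs = (
    [("a", i) for i in range(5)]
    + [("b", i) for i in range(3)]
    + [("c", 0), ("d", 0)]
)

print(keys_with_running_avg(sample_pairs, 3))
```

With a target of 3, the running averages are 5, 4, 3 and 2.5, so the first three keys are kept and their average count is exactly 3. With a target above the largest count (for example 6), nothing is returned, which is the empty-result case mentioned above.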