Select random row from a PostgreSQL table with weighted row probabilities

Example input:

SELECT * FROM test;
 id | percent   
----+----------
  1 | 50 
  2 | 35   
  3 | 15   
(3 rows)

How would you write a query such that, on average, 50% of the time I get the row with id=1, 35% of the time the row with id=2, and 15% of the time the row with id=3?

I tried something like SELECT id FROM test ORDER BY p * random() DESC LIMIT 1, but it gives wrong results. After 10,000 runs I get a distribution like {1=6293, 2=3302, 3=405}, but I expected the distribution to be nearly {1=5000, 2=3500, 3=1500}.

Any ideas?
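
For reference, the whole 10,000-run test can be done in SQL. Below is a sketch that counts how often each id wins; the inner query is the attempt above (using the percent column name from the example), and any candidate pick can be substituted. The always-true WHERE runs.n = runs.n clause is a hack (discussed in an answer below) that forces PostgreSQL to re-execute the subquery for every generated row:

-- Repeat the pick 10,000 times and count the winners.
SELECT picked.id, count(*) AS picks
FROM generate_series(1, 10000) AS runs(n)
CROSS JOIN LATERAL (
    SELECT id
    FROM test
    WHERE runs.n = runs.n  -- always true; correlates the subquery
    ORDER BY percent * random() DESC
    LIMIT 1
) AS picked
GROUP BY picked.id
ORDER BY picked.id;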

This should do the trick:

WITH CTE AS (
    SELECT random() * (SELECT SUM(percent) FROM YOUR_TABLE) R
)
SELECT *
FROM (
    SELECT id, SUM(percent) OVER (ORDER BY id) S, R
    FROM YOUR_TABLE CROSS JOIN CTE
) Q
WHERE S >= R
ORDER BY id
LIMIT 1;

The sub-query Q gives the following result:

1  50
2  85
3  100

We then simply generate a random number in the range [0, 100) and pick the first row whose running sum is at or beyond that number (the WHERE clause). We use a common table expression (WITH) to ensure the random number is calculated only once.
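
If you want to verify that the CTE really does sample random() only once, here is a minimal sketch: the CTE yields a single row, and the cross join replicates that same value (random() is volatile, so PostgreSQL materializes the CTE rather than inlining it):

WITH r AS (SELECT random() AS v)
SELECT v FROM r CROSS JOIN generate_series(1, 3);
-- all three output rows show the same value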

BTW, the SELECT SUM(percent) FROM YOUR_TABLE allows you to have any weights in percent - they don't strictly need to be percentages (i.e. add up to 100).

[SQL Fiddle]

ORDER BY random() ^ (1.0 / p)

from the algorithm described by Efraimidis and Spirakis.
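
For completeness, a full query along these lines might look like the sketch below, assuming the question's table and its percent column. Note that in the A-ES algorithm each row gets the key random() ^ (1/weight) and the row with the largest key wins, so the sort must be descending:

-- Weighted reservoir sampling (Efraimidis & Spirakis, A-ES), one pick:
SELECT id
FROM test
ORDER BY random() ^ (1.0 / percent) DESC
LIMIT 1;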

Branko's accepted solution is great (thanks!). However, I'd like to contribute an alternative that is just as performant (according to my tests), and perhaps easier to visualize.

Let's recap. The original question can perhaps be generalized as follows:

Given a map of ids and relative weights, create a query that returns a random id from the map, with probability proportional to its relative weight.

Note the emphasis on relative weights, not percentages. As Branko points out in his answer, using relative weights will work for anything, including percentages.

Now, consider some test data, which we'll put in a temporary table:

CREATE TEMP TABLE test AS
SELECT * FROM (VALUES
    (1, 25),
    (2, 10),
    (3, 10),
    (4, 05)
) AS test(id, weight);

Note that I'm using a more complicated example than the one in the original question, in that the weights do not conveniently add up to 100, and in that the same weight (10) is used more than once (for ids 2 and 3), which is important to consider, as you'll see later.

The first thing we have to do is turn the weights into probabilities from 0 to 1, which is nothing more than a simple normalization (weight / sum(weights)):

WITH p AS ( -- probability
    SELECT *,
        weight::NUMERIC / sum(weight) OVER () AS probability
    FROM test
),
cp AS ( -- cumulative probability
    SELECT *,
        sum(p.probability) OVER (
            ORDER BY probability DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumprobability
    FROM p
)
SELECT
    cp.id,
    cp.weight,
    cp.probability,
    cp.cumprobability - cp.probability AS startprobability,
    cp.cumprobability AS endprobability
FROM cp
;

This will result in the following output:

 id | weight | probability | startprobability | endprobability
----+--------+-------------+------------------+----------------
  1 |     25 |         0.5 |              0.0 |            0.5
  2 |     10 |         0.2 |              0.5 |            0.7
  3 |     10 |         0.2 |              0.7 |            0.9
  4 |      5 |         0.1 |              0.9 |            1.0

The query above is admittedly doing more work than strictly necessary for our needs, but I find it helpful to visualize the relative probabilities this way, and it does make the final step of choosing the id trivial:

SELECT id FROM (/* query above */) q
WHERE random() BETWEEN startprobability AND endprobability;

Now, let's put it all together with a test that ensures the query returns data with the expected distribution. We'll use generate_series() to generate a random number a million times:

WITH p AS ( -- probability
    SELECT *,
        weight::NUMERIC / sum(weight) OVER () AS probability
    FROM test
),
cp AS ( -- cumulative probability
    SELECT *,
        sum(p.probability) OVER (
            ORDER BY probability DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumprobability
    FROM p
),
fp AS ( -- final probability
    SELECT
        cp.id,
        cp.weight,
        cp.probability,
        cp.cumprobability - cp.probability AS startprobability,
        cp.cumprobability AS endprobability
    FROM cp
)
SELECT
    fp.id,
    count(*) AS count
FROM fp
CROSS JOIN (SELECT random() FROM generate_series(1, 1000000)) AS random(val)
WHERE random.val BETWEEN fp.startprobability AND fp.endprobability
GROUP BY fp.id
ORDER BY count(*) DESC
;

This will result in output similar to the following:

 id | count  
----+--------
 1  | 499679 
 3  | 200652 
 2  | 199334 
 4  | 100335 

Which, as you can see, tracks the expected distribution perfectly.

Performance

The query above is quite performant. Even on my average machine, with PostgreSQL running in a WSL1 instance (the horror!), execution is relatively fast:

     count | time (ms)
-----------+----------
     1,000 |         7
    10,000 |        25
   100,000 |       210
 1,000,000 |      1950 

Adaptation to generate test data

I often use a variation of the query above when generating test data for unit/integration tests. The idea is to generate random data that approximates a probability distribution that tracks reality.

In that situation I find it useful to compute the start and end probabilities once and store the results in a table:

CREATE TEMP TABLE test AS
WITH test(id, weight) AS (VALUES
    (1, 25),
    (2, 10),
    (3, 10),
    (4, 05)
),
p AS ( -- probability
    SELECT *, (weight::NUMERIC / sum(weight) OVER ()) AS probability
    FROM test
),
cp AS ( -- cumulative probability
    SELECT *,
        sum(p.probability) OVER (
            ORDER BY probability DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) cumprobability
    FROM p
)
SELECT
    cp.id,
    cp.weight,
    cp.probability,
    cp.cumprobability - cp.probability AS startprobability,
    cp.cumprobability AS endprobability
FROM cp
;

I can then use these precomputed probabilities repeatedly, which results in extra performance and simpler use.
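
For example, a single draw against the precomputed table could look like this sketch, sampling random() once in a CTE so that every row is compared against the same value:

WITH r AS (SELECT random() AS val)  -- one sample, shared by all rows
SELECT t.id
FROM test AS t
CROSS JOIN r
WHERE r.val BETWEEN t.startprobability AND t.endprobability;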

I can even wrap it all in a function that I can call any time I want to get a random id:

CREATE OR REPLACE FUNCTION getrandomid(p_random FLOAT8 = random())
RETURNS INT AS
$$
    SELECT id
    FROM test
    WHERE p_random BETWEEN startprobability AND endprobability
    ;
$$
LANGUAGE SQL STABLE STRICT;
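
A draw is then as simple as:

SELECT getrandomid();      -- fresh random() sample per call
SELECT getrandomid(0.75);  -- or pass an explicit value; returns id 3 here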

Window function frames

It's worth noting that the technique above uses a window function with a non-default frame, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This is necessary to deal with the fact that some weights might be repeated, which is why I chose test data with repeated weights in the first place!
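
To see why the frame matters, compare the default RANGE frame with the explicit ROWS frame on the repeated weights. A self-contained sketch using the probabilities from the table above:

-- With the default RANGE frame, rows 2 and 3 are peers (both 0.2), so they
-- receive the same cumulative sum and their brackets would collapse.
-- The explicit ROWS frame keeps them distinct.
SELECT id,
       sum(probability) OVER (ORDER BY probability DESC) AS range_frame,
       sum(probability) OVER (
           ORDER BY probability DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS rows_frame
FROM (VALUES (1, 0.5), (2, 0.2), (3, 0.2), (4, 0.1)) AS p(id, probability);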

Your proposed query appears to work; see this SQLFiddle demo. It creates the wrong distribution, though; see below.

To prevent PostgreSQL from optimising the subquery, I've wrapped it in a VOLATILE SQL function. PostgreSQL has no way to know that you intend the subquery to run once for every row of the outer query, so if you don't force it to be volatile it'll just execute it once. Another possibility - though one that the query planner might optimize out in future - is to make it appear to be a correlated subquery, like this hack that uses an always-true WHERE clause: http://sqlfiddle.com/#!12/3039b/9
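
The wrapping itself is straightforward; a minimal sketch (the function name is mine, and the body is the query from the question):

-- VOLATILE tells the planner the function may return a different result
-- on every call, so a single cached execution is not allowed.
CREATE OR REPLACE FUNCTION weighted_pick() RETURNS int AS $$
    SELECT id FROM test ORDER BY percent * random() DESC LIMIT 1;
$$ LANGUAGE sql VOLATILE;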

At a guess (before you updated to explain why it didn't work), your testing methodology was at fault, or you were using this as a subquery in an outer query, where PostgreSQL noticed it wasn't a correlated subquery and executed it just once, like in this example.

UPDATE: The distribution produced isn't what you're expecting. The issue here is that you're skewing the distribution by taking multiple samples of random(); you need a single sample.

This query produces the correct distribution (SQLFiddle):

WITH random_weight(rw) AS (SELECT random() * (SELECT sum(percent) FROM test))
SELECT id
FROM (
  SELECT
    id,
    sum(percent) OVER (ORDER BY id),
    coalesce(sum(prev_percent) OVER (ORDER BY id), 0)
  FROM (
    SELECT
      id,
      percent,
      lag(percent) OVER () AS prev_percent
    FROM test
  ) x
) weighted_ids(id, weight_upper, weight_lower)
CROSS JOIN random_weight
WHERE rw BETWEEN weight_lower AND weight_upper;

Performance is, needless to say, horrible. It's using two nested sets of windows. What I'm doing is:

  • Creating (id, percent, previous_percent), then using that to create two running sums of weights that are used as range brackets; then
  • Taking a random value, scaling it to the range of weights, and then picking a value that has a weight within the target bracket

Here is something for you to play with:

select t1.id as id1
  , case when t2.id is null then 0 else t2.id end as id2
  , t1.percent as percent1
  , case when t2.percent is null then 0 else t2.percent end as percent2 
from "Test1" t1 
  left outer join "Test1" t2 on t1.id = t2.id + 1
where random() * 100 between t1.percent and 
  case when t2.percent is null then 0 else t2.percent end;

Essentially perform a left outer join so that you have two columns to apply a BETWEEN clause.

Note that it will only work if you get your table ordered in the right way.

Based on Branko Dimitrijevic's answer, I wrote this query, which may or may not be faster, computing the sum total of percent with tiered window functions (not unlike a ROLLUP).

WITH random AS (SELECT random() AS random)
SELECT id FROM (
    SELECT id, percent,
    SUM(percent) OVER (ORDER BY id) AS rank,
    SUM(percent) OVER () * random AS roll
    FROM test CROSS JOIN random
) t WHERE roll <= rank LIMIT 1

If the ordering isn't important, SUM(percent) OVER (ROWS UNBOUNDED PRECEDING) AS rank, may be preferable, because it avoids having to sort the data first.
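
That variant would look like the sketch below. I've also added an explicit ORDER BY on the running sum (my addition, not part of the original query) so that the LIMIT 1 pick does not depend on scan order:

WITH random AS (SELECT random() AS random)
SELECT id FROM (
    SELECT id, percent,
    SUM(percent) OVER (ROWS UNBOUNDED PRECEDING) AS rank,
    SUM(percent) OVER () * random AS roll
    FROM test CROSS JOIN random
) t WHERE roll <= rank
ORDER BY rank
LIMIT 1;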

I also tried Mechanic Wei's answer (as described in this paper, apparently), which seems very promising in terms of performance, but after some testing the distribution appears to be off (possibly because the A-ES key needs a descending sort, as noted above):

SELECT id
FROM test
ORDER BY random() ^ (1.0/percent)
LIMIT 1
