Random row selection with weighted filters in SQL/PostgreSQL

I have a questions table and I need to get X questions to prepare a test. The questions need to be filtered according to multiple criteria (subject, institution, area, etc.), each with a different weight.

The filter weights are set dynamically and normalized outside the query. For example:

  1. Subject 1: 0.4
  2. Subject 2: 0.1
  3. Subject 3: 0.5
  4. Institution 1: 0.2
  5. Institution 2: 0.04
  6. Institution 3: 0.76
  7. Area 1: 1
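
The sample weights sum to 1 within each filter type (subjects, institutions, areas), which suggests the normalization is done per type. A minimal Python sketch of that step, assuming raw scores arrive as a dictionary; the function name `normalize_per_type` and the raw values are hypothetical and are not part of the original question:

```python
from collections import defaultdict

def normalize_per_type(raw_weights):
    """Scale weights so that they sum to 1 within each filter type.

    raw_weights maps (filter_type, filter_name) -> positive raw weight.
    Assumes every filter type has a positive total.
    """
    totals = defaultdict(float)
    for (filter_type, _name), weight in raw_weights.items():
        totals[filter_type] += weight
    return {
        key: weight / totals[key[0]]
        for key, weight in raw_weights.items()
    }

# Hypothetical raw scores chosen to reproduce the subject weights above.
raw = {
    ("subject", "Subject 1"): 4,
    ("subject", "Subject 2"): 1,
    ("subject", "Subject 3"): 5,
    ("area", "Area 1"): 7,
}
normalized = normalize_per_type(raw)
```

A single area ends up with weight 1 regardless of its raw score, which matches the "Area 1" entry in the sample.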

Some other points:

  • Today I have 10 different filter types (subject, institution, area, etc.), and the user can select several values of each in any combination (e.g., 10 subjects, 5 institutions, 30 areas), as in the sample above.
  • The questions table has ~500k rows;
  • The filters have an N:N (many-to-many) relationship with the questions;
  • After filtering, I want to limit the number of returned rows;
  • If a filter can't supply any more questions, the other filters must make up the difference (remember: I want to prepare a test, so if there are questions left, they must be used);
  • I'm very concerned about the performance of this query.

To illustrate, if I didn't want to weight the filters, I would do something like this:

SELECT
    *
FROM
    public.questions q
    INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
    INNER JOIN public.subjects s ON s.id = sq.subject_id
    INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
    INNER JOIN public.institutions i ON i.id = iq.institution_id
    INNER JOIN public.areas_questions aq ON aq.question_id = q.id
    INNER JOIN public.areas a ON a.id = aq.area_id
WHERE
    s.id IN :subjects
    AND a.id IN :areas
    AND i.id IN :institutions
ORDER BY
    random()
LIMIT 200

Desired output:

Question — Subject — Institution — Area

I was thinking of something along these lines:

  1. Create a CTE with the questions returned by the filters. The same question can be returned by more than one filter: do I need to evaluate each filter separately and UNION ALL the results to handle this? Each row must also record which filter the question came from;
  2. Create another CTE with the weights and their associated filters;
  3. JOIN the CTEs; at this point the questions must be grouped and the weights SUMmed;
  4. Apply a window function and return the results, limited to X rows (LIMIT X).

How would you write such a query / solve this problem?

What about something like this? This is just to demonstrate the idea; I'll leave the details up to you. In case you aren't familiar with this random selection method: if you generate a random number uniformly between 0 and 1, it has a 40% chance of being under 0.4. So rand() <= .4 (random() <= .4 in PostgreSQL) will return true 40% of the time.
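
To make the 40% claim concrete, here is a small Python sketch, written outside SQL purely for illustration. The helper name `passes_filter` is hypothetical, and the fixed seed is there only to make the run repeatable:

```python
import random

def passes_filter(prob, rng):
    """Return True with probability `prob` (rng.random() is uniform on [0, 1))."""
    return rng.random() <= prob

rng = random.Random(42)  # fixed seed only so the run is repeatable
trials = 100_000
hits = sum(passes_filter(0.4, rng) for _ in range(trials))
fraction = hits / trials  # expected to land close to 0.4
```

With this many trials the observed fraction should stay within a percentage point or so of 0.4.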

This assumes you have, or can create, a "Filters" entity that looks a bit like this:

CREATE TABLE Filters
  ( FieldName VARCHAR(100), 
    FieldValue VARCHAR(100),
    Prob Float -- probability of selection based on Name and Value
  );

SELECT DISTINCT TMP.* -- The fields you want. Distinct needed to get rid of 
                      -- records which pass multiple conditions.
  FROM (SELECT YRSWF.*,
               random() AS rnd -- PostgreSQL spells this random(), not RAND()
          FROM YourResultSetWithoutFilters YRSWF -- You can code the details
       ) TMP  
 INNER
  JOIN Filters F
    ON (
       TMP.Subject = F.FieldValue
   AND F.FieldName = 'Subject'
   AND TMP.rnd <= F.prob
       )
    OR (
       TMP.Institution = F.FieldValue
   AND F.FieldName = 'Institution'
   AND TMP.rnd <= F.prob
       )
    OR ( 
       TMP.Area = F.FieldValue
   AND F.FieldName = 'Area'
   AND TMP.rnd <= F.prob
       );

OK, I managed to solve it. Basically, I used the strategy already outlined in the question, plus a little help from here. I had already seen this post before, but I was (and still am) trying to solve this in a more elegant way, something like this but for multiple rows, without needing to create the "bounds" by hand.

Let's go step by step:

Since the filters, with their weights, come from outside the schema, let's create a CTE:

WITH filters (type, id, weight) AS (
    SELECT 'subject', '148232e0-dece-40d9-81e0-0fa675f040e5'::uuid, 0.5
    UNION SELECT 'subject', '854431bb-18ee-4efb-803f-185757d25235'::uuid, 0.4
    UNION SELECT 'area', 'e12863fb-afb7-45cf-9198-f9f58ebc80cf'::uuid, 1
    UNION SELECT 'institution', '7f56c89f-705e-45c7-98fb-fee470550edf'::uuid, 0.5
    UNION SELECT 'institution', '0066257b-b2e3-4ee8-8075-517a2aa1379e'::uuid, 0.5
)

Now let's filter the rows, ignoring the weights for now, so that later we don't need to work with the whole table:

WITH filtered_questions AS (
    SELECT
        q.id,
        s.id subject_id,
        a.id area_id,
        i.id institution_id
    FROM
        public.questions q
        INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
        INNER JOIN public.subjects s ON s.id = sq.subject_id
        INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
        INNER JOIN public.institutions i ON i.id = iq.institution_id
        INNER JOIN public.areas_questions aq ON aq.question_id = q.id
        INNER JOIN public.areas a ON a.id = aq.area_id
    WHERE
        -- Output aliases are not visible in WHERE, so reference the joined columns directly.
        s.id IN (SELECT id FROM filters WHERE type = 'subject')
        AND i.id IN (SELECT id FROM filters WHERE type = 'institution')
        AND a.id IN (SELECT id FROM filters WHERE type = 'area')
)

The same question can be matched by multiple filters, which increases its chance of being selected. We must update the weights to account for this, summing the weights of all the filters that match each question.

WITH filtered_questions_weights_sum AS (
    SELECT
        q.id,
        SUM(filters.weight) weight_sum
    FROM filtered_questions q
    INNER JOIN filters
    ON (filters.type = 'subject' AND q.subject_id IN(filters.id))
    OR (filters.type = 'area' AND q.area_id IN(filters.id))
    OR (filters.type = 'institution' AND q.institution_id IN(filters.id))
    GROUP BY q.id
)
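
The grouping step can be sketched in Python to show what the OR-ed join conditions plus SUM produce. The short ids below are hypothetical stand-ins for the UUIDs, and the rows are invented for illustration:

```python
from collections import defaultdict

# Hypothetical filter weights keyed by (type, id); short ids stand in for UUIDs.
filters = {
    ("subject", "s1"): 0.5,
    ("subject", "s2"): 0.4,
    ("area", "a1"): 1.0,
    ("institution", "i1"): 0.5,
}

# One row per (question, subject, area, institution) combination,
# mirroring the shape of filtered_questions.
filtered_questions = [
    ("q1", "s1", "a1", "i1"),
    ("q2", "s2", "a1", "i1"),
    ("q1", "s2", "a1", "i1"),  # q1 also belongs to subject s2
]

weight_sum = defaultdict(float)
for question_id, subject_id, area_id, institution_id in filtered_questions:
    for key in (("subject", subject_id), ("area", area_id), ("institution", institution_id)):
        # Same effect as the OR-ed join conditions: add every matching filter weight.
        weight_sum[question_id] += filters.get(key, 0.0)
```

Note that a question appearing in several combination rows (q1 here) has its area and institution weights counted once per row, so its total is 3.9 rather than 2.4. That is how the join behaves as written; whether it is the intended weighting depends on the use case.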

Generating the bounds, as explained here.

WITH cumulative_prob AS (
    SELECT
        id,
        SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
    FROM filtered_questions_weights_sum
),
cumulative_bounds AS (
    SELECT
        id,
        COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
        cum_prob AS upper_cum_bound
    FROM cumulative_prob
)
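
The running sum and the lag() call turn each question's weight into a half-open interval on a number line. A Python sketch of the same construction, using hypothetical weight sums:

```python
from itertools import accumulate

# Hypothetical per-question weight sums, already ordered by id.
weight_sums = [("q1", 2.0), ("q2", 1.9), ("q3", 2.0)]

# Running total, like SUM(weight_sum) OVER (ORDER BY id).
upper_bounds = list(accumulate(weight for _id, weight in weight_sums))
# Previous running total, like lag(cum_prob) with 0 for the first row.
lower_bounds = [0.0] + upper_bounds[:-1]

bounds = [
    (question_id, lower, upper)
    for (question_id, _w), lower, upper in zip(weight_sums, lower_bounds, upper_bounds)
]
```

Each interval's width equals the question's weight sum, so a uniform random point on the whole line lands in an interval with probability proportional to that weight.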

Generating the random series. I had to re-normalize (random() * (SELECT SUM(weight_sum) ...)) because the weights were updated in the previous step. 10 is the number of rows that we want to return.

WITH random_series AS (
    SELECT generate_series (1,10),random() * (SELECT SUM(weight_sum) FROM filtered_questions_weights_sum) AS R
)

And finally:

SELECT
      id, lower_cum_bound, upper_cum_bound, R
FROM random_series
JOIN cumulative_bounds
ON R::NUMERIC <@ numrange(lower_cum_bound::NUMERIC, upper_cum_bound::NUMERIC, '(]')
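
The range join can be mirrored in Python with a binary search over the upper bounds. This is only a sketch of the lookup logic under hypothetical data; the names `pick` and `draw` are not from the original answer:

```python
import bisect
import random

# Hypothetical ids and their upper_cum_bound values;
# each lower bound is the previous upper bound (0 for the first).
ids = ["q1", "q2", "q3"]
upper_bounds = [2.0, 3.9, 5.9]

def pick(r):
    """Return the id whose (lower, upper] interval contains r."""
    # bisect_left finds the first upper bound >= r.
    return ids[bisect.bisect_left(upper_bounds, r)]

def draw(rng, count):
    """Draw `count` ids with replacement, proportionally to interval width."""
    total = upper_bounds[-1]  # same re-scaling as random() * SUM(weight_sum)
    return [pick(rng.random() * total) for _ in range(count)]

picks = draw(random.Random(7), 10)
```

Each draw is independent, so the same id can come back more than once. One small difference from the SQL: a draw of exactly 0 is mapped to the first id here, whereas the '(]' range would leave it unmatched.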

And we get the following distribution:

id                                   lower_cum_bound upper_cum_bound r                   
------------------------------------ --------------- --------------- ------------------- 
380f46e9-f373-4b89-a863-05f484e6b3b6 0               2.0             0.41090718149207534 
42bcb088-fc19-4272-8c49-e77999edd01c 2.0             3.9             3.4483200465794654  
46a97f1d-789f-46e7-9d3b-bd881a22a32e 3.9             5.9             5.159445870062337   
46a97f1d-789f-46e7-9d3b-bd881a22a32e 3.9             5.9             5.524481557868421   
972d0296-acc3-4b44-b67d-928049d5e9c2 5.9             7.8             6.842470594821498   
bdcc26f7-ccaf-4f8f-9e0b-81b9a6d29cdb 11.6            13.5            12.207371663767844  
bdcc26f7-ccaf-4f8f-9e0b-81b9a6d29cdb 11.6            13.5            12.674184153741226  
c935e3de-f1b6-4399-b5eb-ed3a9194eb7b 15.5            17.5            17.16804686235264   
e5061aeb-53b7-4247-8404-87508c5ac723 21.4            23.4            22.622627633158118  
f8c37700-0c3a-457e-8882-7c65269482ea 25.4            27.3            26.841821723571048  

Putting it all together:

WITH filters (type, id, weight) AS (
        SELECT 'subject', '148232e0-dece-40d9-81e0-0fa675f040e5'::uuid, 0.5
        UNION SELECT 'subject', '854431bb-18ee-4efb-803f-185757d25235'::uuid, 0.4
        UNION SELECT 'area', 'e12863fb-afb7-45cf-9198-f9f58ebc80cf'::uuid, 1
        UNION SELECT 'institution', '7f56c89f-705e-45c7-98fb-fee470550edf'::uuid, 0.5
        UNION SELECT 'institution', '0066257b-b2e3-4ee8-8075-517a2aa1379e'::uuid, 0.5
        )
    ,
    filtered_questions AS
    (
        SELECT
            q.id,
            SUM(filters.weight) weight_sum
        FROM
        public.questions q
        INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
        INNER JOIN public.subjects s ON s.id = sq.subject_id
        INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
        INNER JOIN public.institutions i ON i.id = iq.institution_id
        INNER JOIN public.activity_areas_questions aq ON aq.question_id = q.id
        INNER JOIN public.activity_areas a ON a.id = aq.activity_area_id
        INNER JOIN filters
            ON (filters.type = 'subject' AND s.id IN(filters.id))
            OR (filters.type = 'area' AND a.id IN(filters.id))
            OR (filters.type = 'institution' AND i.id IN(filters.id))
        WHERE
            s.id IN (SELECT id from filters where type = 'subject')
            and i.id IN (SELECT id from filters where type = 'institution')
            and a.id IN (SELECT id from filters where type = 'area')
        GROUP BY q.id
    )
    ,
    cumulative_prob AS (
        SELECT
            id,
            SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
        FROM filtered_questions
    )
    ,
    cumulative_bounds AS (
        SELECT
            id,
            COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
            cum_prob AS upper_cum_bound
        FROM cumulative_prob
    )
    ,
    random_series AS
    (
        SELECT generate_series (1,14),random() * (SELECT SUM(weight_sum) FROM filtered_questions) AS R
    )
SELECT id, lower_cum_bound, upper_cum_bound, R
FROM random_series
JOIN cumulative_bounds
ON R::NUMERIC <@ numrange(lower_cum_bound::NUMERIC, upper_cum_bound::NUMERIC, '(]')
