I have a questions table and I need to get X questions to prepare a test. The questions need to be filtered according to multiple criteria (subject, institution, area, etc.), each with different weights.
The filters weight are dynamically setted and normalized outside the query. Ex.:
Some other points:
To illustrate, if I didn't want to weight the filters, I would do something like that:
SELECT
*
FROM
public.questions q
INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
INNER JOIN public.subjects s ON s.id = sq.subject_id
INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
INNER JOIN public.institutions i ON i.id = iq.institution_id
INNER JOIN public.areas_questions aq ON aq.question_id = q.id
INNER JOIN public.areas a ON a.id = aq.area_id
WHERE
s.id IN :subjects
AND a.id IN :areas
AND i.id IN :institutions
ORDER BY
random() limit 200
Desired output:
Question — Subject — Institution — Area
I thought in something along the lines:
How would you write such query / solve this problem?
What about something like this. This is just to demonstrate the idea, I'll leave the details up to you. In case you aren't familiar with this random selection method, if you randomly generate a number between 0 and 1, it has a 40% chance of being under .4. So rand() <= .4 will return true 40% of the time.
The assume you have or can create a "Filters" entity which looks a bit like this
CREATE TABLE Filters
( FieldName VARCHAR(100),
FieldValue VARCHAR(100),
Prob Float -- probability of selection based on Name and Value
);
SELECT DISTINCT TMP.* -- The fields you want. Distinct needed to get rid of
-- records which pass multiple conditions.
FROM (SELECT YRSWF.*,
RAND() AS rnd
FROM YourResultSetWithoutFilters YRSWF -- You can code the details
) TMP
INNER
JOIN Filters F
ON (
TMP.Subject = F.FieldValue
AND F.FieldName = 'Subject'
AND TMP.rnd <= F.prob
)
OR (
TMP.Institution = F.FieldValue
AND F.FieldName = 'Institution'
AND TMP.rnd <= F.prob
)
OR (
TMP.Area = F.FieldValue
AND F.FieldName = 'Area'
AND TMP.rnd <= F.prob
);
Ok. Managed to solve it. Basically, used the strategy already outlined in the question and a little help from here -- I had already seen this post before, but I was (and still am) trying to solve in a more elegant way -- something like this but for multiple rows --, not needing to create the "bounds" by hand.
Let's try step-by-step:
Since the filters, with the weights, come from outside the schema, let's create a CTE:
WITH filters (type, id, weight) AS (
SELECT 'subject', '148232e0-dece-40d9-81e0-0fa675f040e5'::uuid, 0.5
UNION SELECT 'subject', '854431bb-18ee-4efb-803f-185757d25235'::uuid, 0.4
UNION SELECT 'area', 'e12863fb-afb7-45cf-9198-f9f58ebc80cf'::uuid, 1
UNION SELECT 'institution', '7f56c89f-705e-45c7-98fb-fee470550edf'::uuid, 0.5
UNION SELECT 'institution', '0066257b-b2e3-4ee8-8075-517a2aa1379e'::uuid, 0.5
)
Now, let's filter the rows, ignoring the weight (for now), so later we don't need to work with the whole table:
WITH filtered_questions AS (
SELECT
q.id,
s.id subject_id,
a.id area_id,
i.id institution_id
FROM
public.questions q
INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
INNER JOIN public.subjects s ON s.id = sq.subject_id
INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
INNER JOIN public.institutions i ON i.id = iq.institution_id
INNER JOIN public.areas_questions aq ON aq.question_id = q.id
INNER JOIN public.areas a ON a.id = aq.area_id
WHERE
subject_id IN (SELECT id from filters where type = 'subject')
and institution_id IN (SELECT id from filters where type = 'institution')
and area_id IN (SELECT id from filters where type = 'area')
)
The same question can be selected by multiple filters, increasing the chance of it being selected. We must update the weights to solve this.
WITH filtered_questions_weights_sum AS (
SELECT
q.id,
SUM(filters.weight) weight_sum
FROM filtered_questions q
INNER JOIN filters
ON (filters.type = 'subject' AND q.subject_id IN(filters.id))
OR (filters.type = 'area' AND q.area_id IN(filters.id))
OR (filters.type = 'institution' AND q.institution_id IN(filters.id))
GROUP BY q.id
)
Generating the bounds, like exposed here .
WITH cumulative_prob AS (
SELECT
id,
SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
FROM filtered_questions_weights_sum
),
cumulative_bounds AS (
SELECT
id,
COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
cum_prob AS upper_cum_bound
FROM cumulative_prob
)
Generating the random series. Had to re-normalize ( random() * (SELECT SUM(weight_sum)
) because the weights were updated in a previous step. 10 is the number of rows that we want to return.
WITH random_series AS (
SELECT generate_series (1,10),random() * (SELECT SUM(weight_sum) FROM filtered_questions_weights_sum) AS R
)
And finally:
SELECT
id, lower_cum_bound, upper_cum_bound, R
FROM random_series
JOIN cumulative_bounds
ON R::NUMERIC <@ numrange(lower_cum_bound::NUMERIC, upper_cum_bound::NUMERIC, '(]')
And we get the following distribution:
id lower_cum_bound upper_cum_bound r
------------------------------------ --------------- --------------- -------------------
380f46e9-f373-4b89-a863-05f484e6b3b6 0 2.0 0.41090718149207534
42bcb088-fc19-4272-8c49-e77999edd01c 2.0 3.9 3.4483200465794654
46a97f1d-789f-46e7-9d3b-bd881a22a32e 3.9 5.9 5.159445870062337
46a97f1d-789f-46e7-9d3b-bd881a22a32e 3.9 5.9 5.524481557868421
972d0296-acc3-4b44-b67d-928049d5e9c2 5.9 7.8 6.842470594821498
bdcc26f7-ccaf-4f8f-9e0b-81b9a6d29cdb 11.6 13.5 12.207371663767844
bdcc26f7-ccaf-4f8f-9e0b-81b9a6d29cdb 11.6 13.5 12.674184153741226
c935e3de-f1b6-4399-b5eb-ed3a9194eb7b 15.5 17.5 17.16804686235264
e5061aeb-53b7-4247-8404-87508c5ac723 21.4 23.4 22.622627633158118
f8c37700-0c3a-457e-8882-7c65269482ea 25.4 27.3 26.841821723571048
Putting it all together:
WITH filters (type, id, weight) AS (
SELECT 'subject', '148232e0-dece-40d9-81e0-0fa675f040e5'::uuid, 0.5
UNION SELECT 'subject', '854431bb-18ee-4efb-803f-185757d25235'::uuid, 0.4
UNION SELECT 'area', 'e12863fb-afb7-45cf-9198-f9f58ebc80cf'::uuid, 1
UNION SELECT 'institution', '7f56c89f-705e-45c7-98fb-fee470550edf'::uuid, 0.5
UNION SELECT 'institution', '0066257b-b2e3-4ee8-8075-517a2aa1379e'::uuid, 0.5
)
,
filtered_questions AS
(
SELECT
q.id,
SUM(filters.weight) weight_sum
FROM
public.questions q
INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
INNER JOIN public.subjects s ON s.id = sq.subject_id
INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
INNER JOIN public.institutions i ON i.id = iq.institution_id
INNER JOIN public.activity_areas_questions aq ON aq.question_id = q.id
INNER JOIN public.activity_areas a ON a.id = aq.activity_area_id
INNER JOIN filters
ON (filters.type = 'subject' AND s.id IN(filters.id))
OR (filters.type = 'area' AND a.id IN(filters.id))
OR (filters.type = 'institution' AND i.id IN(filters.id))
WHERE
s.id IN (SELECT id from filters where type = 'subject')
and i.id IN (SELECT id from filters where type = 'institution')
and a.id IN (SELECT id from filters where type = 'area')
GROUP BY q.id
)
,
cumulative_prob AS (
SELECT
id,
SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
FROM filtered_questions
)
,
cumulative_bounds AS (
SELECT
id,
COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
cum_prob AS upper_cum_bound
FROM cumulative_prob
)
,
random_series AS
(
SELECT generate_series (1,14),random() * (SELECT SUM(weight_sum) FROM filtered_questions) AS R
)
SELECT id, lower_cum_bound, upper_cum_bound, R
FROM random_series
JOIN cumulative_bounds
ON R::NUMERIC <@ numrange(lower_cum_bound::NUMERIC, upper_cum_bound::NUMERIC, '(]')
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.