I've been crunching some data for a piece of university coursework and I'm looking to optimise my query.
The dataset is the UK national police data on stop and searches, and I'm trying to find the correlation between ethnicity and the share of stop and searches each group receives.
I have a query which, for each police force and ethnicity combination, finds the total number of searches, the percentage of that force's searches carried out on that ethnicity, the national average percentage for that ethnicity, and the difference between the force's percentage and the national average (boring and confusing, I know).
This is my current query which 'works':
SELECT c1.FORCE,
c1.ETHNICITY,
(SELECT COUNT(*) FROM CRIMES WHERE FORCE = c1.FORCE AND ETHNICITY = c1.ETHNICITY) AS num_searches,
(ROUND(((SELECT COUNT(*) FROM CRIMES WHERE FORCE = c1.FORCE AND ETHNICITY = c1.ETHNICITY) /
(SELECT COUNT(*) FROM CRIMES WHERE FORCE = c1.FORCE)::DECIMAL), 4) * 100) AS percentage_of_force,
(SELECT ROUND((COUNT(*) / 303565::DECIMAL) * 100, 4) FROM CRIMES WHERE ETHNICITY = c1.ETHNICITY GROUP BY ETHNICITY) AS national_average,
(SELECT (ROUND(((SELECT COUNT(*) FROM CRIMES WHERE FORCE = c1.FORCE AND ETHNICITY = c1.ETHNICITY) /
(SELECT COUNT(*) FROM CRIMES WHERE FORCE = c1.FORCE)::DECIMAL), 4) * 100) - (SELECT ROUND((COUNT(*) / 303565::DECIMAL) * 100, 4) FROM CRIMES WHERE ETHNICITY = c1.ETHNICITY GROUP BY ETHNICITY)) AS difference_from_average
FROM (SELECT * FROM CRIMES) AS c1
GROUP BY c1.FORCE, c1.ETHNICITY
ORDER BY c1.FORCE, c1.ETHNICITY;
So the question I have revolves around reusing the same query in the 'SELECT' section more than once.
As you can see from the above query, difference_from_average is just the result of percentage_of_force minus national_average; however, I can't figure out a way to calculate these values once and then reuse them elsewhere in the SELECT section. So my question is: how can I achieve this?
Additional Info
Example Input Data
| date | ethnicity | force |
|------------|-----------|-----------------|
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | White | west-yorkshire |
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | White | north-yorkshire |
| 2018-01-01 | White | west-yorkshire |
| 2018-01-01 | Black | metropolitan |
| 2018-01-01 | Undefined | metropolitan |
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | White | norfolk |
| 2018-01-01 | White | north-yorkshire |
| 2018-01-01 | White | northumbria |
| 2018-01-01 | White | west-yorkshire |
| 2018-01-01 | Black | metropolitan |
| 2018-01-01 | Black | metropolitan |
| 2018-01-01 | Black | metropolitan |
| 2018-01-01 | Black | metropolitan |
| 2018-01-01 | White | metropolitan |
| 2018-01-01 | Black | metropolitan |
Example Query Result
| force | ethnicity | num_searches | percentage_of_force | national_average | difference_from_average |
|-------------------|-----------|--------------|---------------------|------------------|-------------------------|
| avon-and-somerset | Asian | 41 | 2.88 | 13.0641 | -10.1841 |
| avon-and-somerset | Black | 223 | 15.64 | 25.6798 | -10.0398 |
| avon-and-somerset | Other | 66 | 4.63 | 2.7368 | 1.8932 |
| avon-and-somerset | Undefined | 184 | 12.9 | 7.4699 | 5.4301 |
| avon-and-somerset | White | 912 | 63.96 | 50.941 | 13.019 |
| bedfordshire | Asian | 440 | 23.31 | 13.0641 | 10.2459 |
| bedfordshire | Black | 373 | 19.76 | 25.6798 | -5.9198 |
| bedfordshire | Mixed | 2 | 0.11 | 0.1084 | 0.0016 |
| bedfordshire | Other | 33 | 1.75 | 2.7368 | -0.9868 |
| bedfordshire | Undefined | 97 | 5.14 | 7.4699 | -2.3299 |
| bedfordshire | White | 943 | 49.95 | 50.941 | -0.991 |
| btp | Asian | 301 | 7.14 | 13.0641 | -5.9241 |
| btp | Black | 1274 | 30.23 | 25.6798 | 4.5502 |
| btp | Other | 71 | 1.68 | 2.7368 | -1.0568 |
| btp | Undefined | 48 | 1.14 | 7.4699 | -6.3299 |
| btp | White | 2521 | 59.81 | 50.941 | 8.869 |
I'm using PostgreSQL v11.2.
There are different ways to simplify the query. You could use a series of CTEs to pre-compute the results for the different levels of aggregation. But I think that the most efficient and readable option is to use window functions.
All intermediate counts can be computed in a subquery, using COUNT(...) OVER(...) with various PARTITION BY options, as follows:
SELECT
force,
ethnicity,
COUNT(*) OVER(PARTITION BY force, ethnicity) AS cnt,
COUNT(*) OVER(PARTITION BY force) AS cnt_force,
COUNT(*) OVER(PARTITION BY ethnicity) AS cnt_ethnicity,
ROW_NUMBER() OVER(PARTITION BY force, ethnicity) AS rn
FROM crimes
Then the outer query can compute the final results (while filtering on the first record in each force / ethnicity tuple to avoid duplicates).
Query:
SELECT
force,
ethnicity,
cnt AS num_searches,
ROUND(cnt / cnt_force::decimal * 100, 4) AS percentage_of_force,
ROUND(cnt_ethnicity / 303565::decimal * 100, 4) AS national_average,
ROUND(cnt / cnt_force::decimal * 100, 4)
- ROUND(cnt_ethnicity / 303565::decimal * 100, 4) AS difference_from_average
FROM (
SELECT
force,
ethnicity,
COUNT(*) OVER(PARTITION BY force, ethnicity) AS cnt,
COUNT(*) OVER(PARTITION BY force) AS cnt_force,
COUNT(*) OVER(PARTITION BY ethnicity) AS cnt_ethnicity,
ROW_NUMBER() OVER(PARTITION BY force, ethnicity) AS rn
FROM crimes
) x
WHERE rn = 1
ORDER BY force, ethnicity;
Results on the example input data (national_average is tiny here because the query still divides by the hard-coded full-dataset total of 303565, while the sample has only 20 rows):
| force | ethnicity | num_searches | percentage_of_force | national_average | difference_from_average |
| --------------- | --------- | ------------ | ------------------- | ---------------- | ----------------------- |
| metropolitan | Black | 6 | 46.1538 | 0.0020 | 46.1518 |
| metropolitan | Undefined | 1 | 7.6923 | 0.0003 | 7.6920 |
| metropolitan | White | 6 | 46.1538 | 0.0043 | 46.1495 |
| norfolk | White | 1 | 100.0000 | 0.0043 | 99.9957 |
| north-yorkshire | White | 2 | 100.0000 | 0.0043 | 99.9957 |
| northumbria | White | 1 | 100.0000 | 0.0043 | 99.9957 |
| west-yorkshire | White | 3 | 100.0000 | 0.0043 | 99.9957 |
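For comparison, the CTE alternative mentioned at the start of this answer might look roughly like the sketch below. It is not part of the original answer. To stay runnable without a database server it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (so 100.0 * ... replaces the ::decimal casts), loads the same counts as the question's example input, and derives the national total with COUNT(*) instead of the hard-coded 303565. The CTE names (by_force_ethnicity, by_force, by_ethnicity, overall, shares) are made up for illustration.

```python
import sqlite3

# Same counts as the question's 20-row example input: (force, ethnicity, copies).
sample = [
    ("metropolitan", "White", 6),
    ("metropolitan", "Black", 6),
    ("metropolitan", "Undefined", 1),
    ("west-yorkshire", "White", 3),
    ("north-yorkshire", "White", 2),
    ("norfolk", "White", 1),
    ("northumbria", "White", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crimes (force TEXT, ethnicity TEXT)")
for force, ethnicity, copies in sample:
    conn.executemany("INSERT INTO crimes VALUES (?, ?)", [(force, ethnicity)] * copies)

CTE_QUERY = """
WITH by_force_ethnicity AS (   -- one row per force / ethnicity pair
    SELECT force, ethnicity, COUNT(*) AS cnt FROM crimes GROUP BY force, ethnicity
),
by_force AS (                  -- total searches per force
    SELECT force, COUNT(*) AS cnt_force FROM crimes GROUP BY force
),
by_ethnicity AS (              -- total searches per ethnicity, nationwide
    SELECT ethnicity, COUNT(*) AS cnt_ethnicity FROM crimes GROUP BY ethnicity
),
overall AS (                   -- replaces the hard-coded 303565
    SELECT COUNT(*) AS cnt_total FROM crimes
),
shares AS (                    -- each percentage is computed exactly once
    SELECT fe.force AS force,
           fe.ethnicity AS ethnicity,
           fe.cnt AS num_searches,
           ROUND(100.0 * fe.cnt / f.cnt_force, 4) AS percentage_of_force,
           ROUND(100.0 * e.cnt_ethnicity / o.cnt_total, 4) AS national_average
    FROM by_force_ethnicity AS fe
    JOIN by_force AS f ON f.force = fe.force
    JOIN by_ethnicity AS e ON e.ethnicity = fe.ethnicity
    CROSS JOIN overall AS o
)
SELECT force, ethnicity, num_searches, percentage_of_force, national_average,
       ROUND(percentage_of_force - national_average, 4) AS difference_from_average
FROM shares
ORDER BY force, ethnicity
"""

cte_rows = conn.execute(CTE_QUERY).fetchall()
for row in cte_rows:
    print(row)
```

Because the total comes from the table itself, the national averages here are shares of the 20 sample rows, so they differ from the table above. In PostgreSQL the same statement should work with the casts restored, but that is untested here.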
The trick is to use subselects:
SELECT f(a, b), a, c
FROM (SELECT g(c, d) AS a,
h(c) AS b,
c, d
FROM x) AS q;
You get the idea.
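Applied to the question, that pattern could look like the sketch below. It is an illustration rather than the answerer's code: the inner query q names percentage_of_force and national_average once, and the outer SELECT reuses those aliases to get difference_from_average. To make it runnable as-is it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (window functions need SQLite 3.25 or later), loads the same counts as the question's example input, and takes the national total from the data instead of the hard-coded 303565. The aliases counts and q are arbitrary.

```python
import sqlite3

# Same counts as the question's 20-row example input: (force, ethnicity, copies).
sample = [
    ("metropolitan", "White", 6),
    ("metropolitan", "Black", 6),
    ("metropolitan", "Undefined", 1),
    ("west-yorkshire", "White", 3),
    ("north-yorkshire", "White", 2),
    ("norfolk", "White", 1),
    ("northumbria", "White", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crimes (force TEXT, ethnicity TEXT)")
for force, ethnicity, copies in sample:
    conn.executemany("INSERT INTO crimes VALUES (?, ?)", [(force, ethnicity)] * copies)

NESTED_QUERY = """
SELECT force, ethnicity, num_searches,
       ROUND(percentage_of_force, 4) AS percentage_of_force,
       ROUND(national_average, 4) AS national_average,
       -- the aliases from q are reused here instead of repeating the subqueries
       ROUND(percentage_of_force - national_average, 4) AS difference_from_average
FROM (
    SELECT force, ethnicity, cnt AS num_searches,
           100.0 * cnt / SUM(cnt) OVER (PARTITION BY force) AS percentage_of_force,
           100.0 * SUM(cnt) OVER (PARTITION BY ethnicity) / SUM(cnt) OVER () AS national_average
    FROM (
        -- one row per force / ethnicity pair
        SELECT force, ethnicity, COUNT(*) AS cnt
        FROM crimes
        GROUP BY force, ethnicity
    ) AS counts
) AS q
ORDER BY force, ethnicity
"""

nested_rows = conn.execute(NESTED_QUERY).fetchall()
for row in nested_rows:
    print(row)
```

Grouping first means the window sums run over one row per force / ethnicity pair, so no ROW_NUMBER() filter is needed to remove duplicates.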