
SQL: Rank by Group without summarising or joining

I need to calculate the sum(value) for groups in my dataset, then rank the groups based on that sum.

Here's an example dataset and the desired result. I want to know which cyl group has the highest total mpg (I know the result is nonsensical) and how the cyl groups rank against each other.

Data:

| model               | mpg  | cyl | gear | 
|---------------------|------|-----|------| 
| Mazda RX4           | 21   | 6   | 4    | 
| Mazda RX4 Wag       | 21   | 6   | 4    | 
| Datsun 710          | 22.8 | 4   | 4    | 
| Hornet 4 Drive      | 21.4 | 6   | 3    | 
| Hornet Sportabout   | 18.7 | 8   | 3    | 
| Valiant             | 18.1 | 6   | 3    | 
| Duster 360          | 14.3 | 8   | 3    | 
| Merc 240D           | 24.4 | 4   | 4    | 
| Merc 230            | 22.8 | 4   | 4    | 
| Merc 280            | 19.2 | 6   | 4    | 
| Merc 280C           | 17.8 | 6   | 4    | 
| Merc 450SE          | 16.4 | 8   | 3    | 
| Merc 450SL          | 17.3 | 8   | 3    | 
| Merc 450SLC         | 15.2 | 8   | 3    | 
| Cadillac Fleetwood  | 10.4 | 8   | 3    | 
| Lincoln Continental | 10.4 | 8   | 3    | 
| Chrysler Imperial   | 14.7 | 8   | 3    | 
| Fiat 128            | 32.4 | 4   | 4    | 
| Honda Civic         | 30.4 | 4   | 4    | 
| Toyota Corolla      | 33.9 | 4   | 4    | 
| Toyota Corona       | 21.5 | 4   | 3    | 
| Dodge Challenger    | 15.5 | 8   | 3    | 
| AMC Javelin         | 15.2 | 8   | 3    | 
| Camaro Z28          | 13.3 | 8   | 3    | 
| Pontiac Firebird    | 19.2 | 8   | 3    | 
| Fiat X1-9           | 27.3 | 4   | 4    | 
| Porsche 914-2       | 26   | 4   | 5    | 
| Lotus Europa        | 30.4 | 4   | 5    | 
| Ford Pantera L      | 15.8 | 8   | 5    | 
| Ferrari Dino        | 19.7 | 6   | 5    | 
| Maserati Bora       | 15   | 8   | 5    | 
| Volvo 142E          | 21.4 | 4   | 4    | 

This is the desired output:

| cyl | gear | SUM([MPG]) | sum_mpg_by_group  | RANK | 
|-----|------|------------|-------------------|------| 
| 4   | 3    | 21.5       | 293.3             | 1    | 
| 4   | 5    | 56.4       | 293.3             | 1    | 
| 4   | 4    | 215.4      | 293.3             | 1    | 
| 6   | 5    | 19.7       | 138.2             | 3    | 
| 6   | 3    | 39.5       | 138.2             | 3    | 
| 6   | 4    | 79         | 138.2             | 3    | 
| 8   | 5    | 30.8       | 211.4             | 2    | 
| 8   | 3    | 180.6      | 211.4             | 2    | 

Requirements:

This should be done without subqueries, WITH clauses or joins. I know it can be done with them, but for performance and brevity reasons I want to explore options without them.

In other words, is there a way to obtain the RANK of the group based on the GROUPED SUM without using joins?

The following gets me nearly there but obviously the RANK() statement fails so I'm just missing the last column.

-- Non working query

select cyl
    , gear
    , sum(mpg) as sum_mpg
    , sum(sum(mpg)) over (PARTITION BY cyl) as sum_mpg_by_group
  --, rank() over (group by model order by sum(mpg) desc) as RANK
from sample_data_mtcars
group by cyl, gear

This sounds like aggregation and dense_rank():

select s.*,
       dense_rank() over (order by sum_mpg_group desc) as ranking
from (select cyl, gear, sum(mpg) as sum_mpg,
             sum(sum(mpg)) over (partition by cyl) as sum_mpg_group
      from sample_data_mtcars
      group by cyl, gear
     ) s
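To check the nested query against the desired output, here is a minimal sketch that runs it in an in-memory SQLite database through Python's `sqlite3` module. SQLite (3.25 or later, for window functions) is only a convenient stand-in here, not the database from the question, and the `mpg`, `cyl`, `gear` column types, the `rank_groups` helper and the final `order by` are assumptions added for the demonstration.

```python
import sqlite3

# (mpg, cyl, gear) for the 32 rows of the question's table; model names omitted.
ROWS = [
    (21, 6, 4), (21, 6, 4), (22.8, 4, 4), (21.4, 6, 3),
    (18.7, 8, 3), (18.1, 6, 3), (14.3, 8, 3), (24.4, 4, 4),
    (22.8, 4, 4), (19.2, 6, 4), (17.8, 6, 4), (16.4, 8, 3),
    (17.3, 8, 3), (15.2, 8, 3), (10.4, 8, 3), (10.4, 8, 3),
    (14.7, 8, 3), (32.4, 4, 4), (30.4, 4, 4), (33.9, 4, 4),
    (21.5, 4, 3), (15.5, 8, 3), (15.2, 8, 3), (13.3, 8, 3),
    (19.2, 8, 3), (27.3, 4, 4), (26, 4, 5), (30.4, 4, 5),
    (15.8, 8, 5), (19.7, 6, 5), (15, 8, 5), (21.4, 4, 4),
]

# The nested query from the answer, plus an ORDER BY for stable printing.
QUERY = """
select s.*,
       dense_rank() over (order by sum_mpg_group desc) as ranking
from (select cyl, gear, sum(mpg) as sum_mpg,
             sum(sum(mpg)) over (partition by cyl) as sum_mpg_group
      from sample_data_mtcars
      group by cyl, gear
     ) s
order by ranking, gear
"""


def rank_groups(rows):
    """Load the rows into a throwaway table and return the ranked groups."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table sample_data_mtcars (mpg real, cyl integer, gear integer)"
    )
    conn.executemany("insert into sample_data_mtcars values (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


RESULT = rank_groups(ROWS)
for cyl, gear, sum_mpg, sum_mpg_group, ranking in RESULT:
    # Round only for display; the sums are floating-point values.
    print(cyl, gear, round(sum_mpg, 1), round(sum_mpg_group, 1), ranking)
```

dense_rank() is used rather than rank() because every (cyl, gear) row of a group shares the same group total: rank() would leave gaps after each run of ties, while dense_rank() numbers the groups consecutively.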

To do this in a single query level, you would have to be able to group by an OLAP (window) function, or to call an OLAP_FUNCTION() OVER ( an-OLAP-function ), and neither of those is possible.
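As a small illustration of that restriction, the sketch below tries to put one window function inside the ORDER BY of another window's OVER clause. It uses an in-memory SQLite database through Python's `sqlite3` module purely as an assumption for demonstration; the error text on other databases will differ, and `try_nested_window` is a hypothetical helper name.

```python
import sqlite3

# One query level: dense_rank() ordered by a windowed sum, i.e. a window
# function nested inside another window function's OVER clause.
NESTED = """
select cyl, gear, sum(mpg) as sum_mpg,
       dense_rank() over (
           order by sum(sum(mpg)) over (partition by cyl) desc
       ) as ranking
from sample_data_mtcars
group by cyl, gear
"""


def try_nested_window():
    """Return the database's error message, or None if the query was accepted."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table sample_data_mtcars (mpg real, cyl integer, gear integer)"
    )
    try:
        conn.execute(NESTED).fetchall()
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()
    return None


ERROR = try_nested_window()
print(ERROR)
```

The statement is rejected before any row is read, which is why the windowed sum has to be computed in an inner query first and ranked in an outer one.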

Why shy away from a nested query? Vertica has pipeline parallelism: while the rows are flowing through the inner query (or the Common Table Expression), another set of operators is working on the outer query, catching the rows coming from the inner query and processing them further. That uses more resources, but it isn't slow.

Try this:

Formulate your query as a nested query, as Gordon has suggested, and put the PROFILE keyword in front of it. When you do that, note the transaction_id and statement_id that are returned when you launch the profiling of that query.

Then you can check the execution_engine_profiles system table, filtering by transaction_id/statement_id, and see how many operators were working, check for elapsed time and CPU time to see that several operators run in parallel...
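The two steps above can be sketched as a pair of small Python helpers that only build the SQL text; actually running it needs a Vertica client, which is not shown. The helper names and the ids in the usage lines are hypothetical placeholders, and the choice of the `execution time (us)` and `clock time (us)` counters is an assumption about which rows of execution_engine_profiles are most useful here.

```python
def profile_statement(query: str) -> str:
    """Prefix a query with the PROFILE keyword."""
    return "PROFILE " + query.strip().rstrip(";") + ";"


def profile_lookup_sql(transaction_id: int, statement_id: int) -> str:
    """Build a lookup in execution_engine_profiles for one profiled statement."""
    # int() rejects anything that is not a plain number before it is
    # formatted into the SQL text.
    tx, st = int(transaction_id), int(statement_id)
    return (
        "SELECT node_name, operator_name, path_id, counter_name, counter_value\n"
        "FROM v_monitor.execution_engine_profiles\n"
        f"WHERE transaction_id = {tx} AND statement_id = {st}\n"
        "  AND counter_name IN ('execution time (us)', 'clock time (us)')\n"
        "ORDER BY path_id, operator_name, counter_name;"
    )


NESTED_QUERY = """
select s.*,
       dense_rank() over (order by sum_mpg_group desc) as ranking
from (select cyl, gear, sum(mpg) as sum_mpg,
             sum(sum(mpg)) over (partition by cyl) as sum_mpg_group
      from sample_data_mtcars
      group by cyl, gear
     ) s
"""

# Step 1: the statement to launch; it reports transaction_id and statement_id.
print(profile_statement(NESTED_QUERY))
# Step 2: the lookup, with placeholder ids standing in for the reported ones.
print(profile_lookup_sql(45035996273705217, 1))
```

Comparing the per-operator execution time with the clock time in the result is one way to see that operators of the inner and outer query were active at the same time.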
