
SQL query to compare row value to group values, with condition

I want to port some R code to Hadoop so that it can run on Impala or Hive as a SQL-like query. The code I have is based on this question:

R data table: compare row value to group values, with condition

For each row, I want to count the rows that have the same id, belong to subgroup 1, and have a strictly lower price.

Let's say I have the following data:

CREATE TABLE project
(
    id int,
    price int, 
    subgroup int
);

INSERT INTO project(id,price,subgroup) 
VALUES
    (1, 10, 1), 
    (1, 10, 1), 
    (1, 12, 1),
    (1, 15, 1),
    (1,  8, 2),
    (1, 11, 2),
    (2,  9, 1),
    (2, 12, 1),
    (2, 14, 2),
    (2, 18, 2);

Here is the output I would like to have (with the new column cheaper):

id  price   subgroup   cheaper
1   10      1          0 ( because no row is cheaper in id 1 subgroup 1)
1   10      1          0 ( because no row is cheaper in id 1 subgroup 1)
1   12      1          2 ( rows 1 and 2 are cheaper)
1   15      1          3
1    8      2          0 (nobody is cheaper in id 1 and subgroup 1)
1   11      2          2
2    9      1          0
2   12      1          1
2   14      2          2
2   18      2          2

Note that I always want to compare rows to the ones in subgroup 1, even when the rows are themselves in subgroup 2.
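To pin down the requirement before turning to SQL, here is a minimal brute-force sketch in plain Python. The function name cheaper_counts is hypothetical and used only for illustration; the data is the sample from the INSERT statement above.

```python
# Sample rows as (id, price, subgroup), copied from the INSERT statement.
ROWS = [(1, 10, 1), (1, 10, 1), (1, 12, 1), (1, 15, 1), (1, 8, 2),
        (1, 11, 2), (2, 9, 1), (2, 12, 1), (2, 14, 2), (2, 18, 2)]


def cheaper_counts(rows):
    """For each row, count rows with the same id in subgroup 1 and a strictly lower price."""
    out = []
    for id_, price, subgroup in rows:
        n = sum(1 for i, p, s in rows if i == id_ and s == 1 and p < price)
        out.append((id_, price, subgroup, n))
    return out


if __name__ == "__main__":
    for row in cheaper_counts(ROWS):
        print(row)
```

This is quadratic and only meant as a reference for checking a SQL query on small data, not as something to run on a Hadoop-sized table.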

You can join the table with itself, using a LEFT JOIN:

SELECT
  p.id,
  p.price,
  p.subgroup,
  COUNT(p2.id) AS cheaper
FROM
  project p LEFT JOIN project p2
  ON p.id=p2.id AND p2.subgroup=1 AND p.price>p2.price
GROUP BY
  p.id,
  p.price,
  p.subgroup
ORDER BY
  p.id, p.subgroup

COUNT(p2.id) counts only the rows where the join succeeds, that is, where a cheaper price exists for the same id in subgroup 1. When no such row exists, the LEFT JOIN fills p2.id with NULL, which COUNT(p2.id) ignores, so the result is 0.
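The difference between counting a column from the joined table and counting rows can be checked with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in for Hive or Impala; the function name count_variants and the aliases count_col and count_star are hypothetical.

```python
import sqlite3

# Sample rows as (id, price, subgroup), copied from the question.
ROWS = [(1, 10, 1), (1, 10, 1), (1, 12, 1), (1, 15, 1), (1, 8, 2),
        (1, 11, 2), (2, 9, 1), (2, 12, 1), (2, 14, 2), (2, 18, 2)]


def count_variants():
    """Run the LEFT JOIN query with both COUNT(p2.id) and COUNT(*)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE project (id int, price int, subgroup int)")
    conn.executemany("INSERT INTO project VALUES (?, ?, ?)", ROWS)
    result = conn.execute("""
        SELECT p.id, p.price, p.subgroup,
               COUNT(p2.id) AS count_col,   -- skips NULLs from unmatched rows
               COUNT(*)     AS count_star   -- counts the unmatched row too
        FROM project p LEFT JOIN project p2
          ON p.id = p2.id AND p2.subgroup = 1 AND p.price > p2.price
        GROUP BY p.id, p.price, p.subgroup
        ORDER BY p.id, p.subgroup, p.price
    """).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in count_variants():
        print(row)
```

For the row (1, 8, 2), which has no cheaper match, COUNT(p2.id) gives 0 while COUNT(*) gives 1, which is why the query counts p2.id.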

The only problem is that you expect these two rows:

1   10      1          0
1   10      1          0

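but my query will return only one of them, because it groups by id, price, and subgroup, which collapses identical rows into a single group. If your project table has another column that uniquely identifies each row, you could add that column to the GROUP BY to keep the duplicates. Please see a fiddle here.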
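A minimal sketch of the unique-ID variant, again using Python's built-in sqlite3 module as a stand-in engine. The column row_id is hypothetical: it is not part of the table in the question, and how such a key would be produced in Hive or Impala is outside this sketch.

```python
import sqlite3

# Sample rows as (id, price, subgroup), copied from the question.
ROWS = [(1, 10, 1), (1, 10, 1), (1, 12, 1), (1, 15, 1), (1, 8, 2),
        (1, 11, 2), (2, 9, 1), (2, 12, 1), (2, 14, 2), (2, 18, 2)]


def cheaper_with_row_id():
    """LEFT JOIN query grouped by a unique key so duplicate rows survive."""
    conn = sqlite3.connect(":memory:")
    # row_id is a hypothetical unique key; SQLite assigns 1, 2, 3, ... on insert.
    conn.execute("""
        CREATE TABLE project (
            row_id INTEGER PRIMARY KEY,
            id int, price int, subgroup int
        )
    """)
    conn.executemany(
        "INSERT INTO project (id, price, subgroup) VALUES (?, ?, ?)", ROWS)
    result = conn.execute("""
        SELECT p.id, p.price, p.subgroup, COUNT(p2.id) AS cheaper
        FROM project p LEFT JOIN project p2
          ON p.id = p2.id AND p2.subgroup = 1 AND p.price > p2.price
        GROUP BY p.row_id, p.id, p.price, p.subgroup
        ORDER BY p.row_id
    """).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in cheaper_with_row_id():
        print(row)
```

With the unique key in the GROUP BY, both (1, 10, 1) rows come back, each with a count of 0.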

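Or you could use an inline (correlated) subquery. Because it does not group the outer rows, it returns one output row per input row and so keeps the duplicates: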

SELECT
  p.id,
  p.price,
  p.subgroup,
  (SELECT COUNT(*)
   FROM project p2
   WHERE p2.id=p.id AND p2.subgroup=1 AND p2.price<p.price) AS cheaper
FROM
  project p
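The subquery version can be checked against the expected output from the question with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in engine, so it says nothing about whether a given Hive or Impala version accepts a correlated subquery in the SELECT list. The function name cheaper_subquery and the added ORDER BY are assumptions made for a deterministic result.

```python
import sqlite3

# Sample rows as (id, price, subgroup), copied from the question.
ROWS = [(1, 10, 1), (1, 10, 1), (1, 12, 1), (1, 15, 1), (1, 8, 2),
        (1, 11, 2), (2, 9, 1), (2, 12, 1), (2, 14, 2), (2, 18, 2)]


def cheaper_subquery():
    """Count cheaper subgroup-1 rows per input row with a correlated subquery."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE project (id int, price int, subgroup int)")
    conn.executemany("INSERT INTO project VALUES (?, ?, ?)", ROWS)
    result = conn.execute("""
        SELECT p.id, p.price, p.subgroup,
               (SELECT COUNT(*)
                FROM project p2
                WHERE p2.id = p.id
                  AND p2.subgroup = 1
                  AND p2.price < p.price) AS cheaper
        FROM project p
        ORDER BY p.id, p.subgroup, p.price   -- added only for a stable order
    """).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in cheaper_subquery():
        print(row)
```

On the sample data this returns all ten rows, including both (1, 10, 1) rows, with the counts listed in the question.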

