SQL greatest-n-per-group with relational table joins

I have three tables: image, categories, and image_category.

image:         id | title | imageURL
categories:     cat_id | cat_name
image_category: image_id | cat_id

My current query to select all images in order newest to oldest is:

SELECT image.id as ID, image.title as title, categories.cat_name as CAT 
FROM image_category 
LEFT JOIN image 
ON image_category.image_id = image.id 
INNER JOIN categories 
ON image_category.cat_id = categories.cat_id 
ORDER BY ID DESC

I would like to show the newest 4 images per category. The images with the largest image.id are the newest.

For example, if I had 3 categories with 40 images each, I want to show the newest 4 images from each category. Later I want to show the next 4 per category, and then the next 4 after that, until there are no images left.

This solution seems like what I'm looking for:

SELECT i1.*
FROM item i1
LEFT OUTER JOIN item i2
ON (i1.category_id = i2.category_id AND i1.item_id < i2.item_id)
GROUP BY i1.item_id
HAVING COUNT(*) < 4
ORDER BY category_id, date_listed;

But I have a relational table connecting image_id and cat_id, and I can't figure out how to implement this with that extra table join.

Would appreciate help from an SQL guru.

You're almost there. You just need to do the grouping on your image_category table, since that's where the cat_id values are.

SELECT ...
FROM image_category AS c1
LEFT OUTER JOIN image_category AS c2
  ON c1.cat_id = c2.cat_id AND c1.image_id < c2.image_id
GROUP BY c1.cat_id, c1.image_id
HAVING COUNT(*) < 4

Then once you've got that, you know that c1 contains the top four images per category. You can then join c1 to the image table to get other attributes:

SELECT i.id, i.title, c.cat_name AS CAT
FROM image_category AS c1
LEFT OUTER JOIN image_category AS c2
  ON c1.cat_id = c2.cat_id AND c1.image_id < c2.image_id
INNER JOIN image AS i ON c1.image_id = i.id
INNER JOIN categories AS c ON c1.cat_id = c.cat_id
GROUP BY c1.cat_id, c1.image_id
HAVING COUNT(*) < 4;

Although this isn't strictly legal SQL due to the single-value rule, MySQL will permit it (newer versions may require the ONLY_FULL_GROUP_BY SQL mode to be off).
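To check the self-join on a small data set, here is a minimal sketch that runs the same query shape through Python's built-in sqlite3 module. The sample rows, titles, and URLs are made up for illustration, and SQLite stands in for MySQL here only because, like MySQL, it tolerates the non-aggregated columns in the SELECT list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image (id INTEGER PRIMARY KEY, title TEXT, imageURL TEXT);
    CREATE TABLE categories (cat_id INTEGER PRIMARY KEY, cat_name TEXT);
    CREATE TABLE image_category (image_id INTEGER, cat_id INTEGER);
""")

# Hypothetical sample data: odd image ids go to "cats", even ids to "dogs".
conn.executemany("INSERT INTO categories VALUES (?, ?)", [(1, "cats"), (2, "dogs")])
for image_id in range(1, 13):
    conn.execute(
        "INSERT INTO image VALUES (?, ?, ?)",
        (image_id, f"img{image_id}", f"/img/{image_id}.jpg"),
    )
    conn.execute(
        "INSERT INTO image_category VALUES (?, ?)",
        (image_id, 1 if image_id % 2 else 2),
    )

# Each c1 row is joined to every newer image (c2) in the same category.
# The newest image has no match, but the LEFT JOIN still yields one row
# for it, so COUNT(*) < 4 keeps exactly the four newest rows per category.
TOP_N_SQL = """
    SELECT c.cat_name, i.id
    FROM image_category AS c1
    LEFT OUTER JOIN image_category AS c2
      ON c1.cat_id = c2.cat_id AND c1.image_id < c2.image_id
    INNER JOIN image AS i ON c1.image_id = i.id
    INNER JOIN categories AS c ON c1.cat_id = c.cat_id
    GROUP BY c1.cat_id, c1.image_id
    HAVING COUNT(*) < 4
    ORDER BY c1.cat_id, c1.image_id DESC
"""
top_rows = conn.execute(TOP_N_SQL).fetchall()
for cat_name, image_id in top_rows:
    print(cat_name, image_id)
```

Grouping by both c1.cat_id and c1.image_id matters if an image can belong to more than one category; grouping by image_id alone would merge its rows across categories.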


Copied from comments thread:

I would fetch the full result, store it in a cache, and then iterate over it however I want, using application code. That would be far simpler and have better performance. SQL is powerful, but another solution may be easier to develop, debug, and maintain.
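As a rough illustration of that application-side approach, the sketch below caches the rows per category and then slices out "the next 4 per category" on demand. The function names and the sample rows are hypothetical; in practice the rows would come from the plain join query in the question.

```python
from collections import defaultdict


def group_by_category(rows):
    """Build a cache of {cat_name: [image ids, newest first]}."""
    cache = defaultdict(list)
    for cat_name, image_id in rows:
        cache[cat_name].append(image_id)
    for ids in cache.values():
        ids.sort(reverse=True)  # largest id = newest image
    return dict(cache)


def page_per_category(cache, page, per_page=4):
    """Return the page-th slice of each category, skipping exhausted ones."""
    start = page * per_page
    sliced = {cat: ids[start:start + per_page] for cat, ids in cache.items()}
    return {cat: ids for cat, ids in sliced.items() if ids}


# Hypothetical (cat_name, image_id) rows: 10 images in "cats", 6 in "dogs".
rows = [("cats", i) for i in range(1, 11)] + [("dogs", i) for i in range(11, 17)]
cache = group_by_category(rows)

first = page_per_category(cache, 0)
third = page_per_category(cache, 2)
print(first)
print(third)
```

Paging stops naturally when the function returns an empty dict, which covers the "until there are no images left" requirement without another query.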

You can certainly use LIMIT to iterate through the result set:

SELECT i.id, i.title, c.cat_name AS CAT
FROM image_category AS c1
LEFT OUTER JOIN image_category AS c2
  ON c1.cat_id = c2.cat_id AND c1.image_id < c2.image_id
INNER JOIN image AS i ON c1.image_id = i.id
INNER JOIN categories AS c ON c1.cat_id = c.cat_id
GROUP BY c1.cat_id, c1.image_id
HAVING COUNT(*) < 4
ORDER BY c.cat_id, c1.image_id DESC
LIMIT 4 OFFSET 16;

But keep in mind that OFFSET does not resume where the last query stopped: the whole query runs again every time you view another page. MySQL has optimizations that let it stop once it has found enough rows, but it's still expensive if you iterate frequently or advance far into the series of pages.

There are two possible optimizations. One is to cache part of the result, on the theory that few users will want to advance through every page of a large paginated result. For example, fetch enough rows to populate ten pages and cache that. This reduces the number of queries a lot, and perhaps only 1% of the time will a user advance into the next set of ten pages.

SELECT i.id, i.title, c.cat_name AS CAT
FROM image_category AS c1
LEFT OUTER JOIN image_category AS c2
  ON c1.cat_id = c2.cat_id AND c1.image_id < c2.image_id
INNER JOIN image AS i ON c1.image_id = i.id
INNER JOIN categories AS c ON c1.cat_id = c.cat_id
GROUP BY c1.cat_id, c1.image_id
HAVING COUNT(*) < 4
ORDER BY c.cat_id, c1.image_id DESC
LIMIT 40 OFFSET 40; /* second set of ten pages */

Another optimization, if you can assume that any view of page N comes from a view of page N-1, is for the request to filter the categories based on the greatest cat_id seen on page N-1. You need to do it this way because OFFSET works by row number in the result set, whereas an indexed lookup works by the values found on those rows. The two are not the same if there are gaps or unused cat_id values.

SELECT i.id, i.title, c.cat_name AS CAT
FROM image_category AS c1
LEFT OUTER JOIN image_category AS c2
  ON c1.cat_id = c2.cat_id AND c1.image_id < c2.image_id
INNER JOIN image AS i ON c1.image_id = i.id
INNER JOIN categories AS c ON c1.cat_id = c.cat_id
WHERE c1.cat_id > 47 /* this value is the largest seen in previous page */
GROUP BY c1.cat_id, c1.image_id
HAVING COUNT(*) < 4
ORDER BY c.cat_id, c1.image_id DESC
LIMIT 40; /* no offset needed */
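Here is a small sketch of that "remember the last cat_id" paging, again using Python's sqlite3 module in place of MySQL. The data is hypothetical: five categories whose cat_id values have gaps, with five images each, and a page size of 8 rows (two categories of four). It assumes every category has at least four images, so a page always ends on a category boundary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image_category (image_id INTEGER, cat_id INTEGER)")

# Hypothetical data: cat_id values with gaps, five images per category.
image_id = 0
for cat_id in (3, 10, 47, 52, 90):
    for _ in range(5):
        image_id += 1
        conn.execute("INSERT INTO image_category VALUES (?, ?)", (image_id, cat_id))

# The ? placeholder takes the largest cat_id seen on the previous page,
# so no OFFSET is needed and gaps in cat_id do not matter.
PAGE_SQL = """
    SELECT c1.cat_id, c1.image_id
    FROM image_category AS c1
    LEFT OUTER JOIN image_category AS c2
      ON c1.cat_id = c2.cat_id AND c1.image_id < c2.image_id
    WHERE c1.cat_id > ?
    GROUP BY c1.cat_id, c1.image_id
    HAVING COUNT(*) < 4
    ORDER BY c1.cat_id, c1.image_id DESC
    LIMIT 8
"""


def fetch_page(last_cat_id):
    """Fetch the next page of rows for categories after last_cat_id."""
    return conn.execute(PAGE_SQL, (last_cat_id,)).fetchall()


page1 = fetch_page(0)                # start before the smallest cat_id
page2 = fetch_page(page1[-1][0])     # continue after the last cat_id seen
page3 = fetch_page(page2[-1][0])
print(page1, page2, page3, sep="\n")
```

If some categories hold fewer than four images, a fixed row LIMIT can cut a category in half, and filtering on cat_id alone would then skip its remaining rows; limiting by number of categories instead of rows is one way around that.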

Re your comments:

... using LIMIT and OFFSET will only trim those results and not move me down the list of rows.

LIMIT is working as intended; it applies to the resulting rows after GROUP BY and HAVING have done their work.

The way I was doing it before the greatest N per category query is by
1. pulling in x amount of images,
2. Remembering which was the last image, and then
3. using a subquery on my subsequent queries to get the next x amount of images with ids smaller than the last image. Is something like that possible with greatest N per group?

That's what my WHERE clause does in the last example above, without using a subquery. And I'm assuming you're advancing to the next higher set of cat_id's. This solution works only if you're advancing one page at a time, and in the positive direction.


All right, there's another solution for greatest-n-per-group that works with MySQL, but it relies on the user variables feature. SQLite doesn't have this feature.

SELECT * FROM (
  SELECT 
    i.id AS image_ID, i.imageURL AS URL, c.cat_name AS CAT, ic.cat_id,
    IF(@cat=ic.cat_id, @row:=@row+1, @row:=1) AS _row, @cat:=ic.cat_id AS _cat
  FROM (SELECT @cat:=null, @row:=0) AS _init
  CROSS JOIN image_category AS ic
  INNER JOIN image AS i ON ic.image_id = i.id
  INNER JOIN categories AS c ON ic.cat_id = c.cat_id
  ORDER BY ic.cat_id, ic.image_id DESC  /* newest first, so _row 1 is the newest image */
) AS x
WHERE _row BETWEEN 4 AND 6;  /* or choose any range you want */

This is similar to using ROW_NUMBER() OVER (PARTITION BY cat_id ...), which standard SQL and most RDBMSs support. MySQL gained window functions in 8.0 and SQLite in 3.25.0; older versions of either do not support them.

SELECT *
FROM (
  SELECT a.id AS ID, a.title AS title, b.cat_name AS CAT, b.cat_id AS cat_id,
         ROW_NUMBER() OVER (PARTITION BY b.cat_id ORDER BY a.id DESC) AS n
  FROM image a, categories b, image_category c
  WHERE a.id = c.image_id
    AND b.cat_id = c.cat_id
) x
WHERE n <= 4  /* rows 1-4 are the newest four per category */
ORDER BY cat_id, ID DESC;
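The ROW_NUMBER() form also makes "the next 4 per category" a matter of changing the range on n. The sketch below shows one way that might look, run through Python's sqlite3 module; it assumes the bundled SQLite is version 3.25.0 or newer, and the sample data and the fetch_ranked_page helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image (id INTEGER PRIMARY KEY, title TEXT, imageURL TEXT);
    CREATE TABLE categories (cat_id INTEGER PRIMARY KEY, cat_name TEXT);
    CREATE TABLE image_category (image_id INTEGER, cat_id INTEGER);
""")

# Hypothetical data: images 1-6 in "cats", images 7-11 in "dogs".
conn.executemany("INSERT INTO categories VALUES (?, ?)", [(1, "cats"), (2, "dogs")])
for image_id in range(1, 12):
    conn.execute(
        "INSERT INTO image VALUES (?, ?, ?)",
        (image_id, f"img{image_id}", f"/img/{image_id}.jpg"),
    )
    conn.execute(
        "INSERT INTO image_category VALUES (?, ?)",
        (image_id, 1 if image_id <= 6 else 2),
    )

# n numbers the images inside each category, newest first.
RANKED_SQL = """
    SELECT cat_name, id
    FROM (
      SELECT c.cat_id, c.cat_name, i.id,
             ROW_NUMBER() OVER (PARTITION BY c.cat_id ORDER BY i.id DESC) AS n
      FROM image_category AS ic
      INNER JOIN image AS i ON ic.image_id = i.id
      INNER JOIN categories AS c ON ic.cat_id = c.cat_id
    ) AS x
    WHERE n BETWEEN ? AND ?
    ORDER BY cat_id, id DESC
"""


def fetch_ranked_page(page, per_page=4):
    """Return rows ranked page*per_page+1 .. (page+1)*per_page in each category."""
    first = page * per_page + 1
    return conn.execute(RANKED_SQL, (first, first + per_page - 1)).fetchall()


newest = fetch_ranked_page(0)
following = fetch_ranked_page(1)
print(newest)
print(following)
```

An empty result from fetch_ranked_page signals that every category has been exhausted.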
