Only return rows that match all criteria

Question

Here is a rough schema:

create table images (
    image_id serial primary key,
    user_id int references users(user_id),
    date_created timestamp with time zone
);

create table images_tags (
    images_tag_id serial primary key,
    image_id int references images(image_id),
    tag_id int references tags(tag_id)       
);

The output should look like this:

{"images":[
    {"image_id":1, "tag_ids":[1, 2, 3]},
    ....
]}

The user is allowed to filter images based on user ID, tags, and an offset image_id. For instance, someone can say "user_id":1, "tags":[1, 2], "offset_image_id":500, which will give them all images that are from user_id 1, have both tags 1 AND 2, and have an image_id of 500 or less.

The tricky part is the "have both tags 1 AND 2". It is more straightforward (and faster) to return all images that have either 1, 2, or both. I don't see any way around this other than aggregating, but that is much slower.

Any help doing this quickly?

Here is the current query I am using which is pretty slow:

select * from (
    select i.*,u.handle,array_agg(t.tag_id) as tag_ids, array_agg(tag.name) as tag_names from (
        select i.image_id, i.user_id, i.description, i.url, i.date_created from images i
        where (?=-1 or i.user_id=?)
        and (?=-1 or i.image_id <= ?)
        and exists(
            select 1 from image_tags t
            where t.image_id=i.image_id
            and (?=-1 or user_id=?)
            and (?=-1 or t.tag_id in (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?))
        )
        order by i.image_id desc
    ) i
    left join image_tags t on t.image_id=i.image_id
    left join tag using (tag_id) --not totally necessary
    left join users u on i.user_id=u.user_id --not totally necessary
    group by i.image_id,i.user_id,i.description,i.url,i.date_created,u.handle) sub
where (?=-1 or sub.tag_ids @> ?)
limit 100;

Answer 1

When the execution plan of this statement is determined, at prepare time, the PostgreSQL planner doesn't know which of these ?=-1 conditions will be true.

So it has to produce a plan that may or may not filter on a specific user_id, may or may not filter on a range of image_id, and may or may not filter on a specific set of tag_id values. It's likely to be a dumb, unoptimized plan that can't take advantage of indexes.

While your current strategy of a big generic query that covers all cases is OK for correctness, for performance you might need to abandon it in favor of generating the minimal query for the parameterized conditions that are actually filled in.

In such a generated query, the ?=-1 or ... will disappear, only the joins that are actually needed will be present, and the dubious t.tag_id in (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) will go or be reduced to what's strictly necessary.

If it's still slow given certain sets of parameters, then you'll have a much easier starting point to optimize on.
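To make the "generate the minimal query" idea concrete, here is a minimal sketch in Python. The function name build_image_query, its parameter names, and the choice of Python as the host language are assumptions for illustration only; the table and column names follow the query in the question, and the ? placeholder style should be swapped for whatever your driver expects.

```python
from typing import List, Optional, Sequence, Tuple


def build_image_query(
    user_id: Optional[int] = None,
    tag_ids: Sequence[int] = (),
    offset_image_id: Optional[int] = None,
    limit: int = 100,
) -> Tuple[str, List[int]]:
    """Return (sql, params) containing only the filters the caller supplied.

    Hypothetical helper: no "?=-1 or ..." flags are emitted, so each
    combination of filters gets its own, simpler statement.
    """
    where: List[str] = []
    params: List[int] = []

    if user_id is not None:
        where.append("i.user_id = ?")
        params.append(user_id)
    if offset_image_id is not None:
        where.append("i.image_id <= ?")
        params.append(offset_image_id)
    if tag_ids:
        # De-duplicate: a repeated tag would make the count check unsatisfiable.
        wanted = sorted(set(tag_ids))
        placeholders = ", ".join("?" for _ in wanted)
        where.append(
            "i.image_id IN ("
            "SELECT image_id FROM image_tags "
            f"WHERE tag_id IN ({placeholders}) "
            "GROUP BY image_id HAVING count(*) = ?)"
        )
        params.extend(wanted)
        params.append(len(wanted))

    sql = "SELECT i.image_id FROM images i"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " ORDER BY i.image_id DESC LIMIT ?"
    params.append(limit)
    return sql, params


sql, params = build_image_query(user_id=1, tag_ids=[1, 2], offset_image_id=500)
print(sql)
print(params)  # [1, 500, 1, 2, 2, 100]
```

Only the image_id lookup is built here; the joins to the tag and users tables from the original query could be appended in the same conditional way when the caller actually needs those columns.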

As for the gist of the question, testing the exact match on all tags, you might want to try the idiomatic form in an inner subquery:

SELECT image_id FROM image_tags
  WHERE tag_id in (?,?,...)
  GROUP BY image_id HAVING count(*)=?

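where the last ? is the number of tags passed as parameters (and completely remove sub.tag_ids @> ? as an outer condition).

Note that count(*) only equals the number of requested tags if each (image_id, tag_id) pair appears at most once in image_tags and the parameter list contains no duplicate tags.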
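As a quick way to see the GROUP BY / HAVING count(*) form in action, here is a small self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the sample rows are made up for illustration; the query text itself is the same shape as the one above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image_tags (
        image_id INTEGER NOT NULL,
        tag_id   INTEGER NOT NULL,
        UNIQUE (image_id, tag_id)  -- count(*) below relies on no duplicate pairs
    );
    INSERT INTO image_tags VALUES
        (1, 1), (1, 2), (1, 3),   -- image 1: tags 1, 2, 3
        (2, 1),                   -- image 2: tag 1 only
        (3, 2), (3, 3);           -- image 3: tags 2, 3
""")

wanted = [1, 2]  # tags that must ALL be present
placeholders = ", ".join("?" for _ in wanted)
rows = conn.execute(
    f"""
    SELECT image_id FROM image_tags
    WHERE tag_id IN ({placeholders})
    GROUP BY image_id
    HAVING count(*) = ?
    ORDER BY image_id
    """,
    (*wanted, len(wanted)),
).fetchall()

matching_ids = [row[0] for row in rows]
print(matching_ids)  # [1]
```

Images 2 and 3 each match only one of the two requested tags, so their groups have a count of 1 and are filtered out by the HAVING clause.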

Answer 2

Among other things, your GROUP BY clause is likely wider than any of your indices (and/or includes columns in unlikely combinations). I'd probably re-write your query as follows (turning @Daniel's subquery for the tags into a CTE):

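WITH Tagged_Images AS (SELECT Image_Tags.image_id,
                              ARRAY_AGG(Tag.tag_id) AS tag_ids,
                              ARRAY_AGG(Tag.name) AS tag_names
                       FROM Image_Tags
                       JOIN Tag
                         ON Tag.tag_id = Image_Tags.tag_id
                       -- qualified, because tag_id exists in both joined tables
                       WHERE Image_Tags.tag_id IN (?, ?)
                       GROUP BY Image_Tags.image_id
                       -- this parameter is the number of tags listed in IN (...)
                       HAVING COUNT(*) = ?)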

SELECT Images.image_id, Images.user_id, 
       Images.description, Images.url, Images.date_created,
       Tagged_Images.tag_ids, Tagged_Images.tag_names,
       Users.handle
FROM Images
JOIN Tagged_Images
  ON Tagged_Images.image_id = Images.image_id
LEFT JOIN Users
       ON Users.user_id = Images.user_id
WHERE Images.user_id = ?
      AND Images.date_created < ?
ORDER BY Images.date_created, Images.image_id
LIMIT 100

(Untested, since no dataset was provided. Note that I'm assuming you're building the criteria dynamically, to avoid condition flags.)

Here's some other stuff:

Note that the images in Tagged_Images will have at minimum the indicated tags, but might have more. If you want images with only those tags (exactly 2, no more, no less), an additional level needs to be added to the CTE.

There are a number of examples floating around of stored procs that turn comma-separated lists into virtual tables (heck, I've done it with recursive CTEs), which you could use for the IN() clause. It doesn't matter that much here, though, since you need dynamic SQL anyway.

Assuming that Images.image_id is auto-generated, doing range searches on it or ordering by it is largely pointless. There are relatively few cases where humans care about the value held here. Except in cases where you're searching for one specific row (for updating/deleting/whatever), conceptual data sets don't really care either; the value itself is largely meaningless. What does image_id < 500 actually tell me? Nothing, just that a given number was assigned to it. Are you using it to restrict based on "early" versus "late" images? Then use the proper data for that, which would be date_created. For pagination? Well, you have to do that after all the other conditions, or you get weird page lengths (like 0 in some cases). Generated keys should be relied on for one property only: uniqueness. This is the reason I stuck it at the end of the ORDER BY: to ensure a consistent ordering. Assuming that date_created has a high enough resolution as a timestamp, even this is unnecessary.

I'm fairly certain your LEFT JOIN to Users should be a regular (INNER) JOIN, but you didn't provide enough information for me to be sure.
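On the "only those tags" point: rather than the extra CTE level mentioned above, one alternative (a swapped-in technique, not what this answer describes) is to count matching tags and total tags in the same HAVING clause. The sketch below is illustrative only; it uses Python's sqlite3 as a stand-in for PostgreSQL, invented sample rows, and a hypothetical helper named images_with_tags. Because it drops the WHERE filter, it scans every row of image_tags, so it may well be slower on a large table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE image_tags (image_id INTEGER, tag_id INTEGER,
                             UNIQUE (image_id, tag_id));
    INSERT INTO image_tags VALUES
        (1, 1), (1, 2), (1, 3),   -- image 1: tags 1, 2 and an extra tag 3
        (2, 1), (2, 2),           -- image 2: exactly tags 1 and 2
        (3, 1);                   -- image 3: tag 1 only
""")


def images_with_tags(wanted, exact=False):
    """Image ids carrying every tag in `wanted`; with exact=True, no other tags.

    Assumes `wanted` contains no duplicate tag ids.
    """
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT image_id FROM image_tags
        GROUP BY image_id
        HAVING sum(CASE WHEN tag_id IN ({placeholders}) THEN 1 ELSE 0 END) = ?
    """
    params = [*wanted, len(wanted)]
    if exact:
        # Total tag count must equal the requested count: no extra tags allowed.
        sql += " AND count(*) = ?"
        params.append(len(wanted))
    sql += " ORDER BY image_id"
    return [row[0] for row in conn.execute(sql, params)]


at_least = images_with_tags([1, 2])
exactly = images_with_tags([1, 2], exact=True)
print(at_least)  # [1, 2]
print(exactly)   # [2]
```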

Answer 3

Aggregation is not likely to be the thing slowing you down. A query such as:

select images.image_id
  from images
  join images_tags on (images.image_id=images_tags.image_id)
 where images_tags.tag_id in (1,2)
group by images.image_id
having count(*) = 2

will get you all of the images that have tags 1 and 2, and it will run quickly if you have indexes on both images_tags columns:

create index on images_tags(tag_id);
create index on images_tags(image_id);

The slowest part of the query is likely to be the in (...) part of the where clause. You can speed that up if you are prepared to create a temporary table holding the target tags:

create temp table target_tags(tag_id int primary key);
insert into target_tags values (1);
insert into target_tags values (2);

select images.image_id
  from images
  join images_tags on (images.image_id=images_tags.image_id)
  join target_tags on images_tags.tag_id=target_tags.tag_id
group by images.image_id
having count(*) = (select count(*) from target_tags)
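For anyone who wants to try the temporary-table variant without a PostgreSQL instance, here is a rough sketch using Python's sqlite3 module as a stand-in. The sample rows and the helper name images_matching are made up for illustration, and nothing here measures whether the temp table is actually faster than the in (...) list on your data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE images_tags (image_id INTEGER, tag_id INTEGER,
                              UNIQUE (image_id, tag_id));
    CREATE INDEX images_tags_tag_id ON images_tags (tag_id);
    INSERT INTO images_tags VALUES
        (1, 1), (1, 2),           -- image 1: tags 1, 2
        (2, 2),                   -- image 2: tag 2 only
        (3, 1), (3, 2), (3, 4);   -- image 3: tags 1, 2, 4
    CREATE TEMP TABLE target_tags (tag_id INTEGER PRIMARY KEY);
""")


def images_matching(tag_ids):
    """Reload target_tags, then return ids of images that carry every target tag."""
    conn.execute("DELETE FROM target_tags")
    # set() avoids primary-key violations when the caller repeats a tag.
    conn.executemany(
        "INSERT INTO target_tags VALUES (?)", [(t,) for t in set(tag_ids)]
    )
    rows = conn.execute("""
        SELECT it.image_id
        FROM images_tags it
        JOIN target_tags tt ON tt.tag_id = it.tag_id
        GROUP BY it.image_id
        HAVING count(*) = (SELECT count(*) FROM target_tags)
        ORDER BY it.image_id
    """).fetchall()
    return [row[0] for row in rows]


both_tags = images_matching([1, 2])
print(both_tags)  # [1, 3]
```

One convenience of this form is that the statement text never changes with the number of tags, since the count comes from target_tags itself.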

3 answers

solution1
1 ACCEPTED 2014-05-21 11:44:16

solution2
1 2014-05-21 13:29:00

solution3
0 2014-05-21 11:41:36
