
SQL top records based on two tables relations

I have three main items I am storing: Articles, Entities, and Keywords. This makes 5 tables:

article { id }
entity {id, name}
article_entity {id, article_id, entity_id}
keyword {id, name}
article_keyword {id, article_id, keyword_id}

I would like to get all articles that contain the top X keywords + entities. I can get the top X keywords or entities with a simple group by on the entity_id/keyword_id.

-- Top 10 entities; for keywords, use keyword_id and article_keyword instead
SELECT entity_id, COUNT(*) AS num FROM article_entity
GROUP BY entity_id ORDER BY num DESC LIMIT 10

How would I get all articles that have a relation to the top entities and keywords?

This is what I imagined, but I know it doesn't work because grouping by entity limits the article_ids that are returned.

SELECT * FROM article
WHERE EXISTS (
    [... where article is mentioned in top X entities.. ]
) AND EXISTS (
    [... where article is mentioned in top X keywords.. ]
);

If I understand you correctly, the objective of the query is to find the articles that are related both to one of the top 10 entities and to one of the top 10 keywords. If that is the case, the following query should do it, by requiring that each article returned has a match in both the set of top 10 entities and the set of top 10 keywords.

Please give it a try.

SELECT a.id 
FROM article a
INNER JOIN article_entity  ae ON a.id = ae.article_id
INNER JOIN article_keyword ak ON a.id = ak.article_id
INNER JOIN (
  SELECT entity_id, COUNT(article_id) AS article_entity_count
  FROM article_entity
  GROUP BY entity_id 
  ORDER BY article_entity_count DESC LIMIT 10
) top_ae ON ae.entity_id = top_ae.entity_id
INNER JOIN (
  SELECT keyword_id, COUNT(article_id) AS article_keyword_count 
  FROM article_keyword
  GROUP BY keyword_id 
  ORDER BY article_keyword_count DESC LIMIT 10
) top_ak ON ak.keyword_id = top_ak.keyword_id
GROUP BY a.id;

The downside of using a simple LIMIT 10 in the two subqueries for top entities/keywords is that it doesn't handle ties: if the 11th keyword is just as popular as the 10th, it still won't be chosen. This can be fixed with a ranking function, but afaik MySQL doesn't have anything built in before version 8.0 (unlike the RANK() window function in Oracle or MSSQL); MySQL 8.0 and later do support window functions such as RANK().
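To illustrate the tie problem, here is a minimal sketch that ranks keywords with RANK() so that tied keywords are kept. It runs against an in-memory SQLite database through Python's sqlite3 module purely so the example is self-contained; that choice, the helper name top_keywords_with_ties and the toy rows are assumptions, not part of the answer above. It assumes an engine with window functions (SQLite 3.25+, or MySQL 8.0+ for the SQL itself).

```python
import sqlite3

def top_keywords_with_ties(conn, n):
    """Return keyword_ids whose usage count ranks within the top n, keeping ties."""
    rows = conn.execute(
        """
        SELECT keyword_id
        FROM (
            SELECT keyword_id,
                   RANK() OVER (ORDER BY cnt DESC) AS rnk
            FROM (
                SELECT keyword_id, COUNT(*) AS cnt
                FROM article_keyword
                GROUP BY keyword_id
            ) AS counts
        ) AS ranked
        WHERE rnk <= ?
        ORDER BY keyword_id
        """,
        (n,),
    ).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article_keyword "
    "(id INTEGER PRIMARY KEY, article_id INTEGER, keyword_id INTEGER)"
)
# Toy data: keyword 1 is used 3 times, keywords 2 and 3 tie at 2 uses,
# keyword 4 is used once.
conn.executemany(
    "INSERT INTO article_keyword (article_id, keyword_id) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (1, 3), (3, 3), (4, 4)],
)

top_two = top_keywords_with_ties(conn, 2)
print(top_two)
```

With this data a plain LIMIT 2 would return only two keywords, while the ranked version keeps both keywords tied for second place. The ranked subquery could replace the LIMIT subqueries in the join above, under the same version assumption.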

I set up a sample SQL Fiddle (but using fewer data points and limit 2 as I'm lazy).

Not knowing the volume of data you are working with, I would first recommend adding two columns to your article table that store its count of entities and keywords respectively. Then, via triggers on inserts and deletes on each link table, update the respective counter column. This way you don't have to run an expensive query each time the counts are needed, especially in a web-based interface. You can then just select from the article table ordered by the combined entity + keyword (E+K) count descending and be done with it, instead of constantly sub-querying the underlying tables.
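A rough sketch of the counter-column idea, run through Python's sqlite3 module so it can be tried without a server. The column names entity_count and keyword_count and the trigger names are hypothetical, and the trigger syntax is SQLite's; MySQL triggers need FOR EACH ROW and are written slightly differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    id            INTEGER PRIMARY KEY,
    entity_count  INTEGER NOT NULL DEFAULT 0,  -- hypothetical counter column
    keyword_count INTEGER NOT NULL DEFAULT 0   -- hypothetical counter column
);
CREATE TABLE article_entity  (id INTEGER PRIMARY KEY, article_id INTEGER, entity_id INTEGER);
CREATE TABLE article_keyword (id INTEGER PRIMARY KEY, article_id INTEGER, keyword_id INTEGER);

-- Keep entity_count in step with article_entity.
CREATE TRIGGER article_entity_ins AFTER INSERT ON article_entity
BEGIN
    UPDATE article SET entity_count = entity_count + 1 WHERE id = NEW.article_id;
END;
CREATE TRIGGER article_entity_del AFTER DELETE ON article_entity
BEGIN
    UPDATE article SET entity_count = entity_count - 1 WHERE id = OLD.article_id;
END;

-- Keep keyword_count in step with article_keyword.
CREATE TRIGGER article_keyword_ins AFTER INSERT ON article_keyword
BEGIN
    UPDATE article SET keyword_count = keyword_count + 1 WHERE id = NEW.article_id;
END;
CREATE TRIGGER article_keyword_del AFTER DELETE ON article_keyword
BEGIN
    UPDATE article SET keyword_count = keyword_count - 1 WHERE id = OLD.article_id;
END;
""")

conn.executemany("INSERT INTO article (id) VALUES (?)", [(1,), (2,)])
conn.executemany(
    "INSERT INTO article_entity (article_id, entity_id) VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 10)],
)
conn.executemany(
    "INSERT INTO article_keyword (article_id, keyword_id) VALUES (?, ?)",
    [(2, 20), (2, 21), (2, 22)],
)
# Removing a link decrements the counter again.
conn.execute("DELETE FROM article_entity WHERE article_id = 1 AND entity_id = 11")

# Ranking articles is now a plain read of the article table.
ranked = conn.execute(
    "SELECT id, entity_count, keyword_count, entity_count + keyword_count AS total "
    "FROM article ORDER BY total DESC, id"
).fetchall()
print(ranked)
```

The trade-off is extra work on every insert and delete in exchange for a cheap read; an update that moves a link row to another article would need its own trigger, which this sketch leaves out.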

Now, that said, the other suggestions are somewhat similar to what I am posting, but they all appear to limit each set to 10 records. Let's throw this scenario into the picture. Say articles 1-20 each have 8 to 10 entities and 1 or 2 keywords. Articles 21-50 have the reverse: 8 to 10 keywords and 1 or 2 entities. Then articles 51-58 each have 7 entities AND 7 keywords, for a combined total of 14 points. None of the queries would catch these, because the entity ranking would return only records from articles 1-20 and the keyword ranking only records from articles 21-50. Articles 51-58 would sit so far down each list that they would never be considered, even though their total is 14.

To handle this, each sub-query is a full query over its link table that returns every article ID and its count. Each is simply ordered by article_id, as that is the basis of the join to the master article table.

The coalesce() returns the count if one is available and 0 otherwise, and the two values are added together. The results are then ordered with the highest combined counts first, so that when the limit is applied the scenario's articles 51-58 come back along with a few of the others.

SELECT
      a.id,
      coalesce( JustE.ECount, 0 ) ECount,
      coalesce( JustK.KCount, 0 ) KCount,
      coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) TotalCnt
   from
      article a
         LEFT JOIN ( select article_id, COUNT(*) as ECount
                        from article_entity
                        group by article_id
                        order by article_id ) JustE
            on a.id = JustE.article_id
         LEFT JOIN ( select article_id, COUNT(*) as KCount
                        from article_keyword
                        group by article_id
                        order by article_id ) JustK
            on a.id = JustK.article_id
   order by
      coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) DESC
   limit 10
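To see the effect on a scaled-down version of that scenario, here is a small sketch that runs the same LEFT JOIN plus coalesce() shape on toy rows in an in-memory SQLite database via Python's sqlite3. The data and the lower-case alias names are made up for illustration; the redundant ORDER BY inside each sub-query is omitted and a.id is added as a tie-breaker so the output order is deterministic.

```python
import sqlite3

COMBINED_SQL = """
SELECT a.id,
       COALESCE(e.ecount, 0) AS ecount,
       COALESCE(k.kcount, 0) AS kcount,
       COALESCE(e.ecount, 0) + COALESCE(k.kcount, 0) AS totalcnt
FROM article a
LEFT JOIN (SELECT article_id, COUNT(*) AS ecount
           FROM article_entity
           GROUP BY article_id) e ON a.id = e.article_id
LEFT JOIN (SELECT article_id, COUNT(*) AS kcount
           FROM article_keyword
           GROUP BY article_id) k ON a.id = k.article_id
ORDER BY totalcnt DESC, a.id
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (id INTEGER PRIMARY KEY);
CREATE TABLE article_entity  (id INTEGER PRIMARY KEY, article_id INTEGER, entity_id INTEGER);
CREATE TABLE article_keyword (id INTEGER PRIMARY KEY, article_id INTEGER, keyword_id INTEGER);
""")
conn.executemany("INSERT INTO article (id) VALUES (?)", [(1,), (2,), (3,), (4,)])
# Article 1: 3 entities, no keywords. Article 3: 2 entities.
conn.executemany(
    "INSERT INTO article_entity (article_id, entity_id) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (3, 1), (3, 2)],
)
# Article 2: 3 keywords, no entities. Article 3: 2 keywords.
conn.executemany(
    "INSERT INTO article_keyword (article_id, keyword_id) VALUES (?, ?)",
    [(2, 1), (2, 2), (2, 3), (3, 1), (3, 2)],
)

combined = conn.execute(COMBINED_SQL).fetchall()
print(combined)
```

Article 3 leads neither the entity list nor the keyword list, yet it comes first on the combined count, and article 4 (no links at all) still appears with zeros thanks to the LEFT JOINs.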

I took this in several steps.

tl;dr This shows all the articles from the top (4) keywords and entities:

Here's a fiddle

select
  distinct article_id
from
(
select
  article_id
from
  article_entity ae
  inner join 
    (select
      entity_id, count(*)
    from
      article_entity
    group by
      entity_id
    order by 
      count(*) desc
    limit 4) top_entities on ae.entity_id = top_entities.entity_id
union all
select
  article_id
from
  article_keyword ak
  inner join 
    (select
      keyword_id, count(*)
    from
      article_keyword
    group by
      keyword_id
    order by 
      count(*) desc
    limit 4) top_keywords on ak.keyword_id = top_keywords.keyword_id) as articles

Explanation:

This starts with an effort to find the top X entities. (4 seemed to work for the number of associations I wanted to make in the fiddle.)

I didn't want to select articles here because that skews the group by; at this point you want to focus solely on the top entities. Fiddle

select
  entity_id, count(*)
from
  article_entity
group by
  entity_id
order by 
  count(*) desc
limit 4

Then I selected all the articles from these top entities. Fiddle

select
  *
from
  article_entity ae
  inner join 
    (select
      entity_id, count(*)
    from
      article_entity
    group by
      entity_id
    order by 
      count(*) desc
    limit 4) top_entities on ae.entity_id = top_entities.entity_id

Obviously the same logic needs to happen for the keywords. The queries are then unioned together (fiddle) and the distinct article ids are pulled from the union.

This will give you all articles that have a relation to at least one of the top (x) entities or one of the top (x) keywords.
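Whether the question wants articles matching either set or both is open to interpretation, so here is a hedged sketch contrasting the two, using Python's sqlite3 and toy rows. The helper name articles_for and the data are assumptions. It uses IN with a LIMIT subquery for brevity, which SQLite accepts but MySQL rejects inside IN subqueries, which is why the queries in this thread join against derived tables instead.

```python
import sqlite3

# Articles linked to one of the top :n entities.
TOP_ENTITY_ARTICLES = """
SELECT article_id FROM article_entity
WHERE entity_id IN (SELECT entity_id FROM article_entity
                    GROUP BY entity_id
                    ORDER BY COUNT(*) DESC, entity_id LIMIT :n)
"""
# Articles linked to one of the top :n keywords.
TOP_KEYWORD_ARTICLES = """
SELECT article_id FROM article_keyword
WHERE keyword_id IN (SELECT keyword_id FROM article_keyword
                     GROUP BY keyword_id
                     ORDER BY COUNT(*) DESC, keyword_id LIMIT :n)
"""

def articles_for(conn, combiner, n):
    """combiner is 'UNION' (either set) or 'INTERSECT' (both sets)."""
    sql = f"{TOP_ENTITY_ARTICLES} {combiner} {TOP_KEYWORD_ARTICLES} ORDER BY article_id"
    return [row[0] for row in conn.execute(sql, {"n": n})]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article_entity  (id INTEGER PRIMARY KEY, article_id INTEGER, entity_id INTEGER);
CREATE TABLE article_keyword (id INTEGER PRIMARY KEY, article_id INTEGER, keyword_id INTEGER);
""")
# Entity 1 is the top entity (articles 1, 2, 3); entity 2 appears once.
conn.executemany(
    "INSERT INTO article_entity (article_id, entity_id) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2)],
)
# Keyword 1 is the top keyword (articles 3, 4, 5); keyword 2 appears once.
conn.executemany(
    "INSERT INTO article_keyword (article_id, keyword_id) VALUES (?, ?)",
    [(3, 1), (4, 1), (5, 1), (1, 2)],
)

either = articles_for(conn, "UNION", 1)
both = articles_for(conn, "INTERSECT", 1)
print(either, both)
```

UNION already removes duplicates, so no extra DISTINCT is needed here; INTERSECT keeps only the articles present in both halves.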

This gets the articles that are among the 10 with the most keywords and also among the 10 with the most entities. You may get fewer than 10 records back, because an article may meet only one of the criteria (top entity but not top keyword, or top keyword but not top entity).

select *
from article a
inner join
    (select count(*), ae.article_id
     from article_entity ae
     group by ae.article_id
     order by count(*) desc limit 10) e
on a.id = e.article_id
inner join
    (select count(*), ak.article_id
     from article_keyword ak
     group by ak.article_id
     order by count(*) desc limit 10) k
on a.id = k.article_id
