Don't understand how queries to retrieve the top n records from each group work

I had an issue where I was trying to get the top 'n' records from each group (day) of records in my database. After a bunch of digging I found some great answers, and they did in fact solve my problem.

However, my noob-ness is preventing me from understanding exactly WHY these "counting" solutions work. If someone with better SQL knowledge can explain, that would be really great.

EDIT: here are more details

Let's say I had the table described below with this sample data. (To make things simpler, I have a column that keeps track of the time of the next upcoming midnight, in order to group 'per day' more easily.)

id | vote_time | time_of_midnight |    name    | votes_yay | votes_nay
------------------------------------------------------------------------
 1 |     a     |        b         |  Person p  |    24     |     36
 1 |     a     |        b         |  Person q  |    20     |     10
 1 |     a     |        b         |  Person r  |    42     |     22
 1 |     c     |        d         |  Person p  |     8     |     10
 1 |     c     |        d         |  Person s  |   120     |     63
  • There can be tens or hundreds of "People" per day (b, d, ...)
  • id is some other column I needed in order to group by (you can think of it as an election id if that helps)

I'm trying to calculate the top 5 names that had the highest number of votes per day, in descending order. I was able to use the referenced articles to create a query that would give me the following results (on Oracle):

SELECT name, time_of_midnight, votes_yay, votes_nay, (votes_yay+votes_nay) AS total_votes
FROM results a
WHERE id=1 AND (
    SELECT COUNT(*) 
    FROM results b
    WHERE b.id=a.id AND b.time_of_midnight=a.time_of_midnight AND (a.votes_yay+a.votes_nay) >= (b.votes_yay+b.votes_nay)) <= 5
ORDER BY time_of_midnight DESC, total_votes DESC;


   name    | time_of_midnight | votes_yay | votes_nay | total_votes
------------------------------------------------------------------------
 Person s  |         d        |     120   |     63    |     183
 Person p  |         d        |       8   |     10    |      18
 Person r  |         b        |      42   |     22    |      64
 Person p  |         b        |      24   |     36    |      60
 Person q  |         b        |      20   |     10    |      30

So I'm not really sure

  • Why this counting method works?
  • [stupid]: Why don't I need to also include name in the inner query to make sure it doesn't join the data incorrectly?

Let's begin with the fact that your query is actually calculating top 5 names that had the lowest number of votes. To get the top 5 with the highest number, you'll need to change this condition:

(a.votes_yay+a.votes_nay) >= (b.votes_yay+b.votes_nay)

into this:

(a.votes_yay+a.votes_nay) <= (b.votes_yay+b.votes_nay)

or, perhaps, this (which is the same):

(b.votes_yay+b.votes_nay) >= (a.votes_yay+a.votes_nay)

(The latter form would seem to me preferable, but merely because it would be uniform with the other two comparisons which have a b column on the left-hand side and an a column on the right-hand side. That is perfectly irrelevant to the correctness of the logic.)
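The sample output in the question hides the problem because neither day has more than 5 rows, so every row passes either way. A minimal sketch of the difference, assuming a cut-off of 2 instead of 5 and using SQLite through Python's sqlite3 as a stand-in for Oracle (the top_two helper and the single-day data set are illustrative, not from the question):

```python
import sqlite3

# Hypothetical sample: one election (id = 1), one day ('b'), three people.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INTEGER, time_of_midnight TEXT, name TEXT, "
    "votes_yay INTEGER, votes_nay INTEGER)"
)
conn.executemany(
    "INSERT INTO results VALUES (1, 'b', ?, ?, ?)",
    [("Person p", 24, 36), ("Person q", 20, 10), ("Person r", 42, 22)],
)

QUERY = """
SELECT name, votes_yay + votes_nay AS total_votes
FROM results a
WHERE (SELECT COUNT(*)
       FROM results b
       WHERE b.id = a.id
         AND b.time_of_midnight = a.time_of_midnight
         AND (b.votes_yay + b.votes_nay) {op} (a.votes_yay + a.votes_nay)) <= 2
ORDER BY total_votes DESC
"""

def top_two(op):
    # op compares b's total (left-hand side) with a's total (right-hand side).
    return conn.execute(QUERY.format(op=op)).fetchall()

lowest_two = top_two("<=")   # same logic as the question's a >= b
highest_two = top_two(">=")  # corrected comparison

print(lowest_two)   # [('Person p', 60), ('Person q', 30)]
print(highest_two)  # [('Person r', 64), ('Person p', 60)]
```

With the question's comparison, Person r (the top scorer at 64) is the row that gets dropped.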

Logically, what's happening is this. For every row in results, the server looks for rows in the same table that match the id and time_of_midnight of the given row and have the same or a higher number of total votes. It then counts the rows it found and checks that the result is not greater than 5, i.e. that no more than 5 rows in the same (id, time_of_midnight) group have the same or a higher number of votes than the given row.

For example, if the given row happens to be one with the most votes in its group, the subquery will find only that same row (assuming there are no ties) and so the count will be 1. That is fewer than 5 – therefore, the given row will qualify for output.

If the given row is the second most voted item in its group, the subquery will find that same row and the top-voted item (again, assuming no ties), which gives a count of 2. Again, that satisfies the count <= 5 condition, and so the row will be returned in the output.

In general, if a row is ranked #N in its group by total number of votes, there are N rows in that group whose vote total is the same as or higher than that of the given row (we are still assuming there are no ties). So, when you count rows in this way, you are effectively calculating the given row's ranking.
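One way to see this is to move the correlated subquery into the SELECT list, so the count shows up as a column. The sketch below does that on the question's sample data, again assuming SQLite via Python's sqlite3 in place of Oracle; the rank_in_group alias is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INTEGER, time_of_midnight TEXT, name TEXT, "
    "votes_yay INTEGER, votes_nay INTEGER)"
)
# Sample rows from the question (id = 1, days 'b' and 'd').
conn.executemany(
    "INSERT INTO results VALUES (1, ?, ?, ?, ?)",
    [
        ("b", "Person p", 24, 36),
        ("b", "Person q", 20, 10),
        ("b", "Person r", 42, 22),
        ("d", "Person p", 8, 10),
        ("d", "Person s", 120, 63),
    ],
)

# For each row a, count the rows b in the same group whose total is >= a's total.
ranked = conn.execute(
    """
    SELECT a.time_of_midnight, a.name,
           a.votes_yay + a.votes_nay AS total_votes,
           (SELECT COUNT(*)
            FROM results b
            WHERE b.id = a.id
              AND b.time_of_midnight = a.time_of_midnight
              AND (b.votes_yay + b.votes_nay) >= (a.votes_yay + a.votes_nay)
           ) AS rank_in_group
    FROM results a
    ORDER BY a.time_of_midnight DESC, total_votes DESC
    """
).fetchall()

for row in ranked:
    print(row)
```

Each row's count equals its position within its own day: 1 and 2 for day d, then 1, 2 and 3 for day b.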

Now, if there are ties, you may get fewer results per group using this method. In fact, if a group had 6 or more rows tied at the maximum number of votes, you would get no rows for that group in the output, because the subquery would never return a count value less than 6.

That is because effectively all the top-voted items would be ranked as 6 (or whatever their number would be) rather than as 1. To rank them as 1 instead, you could try the following modification of the same query:

SELECT name, time_of_midnight, votes_yay, votes_nay, (votes_yay+votes_nay) AS total_votes
FROM results a
WHERE id=1 AND (
    SELECT COUNT(*) + 1
    FROM results b
    WHERE b.id=a.id AND b.time_of_midnight=a.time_of_midnight
      AND (b.votes_yay+b.votes_nay) > (a.votes_yay+a.votes_nay)) <= 5
ORDER BY time_of_midnight DESC, total_votes DESC;

Now the subquery will be looking only for rows with the higher number of votes than in the given row. The resulting count will be increased by 1 and that will be the given row's ranking (and the value to compare against 5).

So, if the vote totals were e.g. 10, 10, 8, 7 etc., the rankings would be calculated as 1, 1, 3, 4 etc. rather than as 2, 2, 3, 4 etc., as with the original version.
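The two ranking rules can be compared side by side on that 10, 10, 8, 7 example. This is only a sketch: it assumes SQLite via Python's sqlite3 rather than Oracle, puts the whole total in votes_yay for brevity, and uses made-up names A to D and made-up aliases rank_inclusive and rank_strict:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INTEGER, time_of_midnight TEXT, name TEXT, "
    "votes_yay INTEGER, votes_nay INTEGER)"
)
# Hypothetical single group with a tie at the top: totals 10, 10, 8, 7.
conn.executemany(
    "INSERT INTO results VALUES (1, 'b', ?, ?, 0)",
    [("A", 10), ("B", 10), ("C", 8), ("D", 7)],
)

rows = conn.execute(
    """
    SELECT a.name,
           a.votes_yay + a.votes_nay AS total_votes,
           -- original rule: rows with the same or a higher total
           (SELECT COUNT(*)
            FROM results b
            WHERE b.id = a.id
              AND b.time_of_midnight = a.time_of_midnight
              AND (b.votes_yay + b.votes_nay) >= (a.votes_yay + a.votes_nay)
           ) AS rank_inclusive,
           -- modified rule: rows with a strictly higher total, plus one
           (SELECT COUNT(*) + 1
            FROM results b
            WHERE b.id = a.id
              AND b.time_of_midnight = a.time_of_midnight
              AND (b.votes_yay + b.votes_nay) > (a.votes_yay + a.votes_nay)
           ) AS rank_strict
    FROM results a
    ORDER BY total_votes DESC, a.name
    """
).fetchall()

for row in rows:
    print(row)
```

The tied rows A and B get 2 under the original rule and 1 under the modified one; C and D get 3 and 4 under both.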

That, of course, means that the output might now have more than 5 rows per group. For instance, if votes were distributed as 10, 9, 8, 8, 8, 8, 6 etc., you would get 10, 9 and all the 8s (because the rankings would be 1, 2, 3, 3, 3, 3, 7...). To return exactly 5 names per group (assuming there are at least 5 of them), you'd probably need to consider a different method altogether.
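One such different method, offered here only as a possibility and not something the question's referenced articles use, is the ROW_NUMBER() window function, which numbers the rows of each group 1, 2, 3, ... with no repeats. The sketch below assumes SQLite 3.25 or later via Python's sqlite3 (Oracle also provides ROW_NUMBER() as an analytic function), hypothetical names A to G, and an arbitrary choice to break ties alphabetically by name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INTEGER, time_of_midnight TEXT, name TEXT, "
    "votes_yay INTEGER, votes_nay INTEGER)"
)
# Hypothetical single group with totals 10, 9, 8, 8, 8, 8, 6.
conn.executemany(
    "INSERT INTO results VALUES (1, 'b', ?, ?, 0)",
    [("A", 10), ("B", 9), ("C", 8), ("D", 8), ("E", 8), ("F", 8), ("G", 6)],
)

# ROW_NUMBER() never repeats a value inside a partition, so rn <= 5 keeps
# exactly 5 rows per (id, time_of_midnight) group when at least 5 exist.
# Which of the tied rows survive depends on the tie-breaker (name, by assumption).
top_five = conn.execute(
    """
    SELECT name, total_votes
    FROM (
        SELECT name,
               votes_yay + votes_nay AS total_votes,
               ROW_NUMBER() OVER (
                   PARTITION BY id, time_of_midnight
                   ORDER BY votes_yay + votes_nay DESC, name
               ) AS rn
        FROM results
    ) ranked
    WHERE rn <= 5
    ORDER BY total_votes DESC, name
    """
).fetchall()

print(top_five)
```

Here the group yields A, B, C, D and E; F is cut even though it ties with C, D and E, which is the price of insisting on exactly 5 rows.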
