I am trying to compare two lists of unique Household IDs using the Distinct clause. The problem comes when I try to pull in a third column consisting of timestamps into the results.
When I include only the two Household ID columns in the Select statement, the results seem to make sense. I get back two lists of unique IDs.
Here is that query:
select distinct e.household_id, a.hhid
FROM [dbo].[exposure] e
left outer join [dbo].[audience] a
on e.household_id = a.hhid
However, when I just add the "e.imp_ts" column to the Select statement, it looks like SQL completely disregards the Distinct part of the query and pulls in all the duplicate households in the files.
select distinct e.household_id, a.hhid, e.imp_ts
FROM [dbo].[exposure] e
left outer join [dbo].[audience] a
on e.household_id = a.hhid
Can someone please explain why the query doesn't work when I simply add a third column to the Select statement?
Thank you!
It is not that the second query "doesn't work", but rather that it is being asked to provide different results than the first query. DISTINCT applies to the whole selected row, not to a single column. As others in the comments have pointed out, because the imp_ts column contains more granular data, the DISTINCT can no longer return a unique list of household IDs. For example, household ID 12345 may have 5 records, each with a different timestamp, and all 5 rows are distinct once imp_ts is part of the result.
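To make this concrete, here is a minimal sketch using made-up rows (the timestamps and the second household ID are hypothetical sample data, not taken from the question's tables). It assumes SQL Server, where a VALUES list can be used as a derived table.

```sql
-- Hypothetical sample data standing in for [dbo].[exposure].
-- Household 12345 has three impressions, two of which are exact duplicates.
SELECT DISTINCT household_id, imp_ts
FROM (VALUES
        (12345, '2020-01-01 08:00:00'),
        (12345, '2020-01-01 09:30:00'),
        (12345, '2020-01-01 09:30:00'),  -- exact duplicate of the row above
        (67890, '2020-01-02 10:00:00')
     ) AS exposure (household_id, imp_ts);
```

Only the exact duplicate row is collapsed. Household 12345 still appears twice, because its two remaining rows differ in imp_ts. Selecting household_id alone with DISTINCT would return each household once.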
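In order to resolve this, you have some choices:
1. Remove imp_ts from the query.
2. Return only the earliest imp_ts for each household.
3. Return only the latest imp_ts for each household.
For #2 and #3 above, you can use MIN() or MAX() with a GROUP BY to achieve those results. Here is an example of using MIN():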
select e.household_id, a.hhid, MIN(e.imp_ts) AS min_imp_ts
FROM [dbo].[exposure] e
left outer join [dbo].[audience] a
on e.household_id = a.hhid
group by e.household_id, a.hhid
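The same pattern should work for the latest timestamp by swapping MIN() for MAX(), and both can be returned in one query. This is a sketch against the tables from the question; the aliases first_imp_ts and last_imp_ts are names chosen here for illustration only.

```sql
-- One row per household pair, with the earliest and latest impression times.
SELECT e.household_id,
       a.hhid,
       MIN(e.imp_ts) AS first_imp_ts,  -- earliest timestamp per household
       MAX(e.imp_ts) AS last_imp_ts    -- latest timestamp per household
FROM [dbo].[exposure] e
LEFT OUTER JOIN [dbo].[audience] a
    ON e.household_id = a.hhid
GROUP BY e.household_id, a.hhid;
```

Every column in the SELECT list that is not wrapped in an aggregate has to appear in the GROUP BY, which is why both e.household_id and a.hhid are listed there.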
I would suggest looking up GROUP BY examples online to get a better idea of what is happening.