SQL Inner Join returns duplicates

Question

I have the following 2 tables:

tab1 with 37146 rows

week_ref with 730 rows

All I want to do is join those tables on year and week so that the first week day and last week day will display next to the columns of the first table.

Below is my query:

SELECT tab1.year
      ,tab1.week
      ,tab1.col3
      ,tab1.col4
      ,tab1.col5
      ,tab1.col6
      ,tab1.total
      ,tab1.col7
      ,week_ref.first_week_day
      ,week_ref.last_week_day

FROM dtsetname.tab1

JOIN spyros.week_ref ON (week_ref.year = tab1.year AND week_ref.week = tab1.week)

The query returns the 2 extra columns, but the result has 255535 rows, so it is full of duplicates. I thought I understood how joins work, but apparently not. Any help on this? The correct output should have only 37146 rows, since I just want to add 2 extra columns.

Thanks

Answer 1

The problem is that your week_ref table has a row for each day rather than per week.
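One way to confirm this before changing the join is to count how many week_ref rows share each (year, week) key. The numbers in the question are consistent with daily rows: 255535 / 37146 is roughly 6.9 matches per tab1 row, and 730 rows is about two years of days. The query below is only a diagnostic sketch using the table and column names from the question:

```sql
-- Any row returned here is a (year, week) key that appears more than once
-- in week_ref and therefore multiplies the matching tab1 rows in the join.
SELECT year, week, COUNT(*) AS rows_per_week
FROM spyros.week_ref
GROUP BY year, week
HAVING COUNT(*) > 1
ORDER BY rows_per_week DESC
```

If this returns nothing, the duplicates come from somewhere else and the join keys themselves should be checked.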

You can select just one day. If you have a weekday number or name (which I'm guessing that you do), that can be used:

FROM dtsetname.tab1 JOIN
     spyros.week_ref wr
     ON wr.year = tab1.year AND
        wr.week = tab1.week AND
        wr.dayname = 'Monday'

If such a column is not available, then you can either extract() the weekday from a date column or aggregate down to one row per week:

FROM dtsetname.tab1 JOIN
     (SELECT ANY_VALUE(wr).*
      FROM spyros.week_ref wr
      GROUP BY wr.year, wr.week
     ) wr
     ON wr.year = tab1.year AND
        wr.week = tab1.week
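For the extract() route mentioned above, a sketch might look like the following. It assumes week_ref has a DATE column holding the calendar day; the name day_date is hypothetical and does not appear in the question.

```sql
FROM dtsetname.tab1 JOIN
     spyros.week_ref wr
     ON wr.year = tab1.year AND
        wr.week = tab1.week AND
        -- day_date is an assumed column name; in BigQuery DAYOFWEEK
        -- returns 1 for Sunday, so 2 means Monday
        EXTRACT(DAYOFWEEK FROM wr.day_date) = 2
```

Note that any filter on a single weekday drops tab1 rows whose week has no such day in week_ref, for example a partial week at the start or end of a year. The aggregate version does not have that problem.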

Answer 2

The following is for BigQuery Standard SQL.

Before joining, you just need to deduplicate the data in the week_ref table, as in the example below:

#standardSQL
SELECT tab1.year
      ,tab1.week
      ,tab1.col3
      ,tab1.col4
      ,tab1.col5
      ,tab1.col6
      ,tab1.total
      ,tab1.col7
      ,week_ref.first_week_day
      ,week_ref.last_week_day
FROM dtsetname.tab1 tab1
JOIN (SELECT DISTINCT year, week, first_week_day, last_week_day FROM spyros.week_ref) week_ref
ON (week_ref.year = tab1.year AND week_ref.week = tab1.week)
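A quick sanity check, sketched with the same table names, is to compare the row count of tab1 with the row count of the deduplicated join. If first_week_day and last_week_day are constant within each (year, week), the two numbers should match, provided every week in tab1 exists in week_ref.

```sql
#standardSQL
SELECT
  (SELECT COUNT(*) FROM dtsetname.tab1) AS tab1_rows,
  (SELECT COUNT(*)
   FROM dtsetname.tab1 tab1
   JOIN (SELECT DISTINCT year, week, first_week_day, last_week_day
         FROM spyros.week_ref) week_ref
   ON week_ref.year = tab1.year AND week_ref.week = tab1.week) AS joined_rows
```

A joined_rows value above tab1_rows means some week still has more than one distinct pair of dates; a value below it means some tab1 weeks have no match in week_ref.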

Answer 3

First, I hope that year+week and year+day are the primary keys of the corresponding tables; otherwise the problem lies there.

If so, here is another thing to check: you join the tables on year and week, but in the first table I see many 52s in the week column, while in the second one I see 0 as a value.

There are only 52 weeks in a year, plus a day, so is it possible that you need to join on

week_ref.year = tab1.year AND week_ref.week = tab1.week+1
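Whether the week numbering is offset can be checked rather than guessed. The sketch below, which uses only the names from the question, lists the (year, week) pairs in tab1 that find no partner in week_ref with the plain join condition:

```sql
-- Weeks present in tab1 but missing from week_ref under the original keys.
SELECT tab1.year, tab1.week, COUNT(*) AS unmatched_rows
FROM dtsetname.tab1 tab1
LEFT JOIN (SELECT DISTINCT year, week FROM spyros.week_ref) wr
  ON wr.year = tab1.year AND wr.week = tab1.week
WHERE wr.year IS NULL
GROUP BY tab1.year, tab1.week
ORDER BY tab1.year, tab1.week
```

If week 52 shows up here while week_ref contains a week 0, the two tables probably number weeks differently and an offset such as the one above may be needed.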

Answer 4

I think the solutions mentioned by others should work if you are looking to join to your reference table to get week start/end dates.

However, if you think your tab1 table has definite values in the week and year columns (and if I understand your data correctly) you can avoid the join altogether to get your desired results:

select 
  year
  ,week
  ,col3
  ,col4
  ,col5
  ,col6
  ,total
  ,col7
  -- DAYOFWEEK is 1 for Sunday, so subtracting (DAYOFWEEK - 2) days lands on Monday; this assumes weeks start on Monday
  ,date_sub(weekdate, interval IF(EXTRACT(DAYOFWEEK FROM weekdate) = 1, 6, EXTRACT(DAYOFWEEK FROM weekdate) - 2) day) as first_week_day
  ,date_add(date_sub(weekdate, interval IF(EXTRACT(DAYOFWEEK FROM weekdate) = 1, 6, EXTRACT(DAYOFWEEK FROM weekdate) - 2) day), interval 6 day) as last_week_day
from (  
  select 
     tab1.year
    ,tab1.week
    ,tab1.col3
    ,tab1.col4
    ,tab1.col5
    ,tab1.col6
    ,tab1.total
    ,tab1.col7
    ,date_add(date(cast(tab1.year as int64), 1, 1), interval cast(tab1.week as int64) week) as weekdate
  from `mydataset.tab1` as tab1
)
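If weeks are meant to run Monday to Sunday (an assumption, since the question does not say), the same two dates can be computed more compactly with BigQuery's DATE_TRUNC and the WEEK(MONDAY) date part. This is a standalone sketch with two literal dates standing in for weekdate:

```sql
#standardSQL
SELECT
  weekdate,
  -- Monday on or before weekdate
  DATE_TRUNC(weekdate, WEEK(MONDAY)) AS first_week_day,
  -- the Sunday six days later
  DATE_ADD(DATE_TRUNC(weekdate, WEEK(MONDAY)), INTERVAL 6 DAY) AS last_week_day
FROM UNNEST([DATE '2019-07-15', DATE '2019-07-21']) AS weekdate
-- both rows give first_week_day = 2019-07-15 and last_week_day = 2019-07-21
```

For Sunday-based weeks, plain WEEK (or WEEK(SUNDAY)) could be used in place of WEEK(MONDAY).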

Hope it helps :)

Answer 1: score 1, 2019-07-15 14:45:07
Answer 2: score 1, ACCEPTED, 2019-07-15 17:55:48
Answer 3: score 0, 2019-07-15 14:47:57
Answer 4: score 0, 2019-07-16 04:11:03