Is it possible to remove duplicates from the result for the data set?
I have the following two tables, dim_customers and fact_daily_customer_shipments:
dim_customers
+-------------+-----------------------+---------------------+
| customer_id | membership_start_date | membership_end_date |
+-------------+-----------------------+---------------------+
| 114 | 2015-01-01 | 2015-02-15 |
| 116 | 2015-02-01 | 2015-03-15 |
| 120 | 2015-02-15 | 2015-04-01 |
| 221 | 2015-03-15 | 2015-10-01 |
| 120 | 2015-05-15 | 2015-07-01 |
+-------------+-----------------------+---------------------+
fact_daily_customer_shipments
+-------------+------------+-----------------------+----------+
| shipment_id | ship_date | customer_id | quantity |
+-------------+------------+-----------------------+----------+
| 1 | 2015-02-13 | 114 | 2 |
| 2 | 2015-03-01 | 116 | 1 |
| 3 | 2015-03-01 | 120 | 6 |
| 4 | 2015-03-01 | 321 | 10 |
| 5 | 2015-06-01 | 116 | 1 |
| 6 | 2015-10-01 | 120 | 3 |
+-------------+------------+-----------------------+----------+
Join them to get a table with the following schema:
fact_shipments_by_membership_status
+-----------+-------------------+----------+
| ship_date | membership_status | quantity |
+-----------+-------------------+----------+
Example results:
+------------+-----------+-----+
| ship_date | is_member | sum |
+------------+-----------+-----+
| 2015-02-13 | Y | 2 |
| 2015-03-01 | N | 10 |
| 2015-03-01 | Y | 7 |
| 2015-06-01 | N | 1 |
| 2015-10-01 | N | 3 |
+------------+-----------+-----+
The SQL I came up with:
select dc.ship_date,
case when dc.ship_date between dc.membership_start_date
and dc.membership_end_date then 'Y'
else 'N'
end as is_member,
sum(fc.quantity)
from dim_customers dc
inner join fact_daily_customer_shipments fc on dc.customer_id = fc.customer_id
This SQL doesn't make sense because I see duplicates in both tables. Joining the tables on the key attribute customer_id yields duplicates.
Any thoughts on what the correct SQL approach would be?
The reason you are having issues with duplication is that you have two entries in the dim_customers table with the same customer_id value (but different membership dates). This means you need to change the JOIN condition to include the membership dates. By then changing to a LEFT JOIN, we can determine whether a customer was a member at the time by whether the customer_id value from the joined dim_customers row is NULL: a NULL means no membership period covered the ship date.
So the query you should use is:
select fc.ship_date,
case when dc.customer_id is null then 'N' else 'Y' end as is_member,
sum(fc.quantity)
from fact_daily_customer_shipments fc
left join dim_customers dc
on dc.customer_id = fc.customer_id
and fc.ship_date between dc.membership_start_date and dc.membership_end_date
group by fc.ship_date, is_member
Output:
ship_date is_member sum(fc.quantity)
2015-02-13 Y 2
2015-03-01 N 10
2015-03-01 Y 7
2015-06-01 N 1
2015-10-01 N 3
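To check the approach against the sample data, here is a minimal sketch that loads both tables into an in-memory SQLite database using Python's standard sqlite3 module. The choice of SQLite, storing dates as ISO-8601 text, the total_quantity alias, and the added ORDER BY are assumptions made for this illustration and are not part of the original question. A NULL on the dim_customers side is mapped to 'N', since it means no membership period covered the ship date.

```python
import sqlite3

# In-memory database; dates are stored as ISO-8601 text, which compares
# correctly with BETWEEN (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customers (
    customer_id INTEGER,
    membership_start_date TEXT,
    membership_end_date TEXT
);
CREATE TABLE fact_daily_customer_shipments (
    shipment_id INTEGER,
    ship_date TEXT,
    customer_id INTEGER,
    quantity INTEGER
);
INSERT INTO dim_customers VALUES
    (114, '2015-01-01', '2015-02-15'),
    (116, '2015-02-01', '2015-03-15'),
    (120, '2015-02-15', '2015-04-01'),
    (221, '2015-03-15', '2015-10-01'),
    (120, '2015-05-15', '2015-07-01');
INSERT INTO fact_daily_customer_shipments VALUES
    (1, '2015-02-13', 114, 2),
    (2, '2015-03-01', 116, 1),
    (3, '2015-03-01', 120, 6),
    (4, '2015-03-01', 321, 10),
    (5, '2015-06-01', 116, 1),
    (6, '2015-10-01', 120, 3);
""")

# Joining on customer_id alone: customer 120 has two membership rows,
# so each of its shipments is matched twice.
naive_rows = conn.execute("""
    SELECT COUNT(*)
    FROM fact_daily_customer_shipments fc
    INNER JOIN dim_customers dc ON dc.customer_id = fc.customer_id
""").fetchone()[0]

# LEFT JOIN with the date range in the join condition: a shipment matches
# a dim_customers row only if it falls inside that membership period.
rows = conn.execute("""
    SELECT fc.ship_date,
           CASE WHEN dc.customer_id IS NULL THEN 'N' ELSE 'Y' END AS is_member,
           SUM(fc.quantity) AS total_quantity
    FROM fact_daily_customer_shipments fc
    LEFT JOIN dim_customers dc
           ON dc.customer_id = fc.customer_id
          AND fc.ship_date BETWEEN dc.membership_start_date
                               AND dc.membership_end_date
    GROUP BY fc.ship_date, is_member
    ORDER BY fc.ship_date, is_member
""").fetchall()

print(naive_rows)
for row in rows:
    print(row)
```

The customer_id-only join produces 7 rows from 6 shipments, which is the duplication described in the question. The date-aware LEFT JOIN returns the five rows listed under "Example results". Note that grouping by the is_member alias works in SQLite and MySQL but not in every database; where it is rejected, the CASE expression can be repeated in the GROUP BY clause.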