
Is it possible to remove duplicates from the result for the data set?

I have the following two tables, dim_customers and fact_daily_customer_shipments:

dim_customers 
+-------------+-----------------------+---------------------+
| customer_id | membership_start_date | membership_end_date |
+-------------+-----------------------+---------------------+
|         114 | 2015-01-01            | 2015-02-15          |
|         116 | 2015-02-01            | 2015-03-15          |
|         120 | 2015-02-15            | 2015-04-01          |
|         221 | 2015-03-15            | 2015-10-01          |
|         120 | 2015-05-15            | 2015-07-01          |
+-------------+-----------------------+---------------------+

fact_daily_customer_shipments
+-------------+------------+-----------------------+----------+
| shipment_id | ship_date  |           customer_id | quantity |
+-------------+------------+-----------------------+----------+
|           1 | 2015-02-13 |                   114 |        2 |
|           2 | 2015-03-01 |                   116 |        1 |
|           3 | 2015-03-01 |                   120 |        6 |
|           4 | 2015-03-01 |                   321 |       10 |
|           5 | 2015-06-01 |                   116 |        1 |
|           6 | 2015-10-01 |                   120 |        3 |
+-------------+------------+-----------------------+----------+

I want to join them to get a table with the following schema:

fact_shipments_by_membership_status 
+-----------+-------------------+----------+
| ship_date | membership_status | quantity |
+-----------+-------------------+----------+ 

Example results:

+------------+-----------+-----+
| ship_date  | is_member | sum |
+------------+-----------+-----+
| 2015-02-13 | Y         |   2 |
| 2015-03-01 | N         |  10 |
| 2015-03-01 | Y         |   7 |
| 2015-06-01 | N         |   1 |
| 2015-10-01 | N         |   3 |
+------------+-----------+-----+

This is the SQL I came up with:

select dc.ship_date, 
       case when dc.ship_date between dc.membership_start_date
                              and dc.membership_end_date then 'Y'
            else 'N'
       end as is_member, 
       sum(fc.quantity)
from dim_customers dc
    inner join fact_daily_customer_shipments fc on dc.customer_id = fc.customer_id

This SQL doesn't work as intended because customer_id values are repeated in both tables, so joining the tables on the key attribute customer_id yields duplicate rows.

Any thoughts on what the correct SQL approach would be?

The reason you are having issues with duplication is that you have two entries in the dim_customers table with the same customer_id value (but different membership dates). This means you need to change the JOIN condition to include the membership dates, so that a shipment only matches the membership period that covers its ship_date. By then changing to a LEFT JOIN, we can determine whether a customer was a member at the time by checking whether the customer_id value from dim_customers is NULL after the JOIN. So the query you should use is:

select fc.ship_date, 
       -- no matching membership period means the customer was not a member on ship_date
       case when dc.customer_id is null then 'N' else 'Y' end as is_member, 
       sum(fc.quantity)
from fact_daily_customer_shipments fc
left join dim_customers dc
       on dc.customer_id = fc.customer_id
      and fc.ship_date between dc.membership_start_date and dc.membership_end_date
group by fc.ship_date, is_member

Output:

ship_date   is_member   sum(fc.quantity)
2015-02-13  Y           2
2015-03-01  N           10
2015-03-01  Y           7
2015-06-01  N           1
2015-10-01  N           3

SQLFiddle Demo
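If you want to check the approach locally, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module. The choice of SQLite, the column types (dates stored as ISO-8601 text), and the variable names rows_per_shipment and result are my assumptions for illustration, not part of the original setup. It first counts how many rows each shipment produces when joining on customer_id alone, then runs a LEFT JOIN that also matches on the membership date range and maps a NULL match to 'N'.

```python
import sqlite3

# In-memory SQLite database holding the sample rows from the question.
# Assumption: dates are stored as ISO-8601 text, so BETWEEN compares correctly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customers (
    customer_id INTEGER,
    membership_start_date TEXT,
    membership_end_date TEXT
);
CREATE TABLE fact_daily_customer_shipments (
    shipment_id INTEGER,
    ship_date TEXT,
    customer_id INTEGER,
    quantity INTEGER
);
INSERT INTO dim_customers VALUES
    (114, '2015-01-01', '2015-02-15'),
    (116, '2015-02-01', '2015-03-15'),
    (120, '2015-02-15', '2015-04-01'),
    (221, '2015-03-15', '2015-10-01'),
    (120, '2015-05-15', '2015-07-01');
INSERT INTO fact_daily_customer_shipments VALUES
    (1, '2015-02-13', 114, 2),
    (2, '2015-03-01', 116, 1),
    (3, '2015-03-01', 120, 6),
    (4, '2015-03-01', 321, 10),
    (5, '2015-06-01', 116, 1),
    (6, '2015-10-01', 120, 3);
""")

# Joining on customer_id alone: customer 120 has two membership rows,
# so each of that customer's shipments appears twice.
rows_per_shipment = conn.execute("""
    SELECT fc.shipment_id, COUNT(*)
    FROM fact_daily_customer_shipments fc
    JOIN dim_customers dc ON dc.customer_id = fc.customer_id
    GROUP BY fc.shipment_id
    ORDER BY fc.shipment_id
""").fetchall()

# LEFT JOIN that also requires ship_date to fall inside a membership period.
# A NULL dc.customer_id means no period matched, i.e. not a member that day.
result = conn.execute("""
    SELECT fc.ship_date,
           CASE WHEN dc.customer_id IS NULL THEN 'N' ELSE 'Y' END AS is_member,
           SUM(fc.quantity) AS quantity
    FROM fact_daily_customer_shipments fc
    LEFT JOIN dim_customers dc
           ON dc.customer_id = fc.customer_id
          AND fc.ship_date BETWEEN dc.membership_start_date
                               AND dc.membership_end_date
    GROUP BY fc.ship_date, is_member
    ORDER BY fc.ship_date, is_member
""").fetchall()

for row in result:
    print(row)
```

With the sample data, rows_per_shipment shows shipments 3 and 6 (both for customer 120) matching two dim_customers rows each, which is where the duplicates come from. The second query returns one row per ship_date and membership status, matching the example results in the question. Note that this only avoids double counting as long as a customer's membership periods do not overlap.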
