
Select distinct customer_id

For each store_id, I would like to count, at the article_id level:

  • how many shared article_ids arrived first in store_A and in store_B, respectively.

  • If the arrival_timestamp for, e.g., article_id = 2 is earlier in store_A than in store_B (i.e. the article arrived first in store_A), then we count 1 for store_A and 0 for store_B.

See examples below:

Main table


arrival_timestamp           article_id   store_id

2019-04-01 11:04             2            A
2019-04-01 13:12             2            B
2019-04-01 08:24             4            A
2019-04-01 10:24             4            B
2019-04-10 07:00             7            A
2019-04-10 10:14             7            B
2019-04-23 07:34             9            A
2019-04-23 05:52             9            B

Output table


storeA_count_first_articles     storeB_count_first_articles
3                                1

You can use two levels of aggregation:

select
    sum(case when arrival_timestamp_a < arrival_timestamp_b then 1 else 0 end) storeA_count_first_articles,
    sum(case when arrival_timestamp_b < arrival_timestamp_a then 1 else 0 end) storeB_count_first_articles
from (
    select 
        article_id,
        min(case when store_id = 'A' then arrival_timestamp end) arrival_timestamp_a,
        min(case when store_id = 'B' then arrival_timestamp end) arrival_timestamp_b
    from mytable
    group by article_id
) t

The subquery uses conditional aggregation to compute the first arrival timestamp of each article in each store. The outer query then compares the two first-arrival timestamps of each article and produces the final counts.
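The two-level aggregation can be checked against the sample data from the question. The following is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for Presto and timestamps stored as 'YYYY-MM-DD HH:MM' text, which sorts chronologically. The function name count_first_arrivals is hypothetical and exists only for this illustration.

```python
import sqlite3

# Sample rows from the question: (arrival_timestamp, article_id, store_id).
ROWS = [
    ("2019-04-01 11:04", 2, "A"),
    ("2019-04-01 13:12", 2, "B"),
    ("2019-04-01 08:24", 4, "A"),
    ("2019-04-01 10:24", 4, "B"),
    ("2019-04-10 07:00", 7, "A"),
    ("2019-04-10 10:14", 7, "B"),
    ("2019-04-23 07:34", 9, "A"),
    ("2019-04-23 05:52", 9, "B"),
]

# Same query as above; only the "as" keywords are added for readability.
QUERY = """
select
    sum(case when arrival_timestamp_a < arrival_timestamp_b then 1 else 0 end) as storeA_count_first_articles,
    sum(case when arrival_timestamp_b < arrival_timestamp_a then 1 else 0 end) as storeB_count_first_articles
from (
    select
        article_id,
        min(case when store_id = 'A' then arrival_timestamp end) as arrival_timestamp_a,
        min(case when store_id = 'B' then arrival_timestamp end) as arrival_timestamp_b
    from mytable
    group by article_id
) t
"""


def count_first_arrivals(rows):
    """Load rows into an in-memory table and return (storeA_count, storeB_count)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table mytable "
            "(arrival_timestamp text, article_id integer, store_id text)"
        )
        conn.executemany("insert into mytable values (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchone()
    finally:
        conn.close()


print(count_first_arrivals(ROWS))  # (3, 1)
```

An article that exists in only one store leaves the other timestamp NULL, so both comparisons fall through to else 0 and the article is counted for neither store. That matches the requirement to count only shared articles.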

Another option uses row_number(), which avoids conditional logic and aggregation in the subquery:

select 
    sum(case when store_id = 'A' then 1 else 0 end) storeA_count_first_articles,
    sum(case when store_id = 'B' then 1 else 0 end) storeB_count_first_articles
from (
    select 
        t.*, 
        row_number() over(partition by article_id order by arrival_timestamp) rn
    from mytable t
) t
where rn = 1

I'm not familiar with Presto, but I think this should work based on their documentation. This answer is a general solution without needing to specifically name Store A and Store B in the query.

SELECT
    q.first_store_id AS store_id,
    COUNT(*) AS count_first_articles
FROM
    (
        -- One row per article: the store with the earliest arrival_timestamp.
        SELECT DISTINCT
            article_id,
            first_value( store_id ) OVER ( PARTITION BY article_id ORDER BY arrival_timestamp ) AS first_store_id
        FROM
            mytable
    ) AS q
GROUP BY
    q.first_store_id

This works for any number of store_id values without defining a column for each one manually, and because the results are row-oriented rather than column-oriented, they are easier to process in application code. If you still want named columns, you can produce them in an outer query or with PIVOT / UNPIVOT (apparently Presto doesn't support PIVOT yet, but you can still do it in application code).

You'll get results like this:

store_id        count_first_articles
      A                            3
      B                            1
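Row-oriented results like these can be turned into named columns in application code. A minimal sketch in Python, assuming the rows arrive as (store_id, count_first_articles) tuples from whatever database client is in use; the function name pivot_counts and its stores parameter are hypothetical:

```python
def pivot_counts(rows, stores=None):
    """Turn (store_id, count) rows into a {column_name: count} mapping.

    If `stores` is given, every listed store gets a column, with 0 for
    stores that never received an article first; other stores are dropped.
    """
    counts = {store_id: count for store_id, count in rows}
    if stores is not None:
        counts = {store_id: counts.get(store_id, 0) for store_id in stores}
    return {
        f"store{store_id}_count_first_articles": count
        for store_id, count in counts.items()
    }


result_rows = [("A", 3), ("B", 1)]
print(pivot_counts(result_rows))
print(pivot_counts(result_rows, stores=["A", "B", "C"]))
```

Passing an explicit stores list is useful because a store that never received any article first does not appear in the query result at all.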

The magic is in first_value, which is a window function, and Presto has a decent set of window functions built in.

To convert the row-based results into your original column-based example output, do this:

SELECT
    SUM( CASE WHEN q2.store_id = 'A' THEN q2.count_first_articles END ) AS storeA_count_first_articles,
    SUM( CASE WHEN q2.store_id = 'B' THEN q2.count_first_articles END ) AS storeB_count_first_articles
FROM
    (
        SELECT
            q.first_store_id AS store_id,
            COUNT(*) AS count_first_articles
        FROM
            (
                -- One row per article: the store with the earliest arrival_timestamp.
                SELECT DISTINCT
                    article_id,
                    first_value( store_id ) OVER ( PARTITION BY article_id ORDER BY arrival_timestamp ) AS first_store_id
                FROM
                    mytable
            ) AS q
        GROUP BY
            q.first_store_id
    ) AS q2

Giving:

storeA_count_first_articles     storeB_count_first_articles
3                                1

While this answer is superficially more complicated (well, more nested) than the other answers, it is a general solution that doesn't need modifications when you want to look at more stores besides 'A' and 'B'.

You can use two levels of aggregation. One method is:

select sum(case when first_store_id = 'A' then 1 else 0 end) as first_a,
       sum(case when first_store_id = 'B' then 1 else 0 end) as first_b       
from (select distinct article_id,
             first_value(store_id) over (partition by article_id order by arrival_timestamp) as first_store_id
      from t
     ) t;

Note: The inner aggregation uses select distinct as a convenience. The outer aggregation doesn't use group by because you want only one row in the result set.

This can also be written in Presto using min_by() and an explicit aggregation:

select sum(case when first_store_id = 'A' then 1 else 0 end) as first_a,
       sum(case when first_store_id = 'B' then 1 else 0 end) as first_b       
from (select article_id, min_by(store_id, arrival_timestamp) as first_store_id
      from t
      group by article_id
     ) t;

Note: Both these queries assume you do not have other stores. If you do and you only care about these two, then add where store_id in ('A', 'B') to the subqueries, so that other stores are excluded before the first store is picked.
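To make the min_by() logic and the effect of the store filter concrete, here is a minimal pure-Python sketch of the same idea: keep, per article, the store with the smallest arrival timestamp, then count articles per store. It only mirrors the semantics and does not call Presto; the function name count_first_stores and the extra store 'C' row are hypothetical, added for illustration.

```python
from collections import Counter

# Sample rows from the question: (arrival_timestamp, article_id, store_id).
ROWS = [
    ("2019-04-01 11:04", 2, "A"),
    ("2019-04-01 13:12", 2, "B"),
    ("2019-04-01 08:24", 4, "A"),
    ("2019-04-01 10:24", 4, "B"),
    ("2019-04-10 07:00", 7, "A"),
    ("2019-04-10 10:14", 7, "B"),
    ("2019-04-23 07:34", 9, "A"),
    ("2019-04-23 05:52", 9, "B"),
]


def count_first_stores(rows, stores=None):
    """Count, per store, the articles that arrived there first.

    `stores` plays the role of `where store_id in (...)`: rows from other
    stores are dropped before the earliest arrival is chosen.
    """
    first = {}  # article_id -> (arrival_timestamp, store_id) of earliest row
    for arrival_timestamp, article_id, store_id in rows:
        if stores is not None and store_id not in stores:
            continue
        best = first.get(article_id)
        if best is None or arrival_timestamp < best[0]:
            first[article_id] = (arrival_timestamp, store_id)
    return Counter(store_id for _, store_id in first.values())


# Hypothetical third store that receives article 9 before both A and B.
rows_with_c = ROWS + [("2019-04-23 01:00", 9, "C")]

print(count_first_stores(ROWS))
print(count_first_stores(rows_with_c))
print(count_first_stores(rows_with_c, stores={"A", "B"}))
```

Without the filter, store 'C' takes article 9 and store 'B' ends up with no first arrivals; with the filter, the counts for 'A' and 'B' are the same as for the original data.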
