How to get correct min, max date for each customer's changing label in wide format in BigQuery?
I have a table that records customer purchases, for example:
customer_id | label | date | purchase_id | price |
---|---|---|---|---|
2 | A | 2022-01-01 | asd | 10 |
3 | A | 2022-01-01 | asdf | 5 |
4 | B | 2022-02-04 | asdfg | 200 |
2 | A | 2022-01-03 | asdjg | 4 |
3 | B | 2022-02-01 | dfs | 20 |
2 | G | 2022-04-05 | fdg | 40 |
2 | G | 2022-04-10 | fdg | 40 |
2 | A | 2022-06-06 | fgd | 20 |
I want to see how many days and how much money each customer has spent in each label. So far, what I'm doing is:
SELECT
  customer_id,
  label,
  COUNT(DISTINCT purchase_id) AS orders_count,
  SUM(price) AS total_spent,
  MIN(date) AS first_date,
  MAX(date) AS last_date,
  DATE_DIFF(MAX(date), MIN(date), DAY) AS days
FROM
  TABLE
WHERE
  date > '2022-01-01'
GROUP BY
  customer_id,
  label
which gives me a long table, like this:
customer_id | label | orders_count | total_spent | first_date | last_date | days |
---|---|---|---|---|---|---|
2 | A | 3 | 34 | 2022-01-01 | 2022-06-06 | 180 |
2 | G | 1 | 40 | 2022-04-05 | 2022-04-10 | 5 |
etc.
Just for simplicity I show a few columns, but customers have orders all the time. The issue with the above is that customer 2, for example, starts with label A, then changes to G, then goes back to A, and this is not visible in the results table: min(date) is correct, but max(date) picks up the max(date) of their second A period. I would also prefer to have the result in wide format. Ideally, columns called next_label_{i}, holding the values for each label change, would be the best for me.
Could you advise me on a way a) to accommodate this label change (where a future label is the same as an earlier label) and b) to produce it in wide format?
Thanks
edit: example output (correct dates, wide format) [the columns would go as wide as the max number of unique labels for any customer]
customer_id | first_label | first_first_date | first_last_date | first_total_spent | first_days | next_label | next_first_date | next_last_date | next_days | next_label_2 | next_first_date_2 | next_last_date_2 | next_days_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | A | 2022-01-01 | 2022-01-03 | 2 | 14 | G | 2022-04-05 | 2022-04-05 | 0 | A | 2022-06-06 | 2022-06-06 | 0 |
etc.
Sorry, this is not exactly accurate (it is missing orders_count and total_spent), but it's a pain to format it here; hopefully you get the idea. In principle, it is as if you had used Python's pivot_table on the previous dataset.
Alternatively, I'd be glad for just a solution in the long format that distinguishes between a customer's label and the same customer's repeated label (as with customer 2, who starts with A and, after changing to G, returns to A).
> Could you advise me of... b) a way to produce it into a wide format?
First, I want to say that I hope you have a really good reason to want that output: it is usually not considered best practice, and is better left for the presentation layer to handle.
With that in mind, consider the approach below:
select * from (
  -- one row per purchase, numbered per customer in date order (offset 0, 1, 2, ...)
  select customer_id, offset, purchase.*
  from (
    select customer_id,
      array_agg(struct(label, date, purchase_id, price) order by date) purchases
    from your_table
    group by customer_id
  ), unnest(purchases) purchase with offset
  order by customer_id, offset
)
-- spread each numbered purchase into its own set of columns
pivot (
  any_value(label) label,
  any_value(date) date,
  any_value(purchase_id) purchase_id,
  any_value(price) price
  for offset in (0,1,2,3,4,5)
)
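If applied to the sample data in your question, the output is one row per customer_id, with the purchase columns repeated for each step: label_0, date_0, purchase_id_0, price_0, and so on up to label_5, date_5, purchase_id_5, price_5. Steps that a customer does not have are filled with NULL.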
Note: the above makes the silly assumption that you know the max number of steps (in this case I used 6, from 0 to 5). There are plenty of posts here on SO that show how to use the same technique to make it dynamic. I do not want to duplicate them, as that is against SO policies. So, just do your extra homework on this :o)
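As a rough illustration of what "dynamic" could look like here, the sketch below builds the list of offsets from the data and runs the same pivot through BigQuery scripting (DECLARE, SET, EXECUTE IMMEDIATE). It is only one possible way to do it, not necessarily what the linked posts use. The variable name offsets is made up for this example, and your_table is the same placeholder as in the query above.

```sql
-- Sketch only: "offsets" is a hypothetical variable name, your_table is a placeholder.
declare offsets string;

-- Build "0,1,2,...,N-1", where N is the largest number of purchases of any customer.
set offsets = (
  select string_agg(cast(n as string), ',' order by n)
  from (
    select max(cnt) as max_cnt
    from (select count(*) as cnt from your_table group by customer_id)
  ) as m, unnest(generate_array(0, m.max_cnt - 1)) as n
);

-- Same pivot as above, with the generated list substituted for the hard-coded one.
execute immediate format("""
select * from (
  select customer_id, offset, purchase.*
  from (
    select customer_id,
      array_agg(struct(label, date, purchase_id, price) order by date) purchases
    from your_table
    group by customer_id
  ), unnest(purchases) purchase with offset
)
pivot (
  any_value(label) label,
  any_value(date) date,
  any_value(purchase_id) purchase_id,
  any_value(price) price
  for offset in (%s)
)
""", offsets);
```

The list has to be computed in a separate step because PIVOT only accepts constant values in its IN clause. Note that the number of output columns then depends on the data, which is one more reason to consider leaving the reshaping to the presentation layer.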
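For part a) of the question (telling a label apart from a later repeat of the same label), which the answer above does not cover, a common approach is the "gaps and islands" pattern: flag each purchase whose label differs from the customer's previous one, then take a running sum of those flags to number each consecutive run. The following is a minimal sketch in long format. It assumes date is a DATE column and that a customer has at most one purchase per date (otherwise add a tiebreaker to both ORDER BY clauses); is_new_run and run_no are names invented for this example.

```sql
-- Sketch only: is_new_run and run_no are hypothetical names, your_table is a placeholder.
with flagged as (
  select *,
    -- 1 when the label differs from the customer's previous purchase
    -- (or when there is no previous purchase), otherwise 0
    if(label = lag(label) over (partition by customer_id order by date), 0, 1) as is_new_run
  from your_table
),
runs as (
  select *,
    -- the running total of the flags numbers each consecutive run of a label: 1, 2, 3, ...
    sum(is_new_run) over (partition by customer_id order by date) as run_no
  from flagged
)
select
  customer_id,
  run_no,
  label,
  count(distinct purchase_id) as orders_count,
  sum(price) as total_spent,
  min(date) as first_date,
  max(date) as last_date,
  date_diff(max(date), min(date), day) as days
from runs
group by customer_id, run_no, label
order by customer_id, run_no
```

With the sample data, customer 2 should come out as three separate rows (A, then G, then A again), each with its own first_date and last_date, instead of a single A row spanning January to June. Pivoting on run_no instead of the purchase offset would then give the wide layout with one set of columns per label change.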