SQL/BigQuery: how can I sessionize this data?
I'm struggling to figure out how to write a query that will properly aggregate the sample data below. This sample data represents the output of an aggregate query that I then need to filter down further.
I'm not 100% sure about this, but I think I need to sessionize the data, where a session starts on the first row of a resource_name that has null for complete and ends when that resource_name has true in the complete column, with a lag time of about 30 minutes.
Unfortunately, I don't have the ability to change the data, and all of the output below comes from a single table, via a query that aggregates the fields as shown.
Example Data:
|resource_name|operation_type|initiate|complete|timestamp |
|-------------|--------------|--------|--------|----------------------|
|foo |full |true |null |2021-11-01-5:51:46 UTC|
|foo |full |null |true |2021-11-01-5:51:49 UTC|
|foo |incomplete |null |null |2021-11-01-7:02:22 UTC| <--- foo begins
|foo |incomplete |null |null |2021-11-01-7:02:37 UTC|
|foo |incomplete |null |null |2021-11-01-7:03:19 UTC|
|baz |incomplete |null |null |2021-11-01-7:03:25 UTC|
|baz |incomplete |null |null |2021-11-01-7:03:29 UTC|
|foo |full |true |null |2021-11-01-7:03:31 UTC|
|foo |full |null |true |2021-11-01-7:12:55 UTC| <--- foo ends
|bar |incomplete |null |null |2021-11-01-7:39:22 UTC| <--- bar starts
|bar |incomplete |null |null |2021-11-01-7:40:37 UTC|
|baz |incomplete |null |null |2021-11-01-7:41:37 UTC|
|baz |incomplete |null |null |2021-11-01-7:41:39 UTC|
|baz |incomplete |null |null |2021-11-01-7:41:45 UTC|
|bar |incomplete |null |null |2021-11-01-7:44:19 UTC|
|bar |incomplete |null |null |2021-11-01-7:44:58 UTC|
|bar |full |true |null |2021-11-01-7:45:31 UTC|
|bar |full |null |true |2021-11-01-7:47:55 UTC| <--- bar ends
|bar |incomplete |null |null |2021-11-01-9:38:22 UTC| <--- bar starts again
|bar |incomplete |null |null |2021-11-01-9:40:37 UTC|
|bar |full |true |null |2021-11-01-9:45:31 UTC|
|bar |full |null |true |2021-11-01-9:51:55 UTC| <--- bar ends again
What I'm trying to do is find, for each resource_name, the timestamp difference between the first incomplete operation_type and the next full operation_type where complete = true.
So, in this case, I would return one value for foo and two values for bar. foo has one initial incomplete operation_type and one full operation_type with complete = true, and bar has two instances of the same.
My results should be (duration not computed, so you can see which timestamps should be picked up; sorted DESC):
|resource_name|duration |
|-------------|----------------------------------------------------------------|
|bar | timestamp_diff(2021-11-01-9:51:55 - 2021-11-01-9:38:22, SECOND)|
|foo | timestamp_diff(2021-11-01-7:12:55 - 2021-11-01-7:02:22, SECOND)|
|bar | timestamp_diff(2021-11-01-7:47:55 - 2021-11-01-7:39:22, SECOND)|
Based on the information you provided, I came up with this solution. I followed these steps to recreate it on my side.
To explain the thought process: the approach is to get your successful runs first (full with complete = true), then use them to get the minimum incomplete timestamp for each run. From what I see, you can have more than one error within an hour, so we split the data by hour to cover that scenario.
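The hour-based pairing described above can be sketched outside of SQL to make the logic easier to follow. This is only an illustration under assumptions: the `ROWS` list and the `durations_by_hour` helper are hypothetical names, the rows are a trimmed subset of the sample data, and the key is the hour of day, as in the query below.

```python
from datetime import datetime

# Hypothetical in-memory rows: (resource_name, operation_type, complete, timestamp).
# Trimmed subset of the sample data, kept small for illustration.
ROWS = [
    ("foo", "incomplete", None, "2021-11-01 07:02:22"),
    ("foo", "full", True, "2021-11-01 07:12:55"),
    ("bar", "incomplete", None, "2021-11-01 07:39:22"),
    ("bar", "full", True, "2021-11-01 07:47:55"),
    ("bar", "incomplete", None, "2021-11-01 09:38:22"),
    ("bar", "full", True, "2021-11-01 09:51:55"),
]


def durations_by_hour(rows):
    """Pair the earliest 'incomplete' row with the latest completed 'full'
    row that shares the same (resource_name, hour of day) key."""
    first_incomplete = {}
    last_complete = {}
    for name, op, complete, ts in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        key = (name, t.hour)
        if op == "incomplete":
            first_incomplete[key] = min(first_incomplete.get(key, t), t)
        elif op == "full" and complete:
            last_complete[key] = max(last_complete.get(key, t), t)
    # Equivalent of the inner join: keep only keys present on both sides.
    return {
        key: int((last_complete[key] - first_incomplete[key]).total_seconds())
        for key in first_incomplete
        if key in last_complete
    }


if __name__ == "__main__":
    for (name, hour), seconds in durations_by_hour(ROWS).items():
        print(name, hour, seconds)
```

One consequence of keying on the hour is that a run whose first incomplete row and final completion fall in different hours would not be paired.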
CSV
foo,full,true,,2021-11-01 05:51:46
foo,full,,true,2021-11-01 05:51:49
foo,incomplete,,,2021-11-01 07:02:22
foo,incomplete,,,2021-11-01 07:02:37
foo,incomplete,,,2021-11-01 07:03:19
baz,incomplete,,,2021-11-01 07:03:25
baz,incomplete,,,2021-11-01 07:03:29
foo,full,true,,2021-11-01 07:03:31
foo,full,,true,2021-11-01 07:12:55
bar,incomplete,,,2021-11-01 07:39:22
bar,incomplete,,,2021-11-01 07:40:37
baz,incomplete,,,2021-11-01 07:41:37
baz,incomplete,,,2021-11-01 07:41:39
baz,incomplete,,,2021-11-01 07:41:45
bar,incomplete,,,2021-11-01 07:44:19
bar,incomplete,,,2021-11-01 07:44:58
bar,full,true,,2021-11-01 07:45:31
bar,full,,true,2021-11-01 07:47:55
bar,incomplete,,,2021-11-01 09:38:22
bar,incomplete,,,2021-11-01 09:40:37
bar,full,true,,2021-11-01 09:45:31
bar,full,,true,2021-11-01 09:51:55
BigQuery code
-- Successful runs: 'full' operations where complete = true.
with operations_success as (
  select f.resource_name,
         f.operation_type,
         f.timestamp,
         extract(hour from f.timestamp) as hour,
         UNIX_SECONDS(f.timestamp) as unixsecs
  from <MY TEMP BIGQUERY TABLE> f
  where f.operation_type = 'full' and f.complete = True
  group by f.resource_name, f.operation_type, f.timestamp
)
-- Pair each successful run with the 'incomplete' rows of the same resource
-- in the same hour, then measure from the first 'incomplete' row to completion.
select s.resource_name,
       timestamp_diff(TIMESTAMP_SECONDS(MAX(ss.unixsecs)), MIN(s.timestamp), SECOND) as duration_seconds
from <MY TEMP BIGQUERY TABLE> s
inner join operations_success ss
  on EXTRACT(HOUR FROM s.timestamp) = ss.hour and ss.resource_name = s.resource_name
where s.operation_type = 'incomplete'
group by s.resource_name, EXTRACT(HOUR FROM s.timestamp)
The output will be:
1| foo | 633
2| bar | 513
3| bar | 813
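As a cross-check on those numbers, here is a minimal sketch that walks the rows in timestamp order and does not rely on hour buckets. It is not the BigQuery query above, just one possible way to express the "first incomplete row until the next completed full row" rule from the question. The `SAMPLE` list and the `session_durations` helper are hypothetical names, the rows are a subset of the CSV, and the 30 minute lag mentioned in the question is not modelled.

```python
from datetime import datetime

# Hypothetical rows: (resource_name, operation_type, complete, timestamp).
# Subset of the CSV above; 'complete' is None where the CSV field is empty.
SAMPLE = [
    ("foo", "full", None, "2021-11-01 05:51:46"),
    ("foo", "full", True, "2021-11-01 05:51:49"),
    ("foo", "incomplete", None, "2021-11-01 07:02:22"),
    ("foo", "incomplete", None, "2021-11-01 07:03:19"),
    ("baz", "incomplete", None, "2021-11-01 07:03:25"),
    ("foo", "full", None, "2021-11-01 07:03:31"),
    ("foo", "full", True, "2021-11-01 07:12:55"),
    ("bar", "incomplete", None, "2021-11-01 07:39:22"),
    ("baz", "incomplete", None, "2021-11-01 07:41:37"),
    ("bar", "full", None, "2021-11-01 07:45:31"),
    ("bar", "full", True, "2021-11-01 07:47:55"),
    ("bar", "incomplete", None, "2021-11-01 09:38:22"),
    ("bar", "full", None, "2021-11-01 09:45:31"),
    ("bar", "full", True, "2021-11-01 09:51:55"),
]


def session_durations(rows):
    """Return (resource_name, seconds) for each session, where a session runs
    from the first 'incomplete' row to the next 'full' row with complete = True."""
    open_start = {}  # resource_name -> start time of the currently open session
    results = []
    # Zero-padded "YYYY-MM-DD HH:MM:SS" strings sort chronologically.
    for name, op, complete, ts in sorted(rows, key=lambda r: r[3]):
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if op == "incomplete":
            # Only the first 'incomplete' row opens a session.
            open_start.setdefault(name, t)
        elif op == "full" and complete and name in open_start:
            start = open_start.pop(name)
            results.append((name, int((t - start).total_seconds())))
    return results


if __name__ == "__main__":
    for name, seconds in session_durations(SAMPLE):
        print(name, seconds)
```

On this subset the sketch gives the same three durations as the query output, and baz produces no row because it never reaches a completed full operation.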
Please let me know if this code helps you in your process.
Regards,