SQL/BigQuery how can I sessionize this data?

I'm struggling to figure out how to write a query that will properly aggregate the sample data below. This sample data represents the output of an aggregate query that I then need to filter down further. I'm not 100% sure on this, but I think I need to sessionize this data, where a session starts on the first row of a resource_name that has null in column complete and ends when that resource_name has true in column complete, with a lag time of ~30 minutes.

Unfortunately, I don't have the ability to change the data, and all of the output below comes from a single table, via a query that aggregates the fields as shown.

Example data:

|resource_name|operation_type|initiate|complete|timestamp             |
|-------------|--------------|--------|--------|----------------------|
|foo          |full          |true    |null    |2021-11-01-5:51:46 UTC|
|foo          |full          |null    |true    |2021-11-01-5:51:49 UTC|
|foo          |incomplete    |null    |null    |2021-11-01-7:02:22 UTC|  <--- foo begins
|foo          |incomplete    |null    |null    |2021-11-01-7:02:37 UTC|
|foo          |incomplete    |null    |null    |2021-11-01-7:03:19 UTC|
|baz          |incomplete    |null    |null    |2021-11-01-7:03:25 UTC|
|baz          |incomplete    |null    |null    |2021-11-01-7:03:29 UTC|
|foo          |full          |true    |null    |2021-11-01-7:03:31 UTC|
|foo          |full          |null    |true    |2021-11-01-7:12:55 UTC|  <--- foo ends
|bar          |incomplete    |null    |null    |2021-11-01-7:39:22 UTC|  <--- bar starts
|bar          |incomplete    |null    |null    |2021-11-01-7:40:37 UTC|
|baz          |incomplete    |null    |null    |2021-11-01-7:41:37 UTC|
|baz          |incomplete    |null    |null    |2021-11-01-7:41:39 UTC|
|baz          |incomplete    |null    |null    |2021-11-01-7:41:45 UTC|
|bar          |incomplete    |null    |null    |2021-11-01-7:44:19 UTC|
|bar          |incomplete    |null    |null    |2021-11-01-7:44:58 UTC|
|bar          |full          |true    |null    |2021-11-01-7:45:31 UTC|
|bar          |full          |null    |true    |2021-11-01-7:47:55 UTC|  <--- bar ends
|bar          |incomplete    |null    |null    |2021-11-01-9:38:22 UTC|  <--- bar starts again 
|bar          |incomplete    |null    |null    |2021-11-01-9:40:37 UTC|
|bar          |full          |true    |null    |2021-11-01-9:45:31 UTC|
|bar          |full          |null    |true    |2021-11-01-9:51:55 UTC|  <--- bar ends again

What I'm trying to do is find, for each resource_name, the timestamp difference between the first incomplete operation_type and the next full operation_type where complete = true.

So, in this case, I would return one value for foo and two values for bar. foo has one initial incomplete operation_type and one full operation_type with complete = true, and bar has two instances of the same.

My results should be (duration not computed so you can see which timestamps should be picked up, sorted DESC):

|resource_name|duration                                                        |
|-------------|----------------------------------------------------------------|
|bar          | timestamp_diff(2021-11-01-9:51:55 - 2021-11-01-9:38:22, SECOND)|
|foo          | timestamp_diff(2021-11-01-7:12:55 - 2021-11-01-7:02:22, SECOND)|
|bar          | timestamp_diff(2021-11-01-7:47:55 - 2021-11-01-7:39:22, SECOND)|

Based on the information you have provided, I have come up with this solution.

I followed these steps to recreate it on my side:

  1. Created a .csv file based on the data you provided in your post.
  2. Created a dataset and a table.
  3. Filled the table from the .csv file.
  4. Ran the query below.

To explain the thought process: the approach is to get the successful runs first (rows where operation_type is full and complete is true), then use them to find the earliest incomplete timestamp for each run. From what I can see, there can be more than one incomplete row within an hour, so the rows are grouped by hour to cover that scenario.

CSV

foo,full,true,,2021-11-01 05:51:46
foo,full,,true,2021-11-01 05:51:49
foo,incomplete,,,2021-11-01 07:02:22
foo,incomplete,,,2021-11-01 07:02:37
foo,incomplete,,,2021-11-01 07:03:19
baz,incomplete,,,2021-11-01 07:03:25
baz,incomplete,,,2021-11-01 07:03:29
foo,full,true,,2021-11-01 07:03:31
foo,full,,true,2021-11-01 07:12:55
bar,incomplete,,,2021-11-01 07:39:22
bar,incomplete,,,2021-11-01 07:40:37
baz,incomplete,,,2021-11-01 07:41:37
baz,incomplete,,,2021-11-01 07:41:39
baz,incomplete,,,2021-11-01 07:41:45
bar,incomplete,,,2021-11-01 07:44:19
bar,incomplete,,,2021-11-01 07:44:58
bar,full,true,,2021-11-01 07:45:31
bar,full,,true,2021-11-01 07:47:55
bar,incomplete,,,2021-11-01 09:38:22
bar,incomplete,,,2021-11-01 09:40:37
bar,full,true,,2021-11-01 09:45:31
bar,full,,true,2021-11-01 09:51:55
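
For reference, the CSV rows above imply a five-column table. A minimal sketch of a matching schema might look like the following. The dataset and table name (my_dataset.operations) are placeholders rather than names from the original post, and the column types are an assumption inferred from the sample values.

```sql
-- Hypothetical dataset/table name; column order matches the CSV above.
-- Column types are assumed from the sample values, not stated in the post.
CREATE TABLE my_dataset.operations (
  resource_name  STRING,
  operation_type STRING,    -- 'full' or 'incomplete' in the sample
  initiate       BOOL,      -- empty in most sample rows
  complete       BOOL,      -- empty in most sample rows
  `timestamp`    TIMESTAMP  -- sample values carry no zone; UTC per the question
);
```

When the CSV is loaded into a table like this, the empty initiate and complete fields are expected to come through as NULL, which is what the query's complete = True filter relies on.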

BigQuery code

-- Step 1: collect the successful runs (full operations marked complete),
-- keeping the hour of day and the completion time in Unix seconds.
with operations_success as (
    select f.resource_name,
        f.operation_type,
        f.timestamp,
        extract(hour from f.timestamp) as hour,
        UNIX_SECONDS(f.timestamp) as unixsecs
    from <MY TEMP BIGQUERY TABLE> f
    where f.operation_type = 'full' and f.complete = True
    group by f.resource_name, f.operation_type, f.timestamp
)
-- Step 2: match incomplete rows to the successful run of the same resource
-- in the same hour, then diff the completion time and the first incomplete row.
select s.resource_name,
    timestamp_diff(TIMESTAMP_SECONDS(MAX(ss.unixsecs)), MIN(s.timestamp), SECOND) as duration_seconds
from <MY TEMP BIGQUERY TABLE> s
inner join operations_success ss
    on EXTRACT(HOUR FROM s.timestamp) = ss.hour and ss.resource_name = s.resource_name
where s.operation_type = 'incomplete'
group by s.resource_name, EXTRACT(HOUR FROM s.timestamp)

The output (resource_name and duration in seconds) will be:

1| foo | 633
2| bar | 513
3| bar | 813
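
One caveat: joining on the hour of day assumes that each run starts and ends within the same clock hour, and that the same hour does not recur on another day. As a hedged alternative, here is a sketch that numbers the sessions with a window function instead: for each resource_name, a row's session number is the count of complete = true rows that came before it. The table name my_dataset.operations and the aliases session_id, session_start, session_end and duration_seconds are placeholders introduced for illustration, and the ~30 minute lag mentioned in the question is not modeled.

```sql
-- Sketch only: my_dataset.operations is a hypothetical table name.
WITH numbered AS (
  SELECT
    e.resource_name,
    e.operation_type,
    e.complete,
    e.timestamp AS ts,
    -- Completed rows strictly before this one, per resource.
    -- The counter increases right after each complete = true row,
    -- so every row up to and including a completion shares one session_id.
    COUNTIF(e.complete) OVER (
      PARTITION BY e.resource_name
      ORDER BY e.timestamp
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS session_id
  FROM my_dataset.operations AS e
)
SELECT
  resource_name,
  -- First incomplete row of the session.
  MIN(IF(operation_type = 'incomplete', ts, NULL)) AS session_start,
  -- Row that closed the session.
  MAX(IF(complete, ts, NULL)) AS session_end,
  TIMESTAMP_DIFF(
    MAX(IF(complete, ts, NULL)),
    MIN(IF(operation_type = 'incomplete', ts, NULL)),
    SECOND
  ) AS duration_seconds
FROM numbered
GROUP BY resource_name, session_id
-- Drop sessions that never had an incomplete row or never completed.
HAVING session_start IS NOT NULL AND session_end IS NOT NULL
ORDER BY session_end DESC
```

On the sample rows, this should skip foo's 05:51 run (no incomplete row precedes it) and all of baz (it never completes), leaving the same three durations as above: 813 and 513 for bar, 633 for foo.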

Please let me know if this code helps you in your process.

Regards,
