Extracting a substring in BigQuery

Question

Body
/tpt/TpScaning-ScaningRoad-PERSIVEE-202301020730327429-1567651.tp
/tpt/TpScaning-BaliResortsInterns-PERSIVEE-202205300341154744-909890.tp
/tpt/TpScaning-RXX-PERSIVEE-202108210412090110-000989.tp
/tpt/TpScaning-Backnationnotecom-PERSIVEE-202112201229124214-783672.tp
/tpt/TpScaning-DBZ-PERSIVEE-202109131129036172-908954.tp
/tpt/TpScaning-DBZ-PERSIVEE-202109131129036172-17892578.tp

I have a column 'Body' in a BigQuery table xyz that contains the scans requested on our internal app. Each Body value embeds a date part, an 18-digit number (202301020730327429, 202205300341154744, ...).

I want to extract this 18-digit sequence into a separate column named date so I can tell when each scan request was created.

I tried the SUBSTR function, but it did not work correctly because the length of the Body field varies. Is there a better way to implement this?

Expected output:

date
202301020730327429
202205300341154744
202108210412090110
202112201229124214
202109131129036172
202109131129036172

Answer 1

You might consider the query below.

WITH sample_table AS (
  SELECT '/tpt/TpScaning-ScaningRoad-PERSIVEE-202301020730327429-1567651.tp' body UNION ALL
  SELECT '/tpt/TpScaning-BaliResortsInterns-PERSIVEE-202205300341154744-909890.tp' body UNION ALL
  SELECT '/tpt/TpScaning-RXX-PERSIVEE-202108210412090110-000989.tp' body UNION ALL
  SELECT '/tpt/TpScaning-Backnationnotecom-PERSIVEE-202112201229124214-783672.tp' body UNION ALL
  SELECT '/tpt/TpScaning-DBZ-PERSIVEE-202109131129036172-908954.tp' body UNION ALL
  SELECT '/tpt/TpScaning-DBZ-PERSIVEE-202109131129036172-17892578.tp' body
)
SELECT *, PARSE_TIMESTAMP('%Y%m%d%H%M%E4S', LEFT(date, 14) || '.' || RIGHT(date, 4)) ts 
  FROM (
    SELECT *, REGEXP_EXTRACT(body, r'\d{18}') AS date FROM sample_table
  );

Some explanations:

Regular expression explanation -> https://regex101.com/r/fiB4mD/1
\d matches one digit (equivalent to [0-9]).
{18} matches the previous token exactly 18 times.
REGEXP_EXTRACT finds the first substring of the given string that matches the regular expression above.
The %E4S format element parses seconds with 4 digits of fractional precision, i.e. 00.0000. Note that it requires a dot (.) in the string data, which is why the query inserts one between the first 14 digits and the last 4.
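
To make the dot-insertion step concrete, here is a minimal sketch that runs the same expressions on a single literal taken from the question. The column name seconds_with_dot and the SAFE. prefix are additions for illustration and are not part of the original answer; SAFE. makes the parse return NULL instead of raising an error if a value does not fit the format.

```sql
-- Sketch only: one sample value from the question, no table required.
SELECT
  date,
  -- First 14 digits (YYYYMMDDHHMMSS), a dot, then the last 4 digits.
  -- For this input the result is '20230102073032.7429'.
  LEFT(date, 14) || '.' || RIGHT(date, 4) AS seconds_with_dot,
  -- SAFE. is an assumption added here: it yields NULL rather than an error
  -- when the string cannot be parsed.
  SAFE.PARSE_TIMESTAMP('%Y%m%d%H%M%E4S', LEFT(date, 14) || '.' || RIGHT(date, 4)) AS ts
FROM (
  SELECT REGEXP_EXTRACT(
    '/tpt/TpScaning-ScaningRoad-PERSIVEE-202301020730327429-1567651.tp',
    r'\d{18}') AS date
);
```

If a Body value contains no run of 18 digits, REGEXP_EXTRACT returns NULL, and the concatenation and parse then return NULL as well.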


Answer 2

Also consider the approach below.

select body, regexp_extract(body, r'\d{18}') as date,
  timestamp(regexp_replace(body, r'(^.*?)(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(\d{4})(.*?$)', r'\2-\3-\4 \5:\6:\7.\8')) as ts
from your_table

Applied to the sample data in the question, this returns each body together with its 18-digit date string and the corresponding ts timestamp.
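
As a rough illustration of what the REGEXP_REPLACE step produces before it is converted to a timestamp, the sketch below applies an equivalent pattern to one sample value. The column names formatted and ts_or_null, and the use of SAFE_CAST, are assumptions added here and are not part of the original answer.

```sql
-- Sketch only: one sample value from the question, no table required.
SELECT
  -- The seven capture groups are year, month, day, hour, minute, second
  -- and the 4 fractional digits; everything else is discarded.
  -- For this input the result is '2021-08-21 04:12:09.0110'.
  REGEXP_REPLACE(
    '/tpt/TpScaning-RXX-PERSIVEE-202108210412090110-000989.tp',
    r'^.*?(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(\d{4}).*$',
    r'\1-\2-\3 \4:\5:\6.\7') AS formatted,
  -- Assumption: SAFE_CAST is used so that a non-timestamp string gives NULL.
  SAFE_CAST(
    REGEXP_REPLACE(
      '/tpt/TpScaning-RXX-PERSIVEE-202108210412090110-000989.tp',
      r'^.*?(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(\d{4}).*$',
      r'\1-\2-\3 \4:\5:\6.\7') AS TIMESTAMP) AS ts_or_null;
```

One design point to keep in mind: when the pattern does not match, REGEXP_REPLACE returns the input string unchanged, so passing it straight to TIMESTAMP() would raise an error for such rows. SAFE_CAST is one way to get NULL instead.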

Answer 1 | 1 | 2023-01-05 14:18:28
Answer 2 | 1 | accepted | 2023-01-05 15:43:13
