Extract Substring in BigQuery

Body
/tpt/TpScaning-ScaningRoad-PERSIVEE-202301020730327429-1567651.tp
/tpt/TpScaning-BaliResortsInterns-PERSIVEE-202205300341154744-909890.tp
/tpt/TpScaning-RXX-PERSIVEE-202108210412090110-000989.tp
/tpt/TpScaning-Backnationnotecom-PERSIVEE-202112201229124214-783672.tp
/tpt/TpScaning-DBZ-PERSIVEE-202109131129036172-908954.tp
/tpt/TpScaning-DBZ-PERSIVEE-202109131129036172-17892578.tp

I have a column 'Body' in a BigQuery table xyz which contains the scans requested on our internal app. Each Body value contains a date part: an 18-digit number (e.g. 202301020730327429, 202205300341154744, ...).

I want to extract this 18-digit sequence into a separate column named date, so I can tell when the scan requests were created.

I tried the SUBSTR function, but it did not work correctly because the length of the Body field varies. Is there a better way to implement this?

Expected output:

date
202301020730327429
202205300341154744
202108210412090110
202112201229124214
202109131129036172
202109131129036172

You might consider the query below.

WITH sample_table AS (
  SELECT '/tpt/TpScaning-ScaningRoad-PERSIVEE-202301020730327429-1567651.tp' body UNION ALL
  SELECT '/tpt/TpScaning-BaliResortsInterns-PERSIVEE-202205300341154744-909890.tp' body UNION ALL
  SELECT '/tpt/TpScaning-RXX-PERSIVEE-202108210412090110-000989.tp' body UNION ALL
  SELECT '/tpt/TpScaning-Backnationnotecom-PERSIVEE-202112201229124214-783672.tp' body UNION ALL
  SELECT '/tpt/TpScaning-DBZ-PERSIVEE-202109131129036172-908954.tp' body UNION ALL
  SELECT '/tpt/TpScaning-DBZ-PERSIVEE-202109131129036172-17892578.tp' body
)
SELECT *, PARSE_TIMESTAMP('%Y%m%d%H%M%E4S', LEFT(date, 14) || '.' || RIGHT(date, 4)) ts 
  FROM (
    SELECT *, REGEXP_EXTRACT(body, r'\d{18}') AS date FROM sample_table
  );
Some explanations:
  • Regular expression explanation: https://regex101.com/r/fiB4mD/1
  • \d matches a single digit (equivalent to [0-9]).
  • {18} matches the previous token exactly 18 times.
  • The REGEXP_EXTRACT function returns the first substring of the given string that matches the regular expression above.
  • The %E4S format element parses seconds with 4 digits of fractional precision, i.e. 00.0000. Note that it requires a dot (.) in the string data, which is why the query inserts one between the first 14 and the last 4 digits.
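To make the string manipulation in the query easier to follow, here is a minimal sketch that exposes the intermediate values for one sample string from the question. The column names up_to_seconds, fraction and parse_input are illustrative only, and reading the last four digits as fractional seconds is the answer's interpretation rather than something the question confirms.

```sql
-- Sketch: intermediate values for a single Body value taken from the question.
WITH s AS (
  SELECT REGEXP_EXTRACT(
    '/tpt/TpScaning-RXX-PERSIVEE-202108210412090110-000989.tp',
    r'\d{18}') AS date
)
SELECT
  date,                                                    -- 202108210412090110
  LEFT(date, 14) AS up_to_seconds,                         -- 20210821041209 (YYYYMMDDHHMMSS)
  RIGHT(date, 4) AS fraction,                              -- 0110 (assumed fractional seconds)
  LEFT(date, 14) || '.' || RIGHT(date, 4) AS parse_input,  -- 20210821041209.0110
  PARSE_TIMESTAMP('%Y%m%d%H%M%E4S',
                  LEFT(date, 14) || '.' || RIGHT(date, 4)) AS ts
FROM s;
```

The dot is inserted only because %E4S expects one; the extracted date column itself stays as the plain 18-digit string.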

Query results: [image]
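If only the calendar day of the request is needed, the same extraction can feed PARSE_DATE instead. The following is a sketch run against the real table rather than the inline sample: xyz and Body come from the question, while your_dataset and the created_date column name are hypothetical placeholders.

```sql
-- Sketch: `your_dataset` and `created_date` are hypothetical names;
-- xyz and Body are the table and column named in the question.
SELECT
  Body,
  REGEXP_EXTRACT(Body, r'\d{18}') AS date,
  -- The first 8 of the 18 digits are YYYYMMDD.
  PARSE_DATE('%Y%m%d', LEFT(REGEXP_EXTRACT(Body, r'\d{18}'), 8)) AS created_date
FROM your_dataset.xyz;
```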

Consider also the approach below.

select body, regexp_extract(body, r'\d{18}') as date,
  timestamp(regexp_replace(body, r'(^.*?)(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(\d{4})(.*?$)', r'\2-\3-\4 \5:\6:\7.\8')) as ts
from your_table             

If applied to the sample data in your question, the output is: [image]
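Both approaches assume that every Body value contains a well-formed 18-digit timestamp. As a hedged variation, the pattern can be anchored on the surrounding hyphens, so that a longer run of digits is not matched by accident, and the SAFE. prefix can be used so that an unparseable value yields NULL instead of failing the query. The second sample row below is a made-up malformed value, not data from the question.

```sql
-- Sketch: stricter pattern plus NULL-on-failure parsing.
WITH sample_table AS (
  SELECT '/tpt/TpScaning-DBZ-PERSIVEE-202109131129036172-908954.tp' AS body UNION ALL
  SELECT '/tpt/TpScaning-DBZ-PERSIVEE-no-timestamp.tp' AS body  -- hypothetical malformed row
)
SELECT
  body,
  date,
  -- SAFE. returns NULL instead of raising an error if the digits are not a valid timestamp.
  SAFE.PARSE_TIMESTAMP('%Y%m%d%H%M%E4S',
                       LEFT(date, 14) || '.' || RIGHT(date, 4)) AS ts
FROM (
  -- With a capturing group, REGEXP_EXTRACT returns only the captured 18 digits;
  -- it returns NULL when there is no match.
  SELECT body, REGEXP_EXTRACT(body, r'-(\d{18})-') AS date FROM sample_table
);
```

For the malformed row, date is NULL and so is ts, which makes such rows easy to find with a WHERE date IS NULL filter.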
