Convert list of strings to table

I'm just a beginner in SQL and I would like to transform strings like the following into a table:

Date of application: 01/02/2018 Request: Buy books Contact: email: hi@gmail.com Tel: 0123456789 Ordered inquiry: Order ID: 12345678 BL: 87654321 Product: 123456 Books

Date of application: 01/04/2018 Request: Retour table Contact: Rodion Raskólnikov email: hello@outlook.com Tel: 9876543210 Ordered inquiry: Order Id: 87654321 BL: 12345678 Product: 654321 Tables

Like this: (image omitted)

I tried this:

WITH raw_messages AS (SELECT lines
  FROM `my_table` 
  WHERE REGEXP_CONTAINS(lines, '^Date of application: '))

SELECT 
  REGEXP_EXTRACT(lines, r'^Date of application: [0-9]{2}/[0-9]{2}/[0-9]{4}') as date

FROM raw_messages

It does not work as I would like, and I have no idea how to continue building my table.

Below is for BigQuery Standard SQL

This assumes that the fields in your lines always appear in the same order as in your example:

#standardSQL
WITH raw_messages AS (
  SELECT lines
  FROM `my_table` 
  WHERE REGEXP_CONTAINS(lines, '^Date of application: ')
)
SELECT 
  REGEXP_EXTRACT(lines, r'(?i)^Date of application: ([0-9]{2}/[0-9]{2}/[0-9]{4})') AS DATE,
  REGEXP_EXTRACT(lines, r'(?i) Request: (.*?) Contact: ') AS request,
  REGEXP_EXTRACT(lines, r'(?i) Contact: (.*?) email: ') AS contact,
  REGEXP_EXTRACT(lines, r'(?i) email: (.*?) Tel: ') AS email,
  REGEXP_EXTRACT(lines, r'(?i) Tel: (.*?) Ordered inquiry: ') AS phone,
  REGEXP_EXTRACT(lines, r'(?i) Order ID: (.*?) BL: ') AS id,
  REGEXP_EXTRACT(lines, r'(?i) BL: (.*?) Product: ') AS bl,
  REGEXP_EXTRACT(lines, r'(?i) Product: (.*?)$') AS product
FROM raw_messages   

You can test and play with the above using the dummy data from your question, as below:

#standardSQL
WITH `project.dataset.my_table` AS (
  SELECT 'Date of application: 01/02/2018 Request: Buy books Contact: email: hi@gmail.com Tel: 0123456789 Ordered inquiry: Order ID: 12345678 BL: 87654321 Product: 123456 Books' lines UNION ALL
  SELECT 'Date of application: 01/04/2018 Request: Retour table Contact: Rodion Raskólnikov email: hello@outlook.com Tel: 9876543210 Ordered inquiry: Order Id: 87654321 BL: 12345678 Product: 654321 Tables'
), raw_messages AS (
  SELECT lines
  FROM `project.dataset.my_table` 
  WHERE REGEXP_CONTAINS(lines, '^Date of application: ')
)
SELECT 
  REGEXP_EXTRACT(lines, r'(?i)^Date of application: ([0-9]{2}/[0-9]{2}/[0-9]{4})') AS DATE,
  REGEXP_EXTRACT(lines, r'(?i) Request: (.*?) Contact: ') AS request,
  REGEXP_EXTRACT(lines, r'(?i) Contact: (.*?) email: ') AS contact,
  REGEXP_EXTRACT(lines, r'(?i) email: (.*?) Tel: ') AS email,
  REGEXP_EXTRACT(lines, r'(?i) Tel: (.*?) Ordered inquiry: ') AS phone,
  REGEXP_EXTRACT(lines, r'(?i) Order ID: (.*?) BL: ') AS id,
  REGEXP_EXTRACT(lines, r'(?i) BL: (.*?) Product: ') AS bl,
  REGEXP_EXTRACT(lines, r'(?i) Product: (.*?)$') AS product
FROM raw_messages    

with this result:

Row DATE        request         contact             email               phone       id          bl          product  
1   01/02/2018  Buy books       null                hi@gmail.com        0123456789  12345678    87654321    123456 Books     
2   01/04/2018  Retour table    Rodion Raskólnikov  hello@outlook.com   9876543210  87654321    12345678    654321 Tables    
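
One likely reason the query in the question returns more than the date: REGEXP_EXTRACT returns the whole match when the pattern has no capturing group, and only the captured part when it has one. The Python sketch below imitates that behavior locally so the patterns can be checked without running a query. The helper `regexp_extract` is a hypothetical stand-in written for this illustration, not the BigQuery function, and it assumes that Python's `re` treats these particular patterns the same way BigQuery's regex engine does.

```python
import re

def regexp_extract(text, pattern):
    """Hypothetical local stand-in for REGEXP_EXTRACT (not the BigQuery function itself)."""
    match = re.search(pattern, text)
    if match is None:
        return None  # no match corresponds to NULL
    # With a capturing group, return the group; without one, return the whole match.
    return match.group(1) if match.re.groups else match.group(0)

line = ("Date of application: 01/02/2018 Request: Buy books Contact: email: hi@gmail.com "
        "Tel: 0123456789 Ordered inquiry: Order ID: 12345678 BL: 87654321 Product: 123456 Books")

# Pattern from the question: no capturing group, so the label comes back as well.
without_group = regexp_extract(line, r'^Date of application: [0-9]{2}/[0-9]{2}/[0-9]{4}')
# Pattern from the answer: the parentheses keep only the date.
with_group = regexp_extract(line, r'(?i)^Date of application: ([0-9]{2}/[0-9]{2}/[0-9]{4})')
# The contact is empty in this line, so ' Contact: (.*?) email: ' finds nothing.
contact = regexp_extract(line, r'(?i) Contact: (.*?) email: ')

print(without_group)
print(with_group)
print(contact)
```

The last call shows why the contact column is null for the first row: the pattern needs a space before `email:` in addition to the one after `Contact:`, and the sample line has only one space there.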

If the order of fields in your strings is not known or guaranteed, but you know all the fields that can appear in them, the query below parses them regardless of order:

#standardSQL
WITH raw_messages AS (
  SELECT lines FROM `project.dataset.my_table` 
  WHERE REGEXP_CONTAINS(lines, '^Date of application: ')
), fields AS (
  SELECT 'Date of application' field, 'date' column UNION ALL
  SELECT 'Request', 'request' UNION ALL
  SELECT 'Contact', 'contact' UNION ALL
  SELECT 'email', 'email' UNION ALL
  SELECT 'Tel', 'phone' UNION ALL
  SELECT 'Order ID', 'id' UNION ALL
  SELECT 'BL', 'bl' UNION ALL
  SELECT 'Product', 'product' UNION ALL
  SELECT 'Ordered inquiry', '' UNION ALL
  SELECT 'Boundary of string', ''
), patterns AS ( 
  SELECT f1.field, f1.column, CONCAT(r'(?i) ',f1.field,': (.*)',f2.field,': ') pattern
  FROM fields f1 CROSS JOIN fields f2
), splits AS (SELECT ARRAY(
      SELECT AS STRUCT column, ARRAY_AGG(value ORDER BY LENGTH(value) LIMIT 1)[OFFSET(0)] value
      FROM (SELECT column, REGEXP_EXTRACT(CONCAT(' Boundary of string: ', lines, ' Boundary of string: '), pattern) value
        FROM patterns ) 
      WHERE NOT value IS NULL AND NOT column = '' GROUP BY column 
    ) arr FROM raw_messages
) SELECT 
  (SELECT value FROM UNNEST(arr) WHERE column='date')     AS DATE,
  (SELECT value FROM UNNEST(arr) WHERE column='request')  AS request,
  (SELECT value FROM UNNEST(arr) WHERE column='contact')  AS contact,
  (SELECT value FROM UNNEST(arr) WHERE column='email')    AS email,
  (SELECT value FROM UNNEST(arr) WHERE column='phone')    AS phone,
  (SELECT value FROM UNNEST(arr) WHERE column='id')       AS id,
  (SELECT value FROM UNNEST(arr) WHERE column='bl')       AS bl,
  (SELECT value FROM UNNEST(arr) WHERE column='product')  AS product
FROM splits     

You can test the above with the same dummy data as in my other answer; the result should be the same.

Note: as you can see, you need to define the fields AS (...) CTE explicitly, listing every field that appears in the strings together with the column name to use for it. The entries can be in any order, but it is important to add one more entry there: 'Boundary of string'.
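
To make the idea behind that query easier to follow, here is a minimal Python sketch of the same approach: wrap the line in 'Boundary of string' markers, try every known field as a possible terminator for each field, and keep the shortest capture. The names `FIELDS` and `parse_line` are hypothetical and exist only for this illustration. The sketch also trims surrounding whitespace from each value, which is an addition of mine, and it assumes Python's `re` handles these patterns the same way BigQuery does.

```python
import re

# Field label in the text -> output column ('' means the label has no column of its own).
FIELDS = {
    "Date of application": "date",
    "Request": "request",
    "Contact": "contact",
    "email": "email",
    "Tel": "phone",
    "Order ID": "id",
    "BL": "bl",
    "Product": "product",
    "Ordered inquiry": "",
    "Boundary of string": "",  # artificial marker added around each line
}

def parse_line(line):
    """Parse one line into {column: value}, whatever the order of the fields."""
    padded = f" Boundary of string: {line} Boundary of string: "
    row = {}
    for field, column in FIELDS.items():
        if not column:
            continue
        candidates = []
        for terminator in FIELDS:
            # Same shape as the pattern built in the patterns CTE.
            pattern = f"(?i) {re.escape(field)}: (.*){re.escape(terminator)}: "
            match = re.search(pattern, padded)
            if match:
                candidates.append(match.group(1))
        if candidates:
            # The shortest capture ends at the nearest following field.
            row[column] = min(candidates, key=len).strip()
    return row

shuffled = ("Date of application: 01/04/2018 Tel: 9876543210 email: hello@outlook.com "
            "Contact: Rodion Raskólnikov Request: Retour table Ordered inquiry: "
            "BL: 12345678 Order Id: 87654321 Product: 654321 Tables")
print(parse_line(shuffled))
```

Taking the shortest capture is what makes the order irrelevant: every field that appears later in the string yields a candidate, and the nearest one gives the smallest value. The boundary marker gives the last field something to stop at.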
