
Simple regex matching in BigQuery not working

I have two tables available in BigQuery:

  • my-project.my-database.what-to-query :
+---------+-----------+
| id_what | name_what |
+---------+-----------+
|    1    |   C++     |
+---------+-----------+
|    2    |   Foo     |
+---------+-----------+
|    3    |   Ca$h    |
+---------+-----------+
  • my-project.my-database.where-to-query :
+----------+----------------------+
| id_where |      name_where      |
+----------+----------------------+
|    4     | C++ and Ca$h         |
+----------+----------------------+
|    5     | Foo Fighters is nice |
+----------+----------------------+
|    6     | I know C# and C++    |
+----------+----------------------+
|    7     | Football is cool     |
+----------+----------------------+
|    8     | Don't have anything  |
+----------+----------------------+

I would like to use each value of name_what as a regex search keyword to find all the matches in name_where, while keeping all the columns of both tables. The result should look like:

+---------+-----------+----------+----------------------+
| id_what | name_what | id_where |      name_where      |
+---------+-----------+----------+----------------------+
|    1    |   C++     |    4     | C++ and Ca$h         |
+---------+-----------+----------+----------------------+
|    1    |   C++     |    6     | I know C# and C++    |
+---------+-----------+----------+----------------------+
|    2    |   Foo     |    5     | Foo Fighters is nice |
+---------+-----------+----------+----------------------+
|    2    |   Foo     |    7     | Football is cool     |
+---------+-----------+----------+----------------------+
|    3    |   Ca$h    |    4     | C++ and Ca$h         |
+---------+-----------+----------+----------------------+

Notice that C++ has to be escaped, something like:

SELECT *
FROM `my-project.my-database.where-to-query`
WHERE REGEXP_CONTAINS(name_where, r"C\+\+")

However, the column name_what can hold many other strings (in reality, both tables contain hundreds of thousands of rows; this is only a toy sample), and those strings may contain other regex special characters. Python, for instance, has re.escape to deal with this specific problem, but there seems to be nothing similar in SQL / BigQuery.

With help from the comments, I have tried the following updated code:

CREATE TEMP FUNCTION ENCODE_WITH_ESCAPE(x STRING) RETURNS STRING AS (
    REPLACE(
      REPLACE(x, "+", "\\\\+"), "$", "\\\\$"
    )  -- For the time being, only "+" & "$" have been dealt with, there could be more
);

WITH what AS (
      SELECT 1 AS id_what, 'c++' AS name_what UNION ALL 
      SELECT 2 AS id_what, 'foo' AS name_what UNION ALL 
      SELECT 3 AS id_what, 'ca$h' AS name_what

    ),
    andwhere AS (
      SELECT 4 AS id_where, 'C++ and Ca$h' AS name_where UNION ALL 
      SELECT 5 AS id_where, 'Foo Fighters is nice' AS name_where UNION ALL 
      SELECT 6 AS id_where, 'I know C# and C++' AS name_where UNION ALL 
      SELECT 7 AS id_where, 'Football is cool' AS name_where UNION ALL 
      SELECT 8 AS id_where, "Don't have anything" AS name_where

    ) 

    SELECT * 
    FROM what JOIN andwhere 
    ON REGEXP_CONTAINS(ENCODE_WITH_ESCAPE(andwhere.name_where), ENCODE_WITH_ESCAPE(what.name_what))

The previous code runs, but the output is: There is no data to display.

How can I combine all the requirements?

PS: An answer that relies on BigQuery's "Legacy SQL" is NOT acceptable.

See if this helps:

create temp function encode_with_escape(x STRING) returns string as (
    -- "\\+" is a backslash followed by "+", which the regex engine reads as a literal "+"
    replace(x, "+", "\\+")
);

WITH what AS (
      SELECT 1 as id_what, 'c++' as name_what union all
      SELECT 2 as id_what, 'foo' as name_what

    ),
    andwhere as (
      SELECT 3 as id_where, 'c++ is great' as name_where union all
      SELECT 5 as id_where, 'c++ was after c' as name_where union all
      SELECT 4 as id_where, 'food was good' as name_where

    )

    SELECT *
    FROM what join andwhere
    -- escape only the pattern; the text being searched is left untouched
    on regexp_contains(andwhere.name_where, encode_with_escape(what.name_what))

Gives back:

[image: query output]
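The REPLACE-based helper above handles one special character per call. As a minimal sketch, assuming REGEXP_REPLACE with a capture group behaves as described in the BigQuery documentation, all common regex metacharacters could be escaped in one pass in pure SQL. ESCAPE_REGEX is a hypothetical name chosen for illustration, and the sample rows are copied from the question:

```sql
-- Hypothetical helper: put a backslash in front of every regex metacharacter.
-- In the replacement, \\ emits a literal backslash and \1 re-inserts the matched character.
CREATE TEMP FUNCTION ESCAPE_REGEX(x STRING) RETURNS STRING AS (
  REGEXP_REPLACE(x, r'([.*+?^${}()|\[\]\\])', r'\\\1')
);

WITH what AS (
  SELECT 1 AS id_what, 'C++' AS name_what UNION ALL
  SELECT 2 AS id_what, 'Foo' AS name_what UNION ALL
  SELECT 3 AS id_what, 'Ca$h' AS name_what
),
andwhere AS (
  SELECT 4 AS id_where, 'C++ and Ca$h' AS name_where UNION ALL
  SELECT 5 AS id_where, 'Foo Fighters is nice' AS name_where UNION ALL
  SELECT 6 AS id_where, 'I know C# and C++' AS name_where UNION ALL
  SELECT 7 AS id_where, 'Football is cool' AS name_where UNION ALL
  SELECT 8 AS id_where, "Don't have anything" AS name_where
)
SELECT *
FROM what
JOIN andwhere
  -- Only the pattern is escaped; the searched text is passed through unchanged.
  ON REGEXP_CONTAINS(andwhere.name_where, ESCAPE_REGEX(what.name_what))
ORDER BY id_what, id_where;
```

On this sample the join should return the five rows listed as the expected result in the question. If the match needs to ignore case, one option is to prefix the escaped pattern with the RE2 flag, e.g. CONCAT('(?i)', ESCAPE_REGEX(what.name_what)).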

Consider the option below:

create temp function escapeRegExp(x string) 
returns string language js
as r"return x.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');";
with what as (
  select 1 as id_what, 'c++' as name_what union all 
  select 2 as id_what, 'foo' as name_what union all 
  select 3 as id_what, 'ca$h' as name_what
), andwhere as (
  select 4 as id_where, 'C++ and Ca$h' as name_where union all 
  select 5 as id_where, 'Foo Fighters is nice' as name_where union all 
  select 6 as id_where, 'I know C# and C++' as name_where union all 
  select 7 as id_where, 'Football is cool' as name_where union all 
  select 8 as id_where, "Don't have anything" as name_where
)
select *
from what join andwhere
on regexp_contains(lower(name_where), escapeRegExp(lower(name_what)))    

with output

[image: query output]
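To see what the JavaScript UDF actually produces, a small check query can help. This is only a sketch: the function body is the one from the answer above, while the test strings and the column aliases escaped_pattern and matches_itself are made up for illustration:

```sql
create temp function escapeRegExp(x string)
returns string language js
as r"return x.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');";

-- Each keyword is shown next to its escaped form, plus a check that
-- the escaped pattern still matches the original keyword literally.
select
  name_what,
  escapeRegExp(name_what) as escaped_pattern,
  regexp_contains(name_what, escapeRegExp(name_what)) as matches_itself
from unnest(['c++', 'ca$h', 'a.b', '(x|y)']) as name_what
```

Every special character should come back prefixed with a single backslash (for example c\+\+ and ca\$h), and matches_itself should be true on every row.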

Pratik's solution is the way to go. In the meantime, also consider the option below, which avoids regular expressions altogether:

SELECT c.id_what, c.name_what, s.id_where, s.name_where
FROM `my-project.my-database.what-to-query` c, `my-project.my-database.where-to-query` s
WHERE s.name_where LIKE '%' || c.name_what || '%'       

with output

[image: query output]
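One caveat with LIKE is that % and _ inside name_what act as wildcards, so a keyword such as 100% or my_var can match more than intended. A minimal sketch of a plain substring test with STRPOS, which treats every character literally, is shown below. The LOWER calls are an added assumption that the match should ignore case; drop them for a case-sensitive match like the LIKE version:

```sql
SELECT c.id_what, c.name_what, s.id_where, s.name_where
FROM `my-project.my-database.what-to-query` AS c
CROSS JOIN `my-project.my-database.where-to-query` AS s
-- STRPOS returns the 1-based position of the first occurrence, or 0 if there is none.
WHERE STRPOS(LOWER(s.name_where), LOWER(c.name_what)) > 0
```

Like the LIKE version, this compares every keyword with every row, so on tables with hundreds of thousands of rows the cost of the cross join is worth checking before running it.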
