How to optimize repeated regex using cross join in BigQuery

Question

I have about 20 million rows of text data and want to label them based on a list of keywords (about 100k keywords). My text data looks like this:

text
my car was broken
nobody knows
the fish is so beautiful

The keywords look like this:

keywords
car
beautiful
know
journey

My expected output looks like this, where I match the text column against the keywords data using regex:

text	keywords
my car was broken	car
nobody knows	know
the fish is so beautiful	beautiful

I use a regex as in the solution of this post. But since my data is so large (20 million rows x 100k keywords), the query seems to run forever. Is there a better solution?

Anyway, here is the query I use:

 with raw_text as(
  select 'my car was broken' as text
  union all
  select 'nobody knows'
  union all
  select 'the fish is so beautiful'
  )
 ,raw_keywords as(
  select 'car' as keyword
  union all
  select 'beautiful'
  union all
  select 'know'
  union all
  select 'journey'
  )
  -- Cross join: every text row is tested against every keyword
  SELECT text, keyword
    FROM raw_text, raw_keywords
    WHERE REGEXP_CONTAINS(text, keyword)

Answer 1

I can't test with a big dataset, but a solution could be:

Extract all words and save them in a temporary table (use clustering to improve the performance of the join)
Join this temporary table with the keywords table

So this is the code:

First create the temporary table:

create table word_text(
    word STRING,
    text STRING
)
CLUSTER BY word
;

Populate it (I didn't test this with a large amount of data, so this step may take a while):

insert into word_text
-- Transform array to rows
select word, text from(
    -- Transform text to array
    select split(text, ' ') as words, text from (
        select 'my car was broken' as text
        union all
        select 'nobody knows'
        union all
        select 'the fish is so beautiful'
    )
 ),
UNNEST(words) word

Cross the data:

SELECT word_text.text, raw_keywords.keyword
FROM word_text, raw_keywords 
WHERE word_text.word = raw_keywords.keyword
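
Note that joining on exact equality only matches whole words, so 'knows' in the sample text does not equal the keyword 'know', although the question's expected output pairs them. One possible workaround, sketched here as an assumption and not as part of this answer, is to expand each word into all of its prefixes so that the match stays an equality join. The CTE name word_prefixes, the alias len, and the tokenizing pattern are illustrative choices. Prefix matching would also pair 'car' with a word such as 'carpet', and it does not find keywords in the middle of a word, so it is not equivalent to REGEXP_CONTAINS.

```sql
WITH raw_text AS (
  SELECT 'my car was broken' AS text UNION ALL
  SELECT 'nobody knows' UNION ALL
  SELECT 'the fish is so beautiful'
),
raw_keywords AS (
  SELECT 'car' AS keyword UNION ALL
  SELECT 'beautiful' UNION ALL
  SELECT 'know' UNION ALL
  SELECT 'journey'
),
word_prefixes AS (
  -- Tokenize on lowercase alphanumeric runs (assumed tokenizer), then emit
  -- every prefix of every word: 'knows' -> 'k', 'kn', 'kno', 'know', 'knows'.
  SELECT DISTINCT text, SUBSTR(word, 1, len) AS prefix
  FROM raw_text,
       UNNEST(REGEXP_EXTRACT_ALL(LOWER(text), r'[a-z0-9]+')) AS word,
       UNNEST(GENERATE_ARRAY(1, LENGTH(word))) AS len
)
-- Equality join on the prefix instead of a regex test per (text, keyword) pair
SELECT p.text, k.keyword
FROM word_prefixes AS p
INNER JOIN raw_keywords AS k
  ON p.prefix = k.keyword
```

The trade-off is that the intermediate table grows by roughly the average word length, so it is worth checking the cost on a sample before running it on all 20 million rows.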

Answer 2

The following query returns what you want by splitting the text into words and doing an INNER JOIN on the keywords table, without having to create a new table.

WITH splited AS (
  SELECT SPLIT(text, ' ') AS text_split, text FROM `project.dataset.text_tab`
)

SELECT text, keyword
FROM (
  SELECT text, word_inside FROM splited, UNNEST(text_split) AS word_inside
)
INNER JOIN
  `project.dataset.keywords`
ON
  keyword = word_inside;
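
To try the same approach without creating any tables, the following sketch inlines the question's sample data. The CTE names text_tab, keywords and words are placeholders, and the DISTINCT is an addition of this sketch, not part of the answer above.

```sql
-- Inline copies of the question's sample data stand in for the real tables.
WITH text_tab AS (
  SELECT 'my car was broken' AS text UNION ALL
  SELECT 'nobody knows' UNION ALL
  SELECT 'the fish is so beautiful'
),
keywords AS (
  SELECT 'car' AS keyword UNION ALL
  SELECT 'beautiful' UNION ALL
  SELECT 'know' UNION ALL
  SELECT 'journey'
),
words AS (
  -- DISTINCT avoids duplicate output rows when a word repeats within one text.
  SELECT DISTINCT text, word_inside
  FROM text_tab, UNNEST(SPLIT(text, ' ')) AS word_inside
)
SELECT text, keyword
FROM words
INNER JOIN keywords
  ON keyword = word_inside;
```

On the sample data this returns the rows for 'car' and 'beautiful'. The text 'nobody knows' is not paired with 'know', because 'knows' is a different word under exact matching, so the result differs from the REGEXP_CONTAINS query whenever a keyword appears only as part of a longer word.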

solution1
0 2021-11-17 09:38:39

solution2
0 ACCEPTED 2021-11-17 10:53:46
