
How to optimize repeated regex matching with a cross join in BigQuery

I have about 20 million rows of text data and want to label them using a list of about 100k keywords. My text data looks like this:

text
my car was broken
nobody knows
the fish is so beautiful

The keywords look like this:

keywords
car
beautiful
know
journey

My expected output looks like this, where each row of the text column is matched against the keywords by regex:

text                      keywords
my car was broken         car
nobody knows              know
the fish is so beautiful  beautiful

I use a regex as in the solution of this post. But since my data is so large (20 million rows x 100k keywords), the query seems to run forever. Is there a better solution?

Here is the query I use:

 with raw_text as(
 select 'my car was broken' as text
 union all
 select 'nobody knows'
 union all
 select 'the fish is so beautiful'
 )
 ,raw_keywords as(
  select 'car' as keyword
  union all
  select 'beautiful'
  union all
  select 'know'
  union all
  select 'journey'
  )
  SELECT  text, keyword
    FROM raw_text, raw_keywords 
    WHERE REGEXP_CONTAINS(text, keyword) 

I can't test with a big dataset, but one possible solution is:

  • Extract all the words and save them in a temporary table (use clustering to improve the performance of the join)
  • Join this temporary table with the keywords table

So this is the code:

First create the temporary table:

create table word_text(
    word STRING,
    text STRING
)
CLUSTER BY word
;

Populate it (I didn't test this with a large amount of data, so this step may take a while):

insert into word_text
-- Transform array to rows
select word, text from(
    -- Transform text to array
    select split(text, ' ') as words, text from (
        select 'my car was broken' as text
        union all
        select 'nobody knows'
        union all
        select 'the fish is so beautiful'
    )
 ),
UNNEST(words) word   
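
As an aside, the create and populate steps can probably be merged into a single statement. The following is a minimal sketch, assuming a dataset named mydataset already exists (the name is hypothetical) and using the same inline sample rows in place of the real text table:

```sql
-- Sketch only: mydataset is a hypothetical dataset name.
-- Creates the clustered table and fills it in one statement.
CREATE TABLE mydataset.word_text
CLUSTER BY word
AS
SELECT word, t.text
FROM (
  SELECT 'my car was broken' AS text
  UNION ALL SELECT 'nobody knows'
  UNION ALL SELECT 'the fish is so beautiful'
) AS t,
-- One output row per space-separated word of each text.
UNNEST(SPLIT(t.text, ' ')) AS word;
```

This avoids declaring the column types separately, since they are inferred from the SELECT.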


Join the data:

SELECT word_text.text, raw_keywords.keyword
FROM word_text, raw_keywords 
WHERE word_text.word = raw_keywords.keyword 
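
One caveat worth checking against the expected output: the equality join only matches whole words, so 'knows' does not match the keyword 'know', whereas REGEXP_CONTAINS did. Splitting on a single space is also sensitive to case and punctuation. The sketch below shows one possible way to normalize the words first. The [a-z0-9]+ pattern is an assumption about what counts as a word and would need adapting for non-ASCII text.

```sql
-- Sketch only: inline sample data stands in for the real tables.
WITH raw_text AS (
  SELECT 'my car was broken' AS text
  UNION ALL SELECT 'nobody knows'
  UNION ALL SELECT 'the fish is so beautiful'
),
raw_keywords AS (
  SELECT 'car' AS keyword
  UNION ALL SELECT 'beautiful'
  UNION ALL SELECT 'know'
  UNION ALL SELECT 'journey'
),
words AS (
  -- Lowercase the text, keep only alphanumeric runs,
  -- and keep one row per distinct (text, word) pair.
  SELECT DISTINCT text, word
  FROM raw_text,
  UNNEST(REGEXP_EXTRACT_ALL(LOWER(text), r'[a-z0-9]+')) AS word
)
SELECT w.text, k.keyword
FROM words AS w
INNER JOIN raw_keywords AS k
  ON w.word = k.keyword
```

With the sample data this returns the 'car' and 'beautiful' rows only. 'nobody knows' is not returned, because no word in it equals 'know'.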

The following query returns what you want by splitting the text into words and doing an INNER JOIN on the keywords table, without having to create a new table.

WITH splited AS (
  SELECT SPLIT(text, ' ') AS text_split, text FROM `project.dataset.text_tab`
)

SELECT text, keyword
FROM (
  SELECT text, word_inside FROM splited, UNNEST(text_split) AS word_inside
)
INNER JOIN
  `project.dataset.keywords`
ON
  keyword = word_inside;
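
If substring matches inside a word are required (as with 'knows' matching 'know' in the question), one possible compromise is to match the keywords against the distinct words only and then join back to the text. This is a sketch under assumptions and has not been tested at scale. The number of distinct words is typically much smaller than the number of rows, so the expensive comparison runs on a much smaller cross join. STRPOS does a literal substring search, so keywords are not treated as regex patterns, and a keyword containing a space would never match a single word. The CTE names vocab and vocab_match are made up for illustration.

```sql
-- Sketch only: inline sample data stands in for the real tables.
WITH raw_text AS (
  SELECT 'my car was broken' AS text
  UNION ALL SELECT 'nobody knows'
  UNION ALL SELECT 'the fish is so beautiful'
),
raw_keywords AS (
  SELECT 'car' AS keyword
  UNION ALL SELECT 'beautiful'
  UNION ALL SELECT 'know'
  UNION ALL SELECT 'journey'
),
words AS (
  -- One row per word of each text.
  SELECT text, word
  FROM raw_text, UNNEST(SPLIT(text, ' ')) AS word
),
vocab AS (
  -- Each distinct word once, so the cross join below stays small.
  SELECT DISTINCT word FROM words
),
vocab_match AS (
  -- Literal substring test on distinct words only.
  SELECT v.word, k.keyword
  FROM vocab AS v
  CROSS JOIN raw_keywords AS k
  WHERE STRPOS(v.word, k.keyword) > 0
)
SELECT DISTINCT w.text, m.keyword
FROM words AS w
INNER JOIN vocab_match AS m
  ON w.word = m.word
```

With the sample data this returns the three rows from the expected output in the question.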
