I have about 20 million rows of text data and want to label each row based on a list of keywords (about 100k keywords). My text data looks like this:
text |
---|
my car was broken |
nobody knows |
the fish is so beautiful |
The keywords look like this:
keywords |
---|
car |
beautiful |
know |
journey |
My expected output looks like this, where each row of the text column is matched against the keywords with a regex:
text | keywords |
---|---|
my car was broken | car |
nobody knows | know |
the fish is so beautiful | beautiful |
I use a regex as in the solution of this post. But since my data is so large (20 million rows x 100k keywords), the query seems to run forever. Is there a better solution?
Here is the query I use:
with raw_text as(
select 'my car was broken' as text
union all
select 'nobody knows'
union all
select 'the fish is so beautiful'
)
,raw_keywords as(
select 'car' as keyword
union all
select 'beautiful'
union all
select 'know'
union all
select 'journey'
)
SELECT text, keyword
FROM raw_text, raw_keywords
WHERE REGEXP_CONTAINS(text, keyword)
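I can't test with a big dataset, but one solution is to split each text into words, store the (word, text) pairs in a table clustered by word, and then join that table to the keywords on equality instead of using a regex.
Here is the code: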
First create the temporary table:
create table word_text(
word STRING,
text STRING
)
CLUSTER BY word
;
Populate it (I didn't test this with a large amount of data, so this step may take a while):
insert into word_text
-- Transform array to rows
select word, text from(
-- Transform text to array
select split(text, ' ') as words, text from (
select 'my car was broken' as text
union all
select 'nobody knows'
union all
select 'the fish is so beautiful'
)
),
UNNEST(words) word
Then join the words to the keywords:
SELECT word_text.text, raw_keywords.keyword
FROM word_text, raw_keywords
WHERE word_text.word = raw_keywords.keyword
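The following query splits the text into words and does an INNER JOIN on the keywords table, without having to create a new table. Because the join uses equality, it matches whole words only: 'car' and 'beautiful' are found, but 'knows' does not match the keyword 'know'.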
WITH splited AS (
SELECT SPLIT(text, ' ') AS text_split, text FROM `project.dataset.text_tab`
)
SELECT text, keyword
FROM (
SELECT text, word_inside FROM splited, UNNEST(text_split) AS word_inside
)
INNER JOIN
`project.dataset.keywords`
ON
keyword = word_inside;
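If a keyword should also match longer words that begin with it (such as 'know' in 'knows'), one possible variation is to expand every word into its prefixes and keep the equality join. This is only a sketch under assumptions: it reuses the table names from the query above, assumes the keywords are stored in lowercase and contain only the letters a-z, and assumes prefix matching is acceptable. It is not tested on a large dataset.

```sql
-- Sketch only: prefix matching is an assumption, not part of the answers above.
WITH words AS (
  -- Lowercase the text and extract alphabetic tokens,
  -- so punctuation and letter case do not block a match.
  SELECT text, word
  FROM `project.dataset.text_tab`,
    UNNEST(REGEXP_EXTRACT_ALL(LOWER(text), r'[a-z]+')) AS word
),
prefixes AS (
  -- Expand each word into all of its prefixes:
  -- 'knows' -> 'k', 'kn', 'kno', 'know', 'knows'.
  -- DISTINCT avoids duplicate (text, prefix) pairs when words repeat.
  SELECT DISTINCT text, SUBSTR(word, 1, n) AS prefix
  FROM words,
    UNNEST(GENERATE_ARRAY(1, LENGTH(word))) AS n
)
SELECT p.text, k.keyword
FROM prefixes AS p
INNER JOIN `project.dataset.keywords` AS k
  ON k.keyword = p.prefix;
```

The join condition is still an equality, so it avoids evaluating a regex for every text and keyword pair. The cost is that the intermediate data grows by roughly the average word length. Prefix matching also returns pairs such as 'carpet' with the keyword 'car'. Unlike `REGEXP_CONTAINS`, it does not find keywords in the middle of a word.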