BigQuery: matching word counts between two tables
Given these two tables (documents and labels), how can I find the match counts for every pattern in the labels table within the document field of the documents table? (Count exact matches; using regex is optional.)
WITH documents AS (
SELECT 1 AS id, "foo bar, foo baz" AS document UNION ALL
SELECT 2, "foo bar bar qux" UNION ALL
SELECT 3, "etc blah blah"
),
labels as (
select 'FOO_LABEL' as label, 'foo' as pattern UNION ALL
select 'FOO_LABEL', 'qux' UNION ALL
select 'BAR_LABEL', 'bar' UNION ALL
select 'ETC_LABEL', 'etc'
)
The expected match counts by document:
id, label, cnt
1, FOO_LABEL, 2
1, BAR_LABEL, 1
2, FOO_LABEL, 2
2, BAR_LABEL, 2
3, ETC_LABEL, 1
The difference from this question is that I need actual match counts.
Unlike this question, my patterns come from a separate table.
There are ~100M documents and ~1000 rows in the labels table.
Mikhail Berlyant's answer above is perfect. I just realized I needed substring matching instead of exact word matching, so I slightly modified it by replacing using(pattern) with ON STRPOS(word, pattern)>0 in the JOIN:
WITH documents AS (
SELECT 1 AS id, "foo bar foobar qux" AS document UNION ALL
SELECT 2, "foooooo barbar"
),
labels as (
select 'FOO_LABEL' as label, 'foo' as pattern UNION ALL
select 'FOO_LABEL', 'qux' UNION ALL
select 'BAR_LABEL', 'bar'
)
select id, label, count(*) cnt
from documents, unnest(regexp_extract_all(document, r'[\w]+')) word
join labels
ON STRPOS(word, pattern)>0 -- faster than regexp_contains(word, pattern)
group by id, label
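For comparison, the exact-word variant mentioned above can be sketched as follows. This is a reconstruction from the description only (an equality join in place of STRPOS), not the original answer verbatim, and it uses the sample data from the question. On that data it should reproduce the expected counts listed in the question.

```sql
WITH documents AS (
  SELECT 1 AS id, "foo bar, foo baz" AS document UNION ALL
  SELECT 2, "foo bar bar qux" UNION ALL
  SELECT 3, "etc blah blah"
),
labels AS (
  SELECT 'FOO_LABEL' AS label, 'foo' AS pattern UNION ALL
  SELECT 'FOO_LABEL', 'qux' UNION ALL
  SELECT 'BAR_LABEL', 'bar' UNION ALL
  SELECT 'ETC_LABEL', 'etc'
)
SELECT id, label, COUNT(*) AS cnt
FROM documents
-- split each document into words; one output row per word occurrence
CROSS JOIN UNNEST(REGEXP_EXTRACT_ALL(document, r'[\w]+')) AS word
-- exact match: the whole word must equal the pattern
JOIN labels ON word = pattern
GROUP BY id, label
```

With the STRPOS join, a word such as "barbar" is counted once for 'bar' because the count is per matching word, not per occurrence inside the word.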
Edit: another minor change to work with phrases/sentences instead of individual words (the document is split into phrases delimited by commas, etc.), so that it works for multi-word patterns. This misses some counts (i.e. repeated substring matches within a phrase) but is faster overall:
WITH documents AS (
SELECT 1 AS id, "foo 123 test foo 123" AS document UNION ALL
SELECT 2, "all bars, all bars all bars "
),
labels as (
select 'FOO_LABEL' as label, 'foo 123' as pattern UNION ALL
select 'BAR_LABEL', 'all bar'
)
select id, label, count(*) cnt
from documents, UNNEST(REGEXP_EXTRACT_ALL(document , r'([^.!;(),\n~|]+)')) AS phrase
join labels
ON STRPOS(phrase, pattern)>0
group by id, label
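If the repeated matches inside a phrase matter, one possible alternative is to count occurrences directly with the LENGTH/REPLACE trick. This is a minimal sketch, not part of the original answer. It assumes non-empty, case-sensitive patterns, counts non-overlapping occurrences, and matches against the whole document, so a pattern containing a delimiter would also match. It has not been benchmarked against the phrase-based query and may well be slower at ~100M documents.

```sql
WITH documents AS (
  SELECT 1 AS id, "foo 123 test foo 123" AS document UNION ALL
  SELECT 2, "all bars, all bars all bars "
),
labels AS (
  SELECT 'FOO_LABEL' AS label, 'foo 123' AS pattern UNION ALL
  SELECT 'BAR_LABEL', 'all bar'
)
SELECT
  id,
  label,
  -- characters removed by REPLACE, divided by the pattern length,
  -- gives the number of non-overlapping occurrences of the pattern
  SUM(DIV(LENGTH(document) - LENGTH(REPLACE(document, pattern, '')),
          LENGTH(pattern))) AS cnt
FROM documents
CROSS JOIN labels
-- keep only (document, pattern) pairs with at least one match
WHERE STRPOS(document, pattern) > 0
GROUP BY id, label
```

On this sample data the phrase-based query counts one FOO_LABEL match for document 1 and two BAR_LABEL matches for document 2, whereas this version counts every occurrence (two and three respectively).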