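Equivalent of string contains in Google BigQuery
I have a table like the one shown below.
I want to create two new binary columns that indicate whether the subject took steroids and aspirin. I would like to do this in both PostgreSQL and Google BigQuery.
I tried the following, but it does not work: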
select subject_id
case when lower(drug) like ('%cortisol%','%cortisone%','%dexamethasone%')
then 1 else 0 end as steroids,
case when lower(drug) like ('%peptide%','%paracetamol%')
then 1 else 0 end as aspirin,
from db.Team01.Table_1
SELECT
db.Team01.Table_1.drug
FROM `table_1`,
UNNEST(table_1.drug) drug
WHERE REGEXP_CONTAINS( db.Team01.Table_1.drug,r'%cortisol%','%cortisone%','%dexamethasone%')
I expect my output to look like this.
Use conditional aggregation. This is a solution that works in most (if not all) RDBMS:
SELECT
subject_id,
MAX(CASE WHEN drug IN ('cortisol', 'cortisone', 'dexamethasone') THEN 1 ELSE 0 END) AS steroids,
MAX(CASE WHEN drug IN ('peptide', 'paracetamol') THEN 1 ELSE 0 END) AS aspirin
FROM db.Team01.Table_1
GROUP BY subject_id
Note: it is unclear why you are using LIKE, since you seem to have exact matches; I turned the LIKE conditions into equalities.
You are missing the GROUP BY:
select subject_id,
sum(case when lower(drug) in ('cortisol','cortisone','dexamethasone')
then 1 else 0 end) as steroids,
sum(case when lower(drug) in ('peptide','paracetamol')
then 1 else 0 end) as aspirin
from db.Team01.Table_1
group by subject_id
Using the LIKE keyword:
select subject_id,
sum(case when lower(drug) like '%cortisol%'
or lower(drug) like '%cortisone%'
or lower(drug) like '%dexamethasone%'
then 1 else 0 end) as steroids,
sum(case when lower(drug) like '%peptide%'
or lower(drug) like '%paracetamol%'
then 1 else 0 end) as aspirin
from db.Team01.Table_1
group by subject_id
In Postgres, I would recommend the FILTER clause:
select subject_id,
count(*) filter (where lower(drug) ~ 'cortisol|cortisone|dexamethasone') as steroids,
count(*) filter (where lower(drug) ~ 'peptide|paracetamol') as aspirin
from db.Team01.Table_1
group by subject_id;
In BigQuery, I would recommend countif():
select subject_id,
countif(regexp_contains(drug, 'cortisol|cortisone|dexamethasone')) as steroids,
countif(regexp_contains(drug, 'peptide|paracetamol')) as aspirin
from db.Team01.Table_1
group by subject_id;
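You can use sum(case when ... end) as a more generic approach. However, each database has a more "native" way of expressing this logic. Incidentally, the FILTER clause is standard SQL, just not widely adopted.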
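As a rough illustration of the generic CASE-based approach, here is a minimal Python sketch that runs it against an in-memory SQLite database, used purely as a stand-in engine rather than Postgres or BigQuery. The table name table_1, the helper drug_flags and the sample rows are made up for illustration, since the original table is not shown here. It uses MAX rather than SUM so that each column comes out as a 0/1 flag instead of a count of matching rows.

```python
import sqlite3

# Hypothetical sample rows (subject_id, drug); not the asker's data.
ROWS = [
    (1, "Cortisol 10mg"),
    (1, "dexamethasone"),
    (1, "Paracetamol"),
    (2, "ibuprofen"),
    (3, "cortisone cream"),
]

# Portable conditional aggregation: MAX over a 0/1 CASE yields a binary flag.
QUERY = """
SELECT subject_id,
       MAX(CASE WHEN lower(drug) LIKE '%cortisol%'
                  OR lower(drug) LIKE '%cortisone%'
                  OR lower(drug) LIKE '%dexamethasone%'
                THEN 1 ELSE 0 END) AS steroids,
       MAX(CASE WHEN lower(drug) LIKE '%peptide%'
                  OR lower(drug) LIKE '%paracetamol%'
                THEN 1 ELSE 0 END) AS aspirin
FROM table_1
GROUP BY subject_id
ORDER BY subject_id
"""


def drug_flags(rows):
    """Load rows into a throwaway SQLite table and return one flag row per subject."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE table_1 (subject_id INTEGER, drug TEXT)")
        conn.executemany("INSERT INTO table_1 VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in drug_flags(ROWS):
        print(row)
```

Swapping MAX for SUM in the same query would return the number of matching rows per subject instead of a flag.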
Below is for BigQuery Standard SQL:
#standardSQL
SELECT
subject_id,
SUM(CASE WHEN REGEXP_CONTAINS(LOWER(drug), r'cortisol|cortisone|dexamethasone') THEN 1 ELSE 0 END) AS steroids,
SUM(CASE WHEN REGEXP_CONTAINS(LOWER(drug), r'peptide|paracetamol') THEN 1 ELSE 0 END) AS aspirin
FROM `db.Team01.Table_1`
GROUP BY subject_id
If applied to the sample data from your question, the result is:
Row subject_id steroids aspirin
1 1 3 1
2 2 1 1
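Note: instead of plain LIKE, which ends up as lengthy and redundant text, I am using "LIKE on steroids", which is REGEXP_CONTAINS.
Another, possibly more intuitive, solution is to use BigQuery's CONTAINS_SUBSTR, which returns a boolean result.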