How to return the difference in string values from the same column by doing a grouped string comparison in BigQuery SQL?

I have a table with a lot of products, for example:

| product | brand |
| --- | --- |
| colgate smile 250gr | colgate |
| colgate fresh breath 250gr | colgate |
| colgate mint 250gr | colgate |
| relx pod pro mango - 1pod | relx |
| relx pod pro lychee - 1pod | relx |
| soju jinro chamisul green grape 360ml | jinro |
| soju jinro chamisul strawberry 360ml | jinro |
| soju jinro chamisul apple grape 360ml | jinro |

into

| product | brand | word |
| --- | --- | --- |
| colgate smile 250gr | colgate | smile |
| colgate fresh breath 250gr | colgate | fresh breath |
| colgate mint 250gr | colgate | mint |
| relx pod pro mango - 1pod | relx | mango |
| relx pod pro lychee - 1pod | relx | lychee |
| soju jinro chamisul green grape 360ml | jinro | green grape |
| soju jinro chamisul strawberry 360ml | jinro | strawberry |
| soju jinro chamisul apple 360ml | jinro | apple |

I want to group by brand, get the difference between the product strings, and return it as a new column. How do I do this transformation? Should I check for regexp_contains(str_1, str_2_split) = false and return the value?

Consider the naïve approach below:

  • split each product into distinct words
  • identify the words that are repeated in all rows of the same brand
  • join back to the original table and remove (replace with an empty string) all such words
  • trim whatever is left and [optionally] replace runs of multiple spaces with a single space
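To make the first two steps concrete, here is a minimal sketch for a single brand. The three rows are literals copied from the question, and the hard-coded 3 is an assumption standing in for that brand's row count, which a full query would compute per brand.

```sql
-- Steps 1 and 2 for one brand only; no table is needed.
select word
from (
  -- step 1: the distinct words of each product
  select array(
      select distinct w
      from unnest(split(product, ' ')) w
    ) words
  from unnest([
    'colgate smile 250gr',
    'colgate fresh breath 250gr',
    'colgate mint 250gr'
  ]) product
) t, t.words word
group by word
-- step 2: keep the words seen in every row (3 = row count of this brand)
having count(*) = 3
```

This should return colgate and 250gr, the two words that every colgate row shares.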

So the query would look like this:

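with common_words as (
  -- per brand: a regex alternation (word1|word2|...) of the words found in every product
  select brand, 
    array_to_string(array(
      select word
      from t.words word
      group by word
      having count(*) = cnt  -- the word occurs in all cnt products of the brand
    ), '|') words
  from (
    -- per brand: the number of products and all their distinct-word arrays concatenated
    select brand, count(*) cnt, array_concat_agg(words) words
    from (
      -- per product: its distinct words
      select brand, array(
          select distinct word
          from unnest(split(product, ' ')) word
        ) words
      from your_table
    )
    group by brand
  ) t
)
select product, brand, 
  -- remove the common words, trim, then collapse repeated whitespace
  regexp_replace(trim(regexp_replace(product, words, '')), r'\s+', ' ') as diff
from your_table
join common_words
using (brand)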

If applied to the sample data in your question, the output is:

[image: query output]
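One caveat with the query above: the common words are pasted straight into a regular expression, so a word containing regex metacharacters (such as + or parentheses), or a common word that also occurs inside a longer word, may be matched in unintended ways. A possible variant, sketched here and not part of the original answer, keeps the common words as an array and rebuilds each product from the words that are not in it. The inline your_table CTE is a stand-in that simply holds the sample rows from the question so the sketch can be run as-is.

```sql
-- Stand-in for your_table: sample rows copied from the question.
with your_table as (
  select 'colgate smile 250gr' as product, 'colgate' as brand union all
  select 'colgate fresh breath 250gr', 'colgate' union all
  select 'colgate mint 250gr', 'colgate' union all
  select 'relx pod pro mango - 1pod', 'relx' union all
  select 'relx pod pro lychee - 1pod', 'relx' union all
  select 'soju jinro chamisul green grape 360ml', 'jinro' union all
  select 'soju jinro chamisul strawberry 360ml', 'jinro' union all
  select 'soju jinro chamisul apple grape 360ml', 'jinro'
),
common_words as (
  -- per brand: an array of the words that occur in every product
  select brand, array(
      select word
      from t.words word
      group by word
      having count(*) = cnt
    ) words
  from (
    select brand, count(*) cnt, array_concat_agg(words) words
    from (
      select brand, array(
          select distinct word
          from unnest(split(product, ' ')) word
        ) words
      from your_table
    )
    group by brand
  ) t
)
select product, brand,
  -- keep the non-common words in their original order, joined by single spaces
  array_to_string(array(
    select word
    from unnest(split(product, ' ')) word with offset pos
    where word != '' and word not in unnest(words)
    order by pos
  ), ' ') as diff
from your_table
join common_words
using (brand)
```

On these sample rows this should return smile, fresh breath, mint, mango, lychee, green grape, strawberry and apple grape. The last one keeps grape because the input row in the question reads apple grape, and grape is not shared by the strawberry row.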