How to return the difference between string values in the same column using grouped string comparison in BigQuery SQL?

Question

I have a table with many products, for example:

product	brand
colgate smile 250gr	colgate
colgate fresh breath 250gr	colgate
colgate mint 250gr	colgate
relx pod pro mango - 1pod	relx
relx pod pro lychee - 1pod	relx
soju jinro chamisul green grape 360ml	jinro
soju jinro chamisul strawberry 360ml	jinro
soju jinro chamisul apple grape 360ml	jinro

into

product	brand	word
colgate smile 250gr	colgate	smile
colgate fresh breath 250gr	colgate	fresh breath
colgate mint 250gr	colgate	mint
relx pod pro mango - 1pod	relx	mango
relx pod pro lychee - 1pod	relx	lychee
soju jinro chamisul green grape 360ml	jinro	green grape
soju jinro chamisul strawberry 360ml	jinro	strawberry
soju jinro chamisul apple 360ml	jinro	apple

I want to group by brand, get the difference between the strings, and return it as a new column. How do I do this transformation? Should I check for regexp_contains(str_1, str_2_split) = false and return the value?

Answer 1

Consider the naive approach below:

Split each product into distinct words.
Identify the words that are repeated in all rows of the same brand.
Join back to the original table and remove (replace with an empty string) all such words.
Trim whatever is left and, optionally, replace runs of multiple spaces with a single space.
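
To make the first two steps concrete, here is a minimal sketch that only lists the common words per brand. The inline your_table CTE with three sample rows and the names product_words and brand_sizes are assumptions for illustration, not part of the original answer. It counts distinct products rather than rows, so it behaves differently from the full query only if a product appears more than once.

```sql
-- Hypothetical inline sample data standing in for the real table
with your_table as (
  select 'colgate smile 250gr' as product, 'colgate' as brand union all
  select 'colgate fresh breath 250gr', 'colgate' union all
  select 'colgate mint 250gr', 'colgate'
),
-- Step 1: one row per (brand, product, distinct word)
product_words as (
  select distinct brand, product, word
  from your_table, unnest(split(product, ' ')) as word
),
-- Number of distinct products per brand
brand_sizes as (
  select brand, count(distinct product) as cnt
  from your_table
  group by brand
)
-- Step 2: keep the words that occur in every product of the brand
select brand, word
from product_words
join brand_sizes using (brand)
group by brand, word, cnt
having count(*) = cnt
```

For the three colgate rows this should leave only the words colgate and 250gr, which are the ones the later steps strip out of each product.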

So the query would look like this:

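with common_words as (
  -- one row per brand: the words shared by all of its products,
  -- joined with '|' so they can be used as a regex alternation
  select brand,
    r'' || array_to_string(array(
      select word
      from t.words word
      group by word
      -- a word is common if it appears in every row of the brand
      having count(*) = cnt
    ), '|') words
  from (
    -- cnt = rows per brand; words = the per-row distinct words, concatenated
    select brand, count(*) cnt, array_concat_agg(words) words
    from (
      -- distinct words of each product
      select brand, array(
          select distinct word
          from unnest(split(product, ' ')) word
        ) words
      from your_table
    )
    group by brand
  ) t
)
-- remove the common words, trim, and collapse repeated whitespace
select product, brand,
  regexp_replace(trim(regexp_replace(product, words, '')), r'\s+', ' ') as diff
from your_table
join common_words
using (brand)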
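
One caveat with the query above: the common words are joined into a regular expression without escaping, and the pattern is not anchored to word boundaries, so a common word can also match inside a longer word, and words containing regex metacharacters may not behave as literals. A possible variant, sketched below as an assumption rather than as part of the original answer, keeps the common words as an array (the column name common and the inline sample rows are hypothetical) and filters the split tokens instead of using a regex.

```sql
-- Hypothetical inline sample data standing in for the real table
with your_table as (
  select 'relx pod pro mango - 1pod' as product, 'relx' as brand union all
  select 'relx pod pro lychee - 1pod', 'relx'
),
common_words as (
  -- same logic as the original CTE, but the words stay in an array
  select brand,
    array(
      select word
      from t.words word
      group by word
      having count(*) = cnt
    ) as common
  from (
    select brand, count(*) cnt, array_concat_agg(words) words
    from (
      select brand, array(
          select distinct word
          from unnest(split(product, ' ')) word
        ) words
      from your_table
    )
    group by brand
  ) t
)
select product, brand,
  -- keep only the tokens that are not common, in their original order
  array_to_string(array(
    select word
    from unnest(split(product, ' ')) word with offset pos
    where word != '' and word not in unnest(common)
    order by pos
  ), ' ') as diff
from your_table
join common_words
using (brand)
```

Because whole tokens are compared, no trimming or whitespace collapsing is needed afterwards; the trade-off is that the comparison is exact and case-sensitive per token.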

If applied to the sample data in your question, the output is


Solution 1 (accepted, 2022-08-24 21:11:31)
