Join two tables based on the first available non-null value across multiple columns

Question

I have two tables in BigQuery: one details the hierarchy of my organization, and the other contains planning values for different entities. Before I explain further, here is how the tables look:

Table A - Hierarchy
This table is defined at a granular level for each warehouse. It is essentially a flattened hierarchy (Warehouse -> District -> City -> State -> Country).

Country	State	City	District	Warehouse
C1	S1	CY1	D1	WH1
C1	S1	CY1	D1	WH2
C1	S1	CY1	D2	WH3
C1	S1	CY1	D2	WH4
C1	S1	CY2	D3	WH5
C1	S1	CY2	D3	WH6
...	...	...	...	...

Here is the other table: Table B - Planned Values

Frequency	PeriodStart	PeriodEnd	PlanAmount	Territory
MTD	01/01/2022	01/31/2022	500	WH1
YTD	01/01/2022	01/31/2022	790	WH1
...	...	...	...	...
MTD	12/01/2022	12/31/2022	340	WH1
YTD	12/01/2022	12/31/2022	1790	WH1
MTD	01/01/2022	01/31/2022	1500	D1
YTD	01/01/2022	01/31/2022	1800	D1
...	...	...	...	...
MTD	12/01/2022	12/31/2022	1200	D1
YTD	12/01/2022	12/31/2022	6600	D1

I need to join Table A and Table B in the following manner to create a new table (Table C):

The driving table is Table A.
Table B contains planned values for warehouses, districts, cities, etc. However, these planned values may be defined at any level: sometimes at the warehouse level, and sometimes only at the country level.
For every warehouse in Table A, Table C must have the corresponding plan values from Table B at the most granular level possible.
-- For example, Table B already has plan values for warehouse WH1, but does not have plan values for WH2. So, for WH1, Table C shows the plan values as defined in Table B. But for WH2, Table C has to show the district's (D1) plan values instead. If the district-level value is not available, it has to skip to the next available level (all the way up to the country level).

Is anyone able to help me with the logic to create this type of join?

I am unable to think of a logical way to approach this, since I am rather new to SQL. My approach was to create multiple left joins, one for each level, and then use a coalesce, but I fear this will create duplicate values.

Answer 1

First I extract all dates from the column PeriodStart in tableB, so there should be a row with values for each month. If a row should apply to several months, split it into one row per month first (unnest). Table A is then repeated for each date in tableB. For each entry in tableB, the script takes the largest value per month and territory. If, for a given month, there is any match between territory and warehouse, the maximum PlanAmount of those rows in tableB is taken. Otherwise (ifnull) it checks for a match between district and territory, and after that between state and territory.

with tblA as (
  -- sample hierarchy from the question: 2 warehouses per district, 4 in the first city
  select "C1" Country, "S1" State,
    "CY" || cast(1 + div(x - 1, 4) as string) City,
    "D" || cast(1 + div(x - 1, 2) as string) District,
    "WH" || cast(x as string) Warehouse
  from unnest([1,2,3,4,5,6]) x
),
tblB as (
  select date("2022-01-01") PeriodStart, 500 PlanAmount, "WH1" Territory
  union all select date("2022-12-01"), 340, "WH1"
  union all select date("2022-12-01"), 1500, "D1"
),
-- alternative to month_list: every month between the first and last PeriodStart
months as (
  select * from unnest(generate_date_array(
    (select min(PeriodStart) from tblB),
    (select max(PeriodStart) from tblB),
    interval 1 month)) as date_month
),
-- only the months that actually occur in tblB
month_list as (select distinct PeriodStart as date_month from tblB)

select
  date_month, Country, State, City, District, Warehouse,
  -- first non-null plan, from the most granular level upwards
  ifnull(ifnull(max(WHplan), max(Distplan)), max(Stateplan)) as plan
from (
  select date_month, tblA.*,
    wh.PlanAmount as WHplan,
    dist.PlanAmount as Distplan,
    st.PlanAmount as Stateplan
  from tblA
  cross join month_list  -- or: cross join months
  left join tblB wh
    on tblA.Warehouse = wh.Territory and date_month = wh.PeriodStart
  left join tblB dist
    on tblA.District = dist.Territory and date_month = dist.PeriodStart
  -- the state-level join has to compare the State column, not District
  left join tblB st
    on tblA.State = st.Territory and date_month = st.PeriodStart
)
-- Warehouse is part of the grouping so that each warehouse keeps its own fallback
group by 1,2,3,4,5,6
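
The nested ifnull calls in the query above can also be written as a single COALESCE, which returns its first non-null argument and is easier to extend when more levels are added. A tiny illustration follows; the column names and literal values here are made up for the example and are not taken from the question's tables:

```sql
-- Illustrative literals only: each row mimics the plans found at three levels.
SELECT
  wh_plan, dist_plan, state_plan,
  IFNULL(IFNULL(wh_plan, dist_plan), state_plan) AS nested_ifnull,
  COALESCE(wh_plan, dist_plan, state_plan)       AS with_coalesce
FROM UNNEST([
  STRUCT(500 AS wh_plan, 1500 AS dist_plan, 9000 AS state_plan),  -- warehouse plan exists
  STRUCT(CAST(NULL AS INT64), 1500, 9000),                        -- falls back to district
  STRUCT(CAST(NULL AS INT64), CAST(NULL AS INT64), 9000)          -- falls back to state
]);
```

In every row the two result columns agree, so which one to use is mostly a question of readability.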

Please let me know if your dataset is too large for joins.
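
As a rough sketch of how the same idea might be extended to all five levels while keeping Frequency and PeriodEnd from the question, something like the following could work. The names table_a and table_b are placeholders filled with a few sample rows from the question. The sketch assumes that territory codes are unique across levels (no district shares a code with a city, for example) and that Table B holds at most one row per Territory, Frequency, PeriodStart and PeriodEnd. Under that second assumption each left join matches at most one row, so the joins do not multiply rows.

```sql
-- Sketch only: table_a / table_b are placeholder names with inline sample rows
-- copied from the question; replace the two CTEs with the real tables.
WITH table_a AS (
  SELECT * FROM UNNEST([
    STRUCT('C1' AS Country, 'S1' AS State, 'CY1' AS City, 'D1' AS District, 'WH1' AS Warehouse),
    ('C1', 'S1', 'CY1', 'D1', 'WH2'),
    ('C1', 'S1', 'CY1', 'D2', 'WH3'),
    ('C1', 'S1', 'CY1', 'D2', 'WH4'),
    ('C1', 'S1', 'CY2', 'D3', 'WH5'),
    ('C1', 'S1', 'CY2', 'D3', 'WH6')
  ])
),
table_b AS (
  SELECT * FROM UNNEST([
    STRUCT('MTD' AS Frequency, DATE '2022-01-01' AS PeriodStart,
           DATE '2022-01-31' AS PeriodEnd, 500 AS PlanAmount, 'WH1' AS Territory),
    ('YTD', DATE '2022-01-01', DATE '2022-01-31', 790, 'WH1'),
    ('MTD', DATE '2022-01-01', DATE '2022-01-31', 1500, 'D1'),
    ('YTD', DATE '2022-01-01', DATE '2022-01-31', 1800, 'D1')
  ])
),
-- Every (Frequency, PeriodStart, PeriodEnd) combination that has a plan somewhere.
periods AS (
  SELECT DISTINCT Frequency, PeriodStart, PeriodEnd FROM table_b
)
SELECT
  a.Country, a.State, a.City, a.District, a.Warehouse,
  p.Frequency, p.PeriodStart, p.PeriodEnd,
  -- First non-null plan, from the most granular level up to the country.
  COALESCE(wh.PlanAmount, dist.PlanAmount, cty.PlanAmount,
           st.PlanAmount, ctry.PlanAmount) AS PlanAmount
FROM table_a AS a
CROSS JOIN periods AS p              -- one row per warehouse and period
LEFT JOIN table_b AS wh
  ON wh.Territory = a.Warehouse AND wh.Frequency = p.Frequency
 AND wh.PeriodStart = p.PeriodStart AND wh.PeriodEnd = p.PeriodEnd
LEFT JOIN table_b AS dist
  ON dist.Territory = a.District AND dist.Frequency = p.Frequency
 AND dist.PeriodStart = p.PeriodStart AND dist.PeriodEnd = p.PeriodEnd
LEFT JOIN table_b AS cty
  ON cty.Territory = a.City AND cty.Frequency = p.Frequency
 AND cty.PeriodStart = p.PeriodStart AND cty.PeriodEnd = p.PeriodEnd
LEFT JOIN table_b AS st
  ON st.Territory = a.State AND st.Frequency = p.Frequency
 AND st.PeriodStart = p.PeriodStart AND st.PeriodEnd = p.PeriodEnd
LEFT JOIN table_b AS ctry
  ON ctry.Territory = a.Country AND ctry.Frequency = p.Frequency
 AND ctry.PeriodStart = p.PeriodStart AND ctry.PeriodEnd = p.PeriodEnd
ORDER BY a.Warehouse, p.PeriodStart, p.Frequency;
```

With these sample rows, WH1 keeps its own values (500 for MTD, 790 for YTD), WH2 falls back to the D1 values (1500 and 1800), and WH3 to WH6 come out as NULL because the sample has no plan for them at any level. If Table B can hold several rows for the same territory and period, aggregate it first (for example with max, as in the query above) so the joins stay one-to-one.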

Solution 1
0 2022-11-16 22:19:39
