Ignoring rows with a NULL join column in a Hive query

Question

I have three tables, A, B, and C. A has 1 billion records, B has 10 million records, and C has 5 million records. My query looks like this:

select * from tableA a left outer join tableB b on a.id=b.id left outer join tableC c on b.id=c.id;

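After the first join, I will have more than 990 million rows where b.id is NULL. The second join, on table C, then has to process all 990 million of those NULL rows (b.id), which leaves a single reducer loaded for a very long time. Is there a way to avoid the rows with a NULL join column?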

Answer 1

We have used rand() for the NULLs, so our join condition becomes

coalesce(b.id, rand()) = c.id

This way the NULLs get distributed on their own. I do wonder, though, why the skewjoin setting did not help (we tried coalesce(b.id, 'SomeString') = c.id with skewjoin enabled).
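
A minimal sketch of how that condition might sit inside the full query from the question. It assumes the id columns are positive BIGINT values, so the negative salt (a choice made for this sketch, not part of the condition above) can never equal a real c.id and the NULL rows still come back unmatched. The bucket count of 1000 is likewise arbitrary.

```sql
-- Assumption: a.id, b.id and c.id are positive BIGINT values.
-- Rows with a NULL b.id get a random key in the range -1000..-1,
-- so they are spread over many join keys instead of a single NULL key.
select *
   from tableA a
       left outer join tableB b on a.id = b.id
       left outer join tableC c
           on coalesce(b.id, cast(-1 - floor(rand() * 1000) as bigint)) = c.id;
```

Keeping the salt the same type as the id column avoids an implicit cast on the join key, which is why this sketch uses a negative integer rather than the raw rand() value.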

Answer 2

Add a b.id is not null condition to the ON clause. Depending on your Hive version, this may help:

select * 
   from tableA a 
       left outer join tableB b on a.id=b.id 
       left outer join tableC c on b.id=c.id and b.id is not null;

As far as I know, though, this has not been a problem since version 0.14.

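You can also separate the NULL rows from the non-NULL rows and join only the non-NULL ones. The first query selects only the NULL rows and adds NULL as col for each column of table C. Then use UNION ALL to append all the non-NULL rows, joined to C: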

with a as(
select a.*, b.id as b_id_col --alias any other b columns the same way to avoid duplicate column names
   from tableA a 
       left outer join tableB b on a.id=b.id
)

select a.*, null as c_col1 --add all other columns (from c) as null to get the same schema
   from a where a.b_id_col is null
UNION ALL
select a.*, c.*
   from a 
       left outer join tableC c on a.b_id_col=c.id
   where a.b_id_col is not null

Solution 1
2 2017-11-15 04:23:43

Solution 2
1 2017-11-03 14:27:47
