Optimizing LEFT JOIN speed in BigQuery
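I need to join six tables in BigQuery. The data totals 500 MB, and each table contains several thousand rows. I am trying to LEFT JOIN the tables on a shared column. The query is currently estimated to take weeks to run, and it times out long before it gets anywhere near finishing. Is there a better way to optimize this query?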
SELECT
*
FROM
`report1` t1
LEFT JOIN
`report2` t2
ON
t1.campaignid = t2.campaignid
LEFT JOIN
`report3` t3
ON
t1.campaignid = t3.campaignid
LEFT JOIN
`report4` t4
ON
t1.campaignid = t4.campaignid
LEFT JOIN
`report5` t5
ON
t1.campaignid = t5.campaignid
LEFT JOIN
`report6` t6
ON
t1.campaignid = t6.campaignid
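This is too long for a comment.
The problem with your query is not the left join itself. Rather, the problem is (most likely) that the underlying tables have multiple rows per campaignid, so every join multiplies the row count for that campaignid. You can estimate the number of rows the query will generate with something like this: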
-- coalesce(..., 1): a campaignid with no match still produces one row in a left join,
-- and without it a NULL count would drop that campaignid from the sum.
select sum(t1.cnt * coalesce(t2.cnt, 1) * coalesce(t3.cnt, 1) *
           coalesce(t4.cnt, 1) * coalesce(t5.cnt, 1) * coalesce(t6.cnt, 1))
from (select campaignid, count(*) as cnt from `report1` group by 1) t1 left join
     (select campaignid, count(*) as cnt from `report2` group by 1) t2
     using (campaignid) left join
     (select campaignid, count(*) as cnt from `report3` group by 1) t3
     using (campaignid) left join
     (select campaignid, count(*) as cnt from `report4` group by 1) t4
     using (campaignid) left join
     (select campaignid, count(*) as cnt from `report5` group by 1) t5
     using (campaignid) left join
     (select campaignid, count(*) as cnt from `report6` group by 1) t6
     using (campaignid);
This will probably return a very large number, which explains why the query takes so long to run.
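To get a feel for how quickly that number grows, here is a small Python sketch of the same estimate on made-up data. The function name `estimate_left_join_rows` and the sample key lists are hypothetical, chosen only for illustration; the logic mirrors the per-campaignid product of counts described above.

```python
from collections import Counter
from math import prod


def estimate_left_join_rows(base_keys, *other_tables_keys):
    """Estimate the rows produced by LEFT JOINing a base table to several
    other tables on a single shared key (hypothetical helper)."""
    base_counts = Counter(base_keys)
    other_counts = [Counter(keys) for keys in other_tables_keys]
    total = 0
    for key, n in base_counts.items():
        # A key with no match still yields one row in a LEFT JOIN, hence max(..., 1).
        total += n * prod(max(counts[key], 1) for counts in other_counts)
    return total


# Made-up data: six reports, each holding 20 rows for the same campaign id.
reports = [["c1"] * 20 for _ in range(6)]
rows = estimate_left_join_rows(reports[0], *reports[1:])
print(rows)  # 64000000, i.e. 20 ** 6
```

So even 20 rows per campaignid in each of six tables turns a single campaign into 64 million output rows, and the real tables hold thousands of rows.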
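You need to fix the join conditions or structure the query differently. You might want to ask another question with sample data, desired results, and an explanation of the logic.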
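One possible way to structure the query differently, if a single row per campaignid is what you want, is to aggregate each table down to one row per campaignid before joining. The sketch below uses Python's built-in sqlite3 module purely so it can run locally; the `clicks` and `cost` columns and the sample rows are assumptions, and whether SUM is the right aggregate depends on what the reports actually contain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical columns and data; the real report tables are not shown in the question.
    CREATE TABLE report1 (campaignid TEXT, clicks INTEGER);
    CREATE TABLE report2 (campaignid TEXT, cost REAL);
    INSERT INTO report1 VALUES ('c1', 10), ('c1', 5), ('c2', 7);
    INSERT INTO report2 VALUES ('c1', 1.5), ('c1', 2.5);
""")

# Joining the raw tables: rows multiply for every campaignid that repeats.
naive_rows = conn.execute("""
    SELECT COUNT(*)
    FROM report1 t1
    LEFT JOIN report2 t2 ON t1.campaignid = t2.campaignid
""").fetchone()[0]

# Aggregating first: each subquery has one row per campaignid, so the join cannot fan out.
aggregated = conn.execute("""
    SELECT t1.campaignid, t1.clicks, t2.cost
    FROM (SELECT campaignid, SUM(clicks) AS clicks
          FROM report1 GROUP BY campaignid) t1
    LEFT JOIN
         (SELECT campaignid, SUM(cost) AS cost
          FROM report2 GROUP BY campaignid) t2
      ON t1.campaignid = t2.campaignid
    ORDER BY t1.campaignid
""").fetchall()

print(naive_rows)   # 5
print(aggregated)   # [('c1', 15, 4.0), ('c2', 7, None)]
```

The same subquery-per-table pattern carries over to the six-table BigQuery query, with one aggregated subquery per report.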