BigQuery DeDuplication on two columns as unique key
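We use BigQuery religiously and have two tables that are updated in parallel by different processes. My problem is that the tables have no unique identifier column, and the goal is to merge the two tables with zero duplicates if possible. The unique identifier is the combination of two columns.
I have tried various MySQL-based queries, but none of them seem to work in BigQuery, so I am posting here for some help. :)
Step 1. Copy the "clean" table into a new merged table.
Step 2. Query the "dirty" (old) table and insert any missing entries.
Query attempt 1: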
SELECT
COUNT(c.*)
FROM
[flash-student-96619:device_data.device_datav3_20160530] AS old
WHERE NOT EXISTS (
SELECT
1
FROM
[flash-student-96619:device_data_v7_merged.20160530] AS new
WHERE
new.dsn = old.dsn
AND new.timestamp = old.timestamp
)
Error: error at: 6.1 - 10.65. Only one query can be executed at a time.
Query attempt 2:
SELECT
*
FROM
[flash-student-96619:device_data.device_datav3_20160530]
WHERE
(dsn, timestamp) NOT IN (
SELECT
dsn,
timestamp
FROM
[flash-student-96619:device_data_v7_merged.20160530]
)
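Error: Encountered " "," ", "" at line 6, column 7. Was expecting: ")" ...
Honestly, I would be glad if I could do this in a single query. I need to take the data from both tables and create a new table containing only the unique rows.
Any help?
Something like the below should work: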
SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY dsn, timestamp) AS dup
  FROM
    [flash-student-96619:device_data.device_datav3_20160530],
    [flash-student-96619:device_data_v7_merged.20160530]
)
WHERE dup = 1
I recommend using an explicit field list in the outer SELECT instead of *, so that you can omit dup from the actual output.
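As a variation on the advice above about leaving dup out: in standard SQL (not the legacy SQL used in the query above), `SELECT * EXCEPT (dup)` is one way to drop the helper column without listing every field. This is only a minimal sketch, assuming standard SQL is enabled. The inline OldData and NewData rows and the `source` column are made-up stand-ins for the real tables, and the `ORDER BY source` is an added assumption so that the copy from the new table is the one kept.

```sql
#standardSQL
-- Hypothetical inline data standing in for the two real tables.
WITH OldData AS (
  SELECT 1 AS dsn, TIMESTAMP '2016-05-30 00:00:00' AS timestamp, 'old' AS source
  UNION ALL
  SELECT 2, TIMESTAMP '2016-05-30 01:00:00', 'old'
),
NewData AS (
  SELECT 2 AS dsn, TIMESTAMP '2016-05-30 01:00:00' AS timestamp, 'new' AS source
  UNION ALL
  SELECT 3, TIMESTAMP '2016-05-30 02:00:00', 'new'
)
-- EXCEPT (dup) keeps every column except the helper row number.
SELECT * EXCEPT (dup)
FROM (
  SELECT
    *,
    -- Number the rows within each (dsn, timestamp) key.
    -- 'new' sorts before 'old', so the NewData copy gets dup = 1.
    ROW_NUMBER() OVER (PARTITION BY dsn, timestamp ORDER BY source) AS dup
  FROM (
    SELECT * FROM OldData
    UNION ALL
    SELECT * FROM NewData
  )
)
WHERE dup = 1;
```

With this sample data the key (2, 2016-05-30 01:00:00) appears in both inputs, so the result should contain three rows, one per distinct key. Standard SQL needs the explicit UNION ALL here because the comma between tables means a union only in legacy SQL.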
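A bit late, but I want to point out that your original query works with minor modifications using standard SQL (uncheck the "Use Legacy SQL" box under "Show Options"). I just had to change new to something else, since it is a reserved keyword. For example, this query works: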
WITH OldData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
  FROM UNNEST([1, 2, 3, 4]) AS x),
NewData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
  FROM UNNEST([5, 2, 1, 6]) AS x)
SELECT
  COUNT(*)
FROM OldData oldData
WHERE NOT EXISTS (
  SELECT 1
  FROM NewData newData
  WHERE
    newData.dsn = oldData.dsn
    AND newData.timestamp = oldData.timestamp
);
+-----+
| f0_ |
+-----+
|   2 |
+-----+
As for your second attempt, you can do the following:
WITH OldData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
  FROM UNNEST([1, 2, 3, 4]) AS x),
NewData AS (
  SELECT
    x AS dsn,
    TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL x HOUR) AS timestamp
  FROM UNNEST([5, 2, 1, 6]) AS x)
SELECT
  *
FROM OldData
WHERE
  STRUCT(dsn, timestamp) NOT IN (
    SELECT AS STRUCT
      dsn,
      timestamp
    FROM NewData);
+-----+---------------------+
| dsn |      timestamp      |
+-----+---------------------+
|   3 | 2016-07-21 11:54:08 |
|   4 | 2016-07-21 10:54:08 |
+-----+---------------------+