How to do repeatable sampling in BigQuery Standard SQL?
In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data.
However, the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL.
How would you rewrite the blog's sampling code shown below using Standard SQL?
#legacySQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
[bigquery-samples:airline_ontime_data.flights]
WHERE
ABS(HASH(date)) % 10 < 8
In Standard SQL you would rewrite the query thus:
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8
Specifically, here are the changes:
MOD function (not %).
FARM_FINGERPRINT (not HASH). This is actually a different hashing function than Legacy SQL's HASH, which wasn't in fact consistent over time as the blog had implied.
Based on the accepted answer, here is a more general way to generate a unique key for each row:
TO_JSON_STRING(STRUCT(col1, col2, ..., colN))
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(date, airline, arrival_delay))), 10)) < 8
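Since the question is about train/validation/test partitions, the same fingerprint can drive a three-way split in one pass. The following sketch is not from the answer; the 80/10/10 bucket boundaries are an illustrative choice:

```sql
#standardSQL
-- Assign each row to a split based on its fingerprint bucket (0-9).
-- Buckets 0-7 -> train, 8 -> validation, 9 -> test (80/10/10 is an
-- illustrative choice, not part of the original answer).
SELECT
  date,
  airline,
  arrival_delay,
  CASE
    WHEN ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(date, airline, arrival_delay))), 10)) < 8 THEN 'train'
    WHEN ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(date, airline, arrival_delay))), 10)) = 8 THEN 'validation'
    ELSE 'test'
  END AS split
FROM
  `bigquery-samples.airline_ontime_data.flights`
```

Because the fingerprint is deterministic, every row lands in the same split on every run, which is what makes the partitions replicable.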
What if there is no unique key to identify each row?
Yes, it can happen that your dataset contains duplicated rows by design. With the above query, either all of the duplicates or none of them will be included in the sample set.
Depending on how big your dataset is, you can try to sort the source dataset and use a window function to generate a row_number for each row, then sample based on row_number. This trick will work until you hit an error sorting your dataset:
Resources exceeded during query execution: The query could not be executed in the allotted memory.
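The row_number trick described above might be sketched like this (the 80% threshold and the ordering columns are illustrative assumptions, not from the original answer):

```sql
#standardSQL
-- Number every row deterministically, then keep the first 80% of rows.
-- ORDER BY must use columns that give a stable, repeatable ordering;
-- the global sort here is what eventually hits the memory limit.
SELECT
  * EXCEPT(rn, total)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (ORDER BY date, airline, departure_schedule) AS rn,
    COUNT(*) OVER () AS total
  FROM
    `bigquery-samples.airline_ontime_data.flights`
)
WHERE rn <= CAST(0.8 * total AS INT64)
```

Because duplicates receive distinct row numbers, this samples them independently instead of all-or-nothing.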
What if I do hit the above error?
Well, the above approach is simpler to implement, but if you hit the limit, consider doing something more complex:
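The answer does not spell out the more complex approach. One common workaround (an assumption here, not from the answer) is to number rows within smaller hash partitions, so that no single sort has to cover the whole table:

```sql
#standardSQL
-- Sort within 100 hash buckets instead of globally, so each
-- partition's sort stays within memory limits (illustrative sketch;
-- the bucket count and ordering columns are assumptions).
SELECT
  * EXCEPT(bucket, rn)
FROM (
  SELECT
    *,
    MOD(ABS(FARM_FINGERPRINT(FORMAT('%t', t))), 100) AS bucket,
    ROW_NUMBER() OVER (
      PARTITION BY MOD(ABS(FARM_FINGERPRINT(FORMAT('%t', t))), 100)
      ORDER BY date, airline, departure_schedule
    ) AS rn
  FROM
    `bigquery-samples.airline_ontime_data.flights` t
)
WHERE MOD(rn, 10) < 8  -- keep ~80% of rows within every bucket
```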
Even less verbose versions that I use in my practice, instead of the lengthy TO_JSON_STRING(STRUCT(col1, col2, ..., colN)), are TO_JSON_STRING(t) and FORMAT('%t', t), as in the examples below:
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights` t
WHERE
MOD(ABS(FARM_FINGERPRINT(FORMAT('%t', t))), 10) < 8
and
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights` t
WHERE
MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(t))), 10) < 8