How to do repeatable sampling in BigQuery Standard SQL?
In this blog a Google Cloud employee explains how to do repeatable sampling of data sets for machine learning in BigQuery. This is very important for creating (and replicating) train/validation/test partitions of your data.
However, the blog uses Legacy SQL, which Google has now deprecated in favor of Standard SQL.
How would you rewrite the blog's sampling code shown below using Standard SQL?
#legacySQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
[bigquery-samples:airline_ontime_data.flights]
WHERE
ABS(HASH(date)) % 10 < 8
In Standard SQL you would rewrite the query thus:
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8
Specifically, here are the changes:
MOD function (not %).
FARM_FINGERPRINT (not HASH). This is actually a different hashing function than Legacy SQL's HASH, which wasn't in fact consistent over time as the blog had implied.
Based on the accepted answer, here is a more general way to generate a unique key for each row:
TO_JSON_STRING(STRUCT(col1, col2, ..., colN))
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights`
WHERE
ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(date, airline, arrival_delay))), 10)) < 8
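Since the question is about train/validation/test partitions, the same fingerprint can drive a three-way split in one pass. The following sketch is not from the answer; the 80/10/10 bucket boundaries are an illustrative choice:

```sql
#standardSQL
-- Assign each row to a split based on its fingerprint bucket (0-9).
-- Buckets 0-7 -> train, 8 -> validation, 9 -> test (80/10/10 is an
-- illustrative choice, not part of the original answer).
SELECT
  date,
  airline,
  arrival_delay,
  CASE
    WHEN ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(date, airline, arrival_delay))), 10)) < 8 THEN 'train'
    WHEN ABS(MOD(FARM_FINGERPRINT(TO_JSON_STRING(STRUCT(date, airline, arrival_delay))), 10)) = 8 THEN 'validation'
    ELSE 'test'
  END AS split
FROM
  `bigquery-samples.airline_ontime_data.flights`
```

Because the fingerprint is deterministic, every row lands in the same split on every run, which is what makes the partitions replicable.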
What if there is no unique key to identify each row?
Yes, it can happen that your dataset contains duplicated rows by design. With the above query, either all of the duplicates or none of them will be included in the sample set.
Depending on how big your dataset is, you can try to sort the source dataset and use a window function to generate a row_number for each row, then sample based on row_number. This trick will work until you hit an error sorting your dataset:
Resources exceeded during query execution: The query could not be executed in the allotted memory.
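The row_number trick described above might be sketched like this (the 80% threshold and the ordering columns are illustrative assumptions, not from the original answer):

```sql
#standardSQL
-- Number every row deterministically, then keep the first 80% of rows.
-- ORDER BY must use columns that give a stable, repeatable ordering;
-- the global sort here is what eventually hits the memory limit.
SELECT
  * EXCEPT(rn, total)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (ORDER BY date, airline, departure_schedule) AS rn,
    COUNT(*) OVER () AS total
  FROM
    `bigquery-samples.airline_ontime_data.flights`
)
WHERE rn <= CAST(0.8 * total AS INT64)
```

Because duplicates receive distinct row numbers, this samples them independently instead of all-or-nothing.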
What if I do hit the above error?
Well, the above approach is simpler to implement, but if you hit the limit, consider doing something more complex:
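The answer does not spell out the more complex approach. One common workaround (an assumption here, not from the answer) is to number rows within smaller hash partitions, so that no single sort has to cover the whole table:

```sql
#standardSQL
-- Sort within 100 hash buckets instead of globally, so each
-- partition's sort stays within memory limits (illustrative sketch;
-- the bucket count and ordering columns are assumptions).
SELECT
  * EXCEPT(bucket, rn)
FROM (
  SELECT
    *,
    MOD(ABS(FARM_FINGERPRINT(FORMAT('%t', t))), 100) AS bucket,
    ROW_NUMBER() OVER (
      PARTITION BY MOD(ABS(FARM_FINGERPRINT(FORMAT('%t', t))), 100)
      ORDER BY date, airline, departure_schedule
    ) AS rn
  FROM
    `bigquery-samples.airline_ontime_data.flights` t
)
WHERE MOD(rn, 10) < 8  -- keep ~80% of rows within every bucket
```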
Even less verbose versions that I use in my practice, instead of the lengthy TO_JSON_STRING(STRUCT(col1, col2, ..., colN)), are TO_JSON_STRING(t) and FORMAT('%t', t), as in the examples below:
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights` t
WHERE
MOD(ABS(FARM_FINGERPRINT(FORMAT('%t', t))), 10) < 8
and
#standardSQL
SELECT
date,
airline,
departure_airport,
departure_schedule,
arrival_airport,
arrival_delay
FROM
`bigquery-samples.airline_ontime_data.flights` t
WHERE
MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(t))), 10) < 8