How can I reduce the amount of data scanned by BigQuery during a query?

Could someone tell me the correct answer to the following multiple-choice question, and explain why it is correct?

You have a query that filters a BigQuery table using a WHERE clause on timestamp and ID columns. By using bq query --dry_run you learn that the query triggers a full scan of the table, even though the filters on timestamp and ID select a tiny fraction of the overall data. You want to reduce the amount of data scanned by BigQuery with minimal changes to existing SQL queries. What should you do?

  1. Create a separate table for each ID.
  2. Use the LIMIT keyword to reduce the number of rows returned.
  3. Recreate the table with a partitioning column and clustering column.
  4. Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.

Assuming these are the only four possible answers, the answer is almost certainly "Recreate the table with a partitioning column and clustering column."

Let's eliminate the others:

  • Use the LIMIT keyword to reduce the number of rows returned.

This isn't going to help at all, since the LIMIT is only applied after the full table scan has already happened, so you'll still be billed for the same number of bytes despite the limit.

  • Create a separate table for each ID.

This doesn't seem likely to help. Besides being an organizational mess, you would have to query every table to find all the right timestamps, so you would process the same amount of data as before (with a lot more work), and the existing queries would need to be rewritten.

  • Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.

You could do this, but the flag does not make the query scan less data. It only makes the query fail when the bytes to be billed exceed the limit you set, so you wouldn't get your results.


So why partitioning and clustering?

BigQuery (on-demand) billing is based on the columns that you select, and the amount of data that you read in those columns. So you want to do everything you can to reduce the amount of data processed.

Depending on the exact query, partitioning by the timestamp allows you to scan only the partitions (for example, the days) that match the filter. This can be a huge saving compared to an entire table scan.

Clustering lets you keep related rows together within a table by sorting the data on the clustering column, so that BigQuery can skip blocks that cannot match the filter (WHERE clause). Thus, you scan less data and reduce your cost. There is a similar benefit for queries that aggregate on the clustering column.

This of course all assumes you have a good understanding of the queries you are actually running and which columns make sense to cluster on.
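
As a rough sketch of what answer 3 could look like, the statement below recreates the table with a partitioning column and a clustering column. Every name in it is a placeholder I made up for illustration (`mydataset.events`, `mydataset.events_partitioned`, `event_ts`, `id`); it assumes `event_ts` is a TIMESTAMP column and `id` is an INT64 column, so adjust it to your actual schema.

```sql
-- Hypothetical names throughout: mydataset.events is the original table,
-- event_ts is assumed to be a TIMESTAMP column and id an INT64 column.
CREATE TABLE mydataset.events_partitioned
PARTITION BY DATE(event_ts)   -- one partition per day of event_ts
CLUSTER BY id                 -- rows sorted by id within each partition
AS
SELECT *
FROM mydataset.events;

-- The existing filter on timestamp and ID is unchanged; only the table differs.
-- The timestamp range lets BigQuery prune partitions, and the id filter lets it
-- skip blocks inside the remaining partitions.
SELECT *
FROM mydataset.events_partitioned
WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00+00'
  AND event_ts <  TIMESTAMP '2024-01-02 00:00:00+00'
  AND id = 12345;
```

If the new table ends up replacing the old one under the same name, the existing queries should not need any changes. Running the query again with bq query --dry_run is a reasonable way to check whether the estimated bytes dropped, keeping in mind that for clustered tables the estimate can be an upper bound rather than the exact amount billed.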

As far as I know, the only ways to limit the number of bytes read by BigQuery are removing column references entirely, removing table references, or partitioning (and, in some cases, clustering).

One of the challenges when starting to use BigQuery is that a query like this:

select *
from t
limit 1;

can be really, really expensive, because it reads every column of the table and the limit does not reduce the bytes scanned.

However, a query like this:

select sum(x)
from t;

on the same table can be quite cheap, because only the column x is read.

To answer the question, you should learn more about how BigQuery bills for usage.
