
How can I reduce the amount of data scanned by BigQuery during a query?

Please someone tell and explain the correct answer to the following Multiple Choice Question?

You have a query that filters a BigQuery table using a WHERE clause on timestamp and ID columns. By using bq query --dry_run you learn that the query triggers a full scan of the table, even though the filter on timestamp and ID selects a tiny fraction of the overall data. You want to reduce the amount of data scanned by BigQuery with minimal changes to existing SQL queries. What should you do?

  1. Create a separate table for each ID.
  2. Use the LIMIT keyword to reduce the number of rows returned.
  3. Recreate the table with a partitioning column and clustering column.
  4. Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.

Assuming these are the only four possible answers, the answer is almost certainly "Recreate the table with a partitioning column and clustering column."

Let's eliminate the others:

  • Use the LIMIT keyword to reduce the number of rows returned.

This won't help at all: LIMIT is applied only after the full table scan has already happened, so you'll still be billed the same amount despite the limit.
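You can see this for yourself with a dry run (the dataset and table names here are hypothetical). Both commands should report the same estimated bytes processed, because the LIMIT does not reduce the scan:

```shell
# Dry runs estimate bytes processed without running the query or billing anything.
# The LIMIT version scans just as much data as the unlimited version.
bq query --use_legacy_sql=false --dry_run \
  'SELECT * FROM mydataset.events'
bq query --use_legacy_sql=false --dry_run \
  'SELECT * FROM mydataset.events LIMIT 10'
```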

  • Create a separate table for each ID.

This is unlikely to help. In addition to being an organizational mess, you would then have to query every one of those tables to find the right timestamps, processing the same amount of data as before (but with a lot more work).

  • Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.

You could do this, but then the query would simply fail whenever it would scan more bytes than the limit allows, so you still wouldn't get your results.
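For reference, the flag looks like this (table name hypothetical). It acts as a cost guardrail, not a scan reducer: queries estimated to exceed the cap are rejected without being billed.

```shell
# Rejects the query (without billing) if it would process
# more than roughly 10 MB; it does not make the query cheaper.
bq query --use_legacy_sql=false --maximum_bytes_billed=10000000 \
  'SELECT * FROM mydataset.events'
```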


So why partitioning and clustering?

BigQuery on-demand billing is based on the columns you select and the amount of data read in those columns. So you want to do everything you can to reduce the amount of data processed.

Depending on the exact query, partitioning by the timestamp allows you to only scan the data for the relevant days. This can obviously be a huge savings compared to an entire table scan.

Clustering lets you colocate commonly used data within a table by sorting on the clustering column(s), so BigQuery can skip blocks that cannot match the filter (WHERE clause). Thus, you scan less data and reduce your cost. There is a similar benefit for aggregations.
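A minimal sketch of the recreation step, using BigQuery DDL via the bq CLI (all table and column names are hypothetical; adapt them to your schema):

```shell
# Recreate the table partitioned by day on the timestamp column
# and clustered by id. Existing queries need no SQL changes --
# the same WHERE filters now prune partitions and skip blocks.
bq query --use_legacy_sql=false '
CREATE TABLE mydataset.events_new
PARTITION BY DATE(event_timestamp)
CLUSTER BY id
AS SELECT * FROM mydataset.events'
```

Once you've verified the new table, you can swap it in for the old one; the queries themselves stay as they were, which is exactly the "minimal changes to existing SQL" the question asks for.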

This all assumes, of course, that you have a good understanding of the queries you actually run and which columns make sense to cluster on.

As far as I know, the only ways to limit the number of bytes BigQuery reads are to remove column references entirely, remove table references, or use partitioning (and, in some cases, clustering).
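Assuming a table partitioned by day on a timestamp column and clustered by an id column (hypothetical names again), a dry run of the filtered query should show far fewer bytes than a full table scan:

```shell
# Partition pruning restricts the scan to the matching days,
# and clustering on id lets BigQuery skip non-matching blocks.
bq query --use_legacy_sql=false --dry_run \
  'SELECT * FROM mydataset.events_partitioned
   WHERE event_timestamp >= TIMESTAMP("2023-01-01")
     AND id = "abc"'
```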

One of the challenges when starting to use BigQuery is that a query like this:

select *
from t
limit 1;

can be really, really expensive.

However, a query like this:

select sum(x)
from t;

on the same table can be quite cheap.
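The difference comes down to which columns each query touches, and a dry-run comparison makes the gap visible (table and column names hypothetical):

```shell
# Scans every column of the table -- expensive even with LIMIT 1.
bq query --use_legacy_sql=false --dry_run \
  'SELECT * FROM mydataset.t LIMIT 1'
# Scans only column x -- often a small fraction of the table.
bq query --use_legacy_sql=false --dry_run \
  'SELECT SUM(x) FROM mydataset.t'
```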

To answer the question, you should learn more about how BigQuery bills for usage.
