Propagation of BigQuery partition/cluster keys to a CTE table - Performance

I set up a persistent table in our BigQuery database (using Looker, if that's relevant). The table has both a partition_key and a few cluster_keys. I partition on time, then cluster on my primary key (made with GENERATE_UUID), plus the two major fields that users will search on.
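For concreteness, a table laid out this way might be declared roughly as follows in plain BigQuery DDL. This is only a sketch: the column names (created_at, id, field_a, field_b) are hypothetical stand-ins, and Looker generates its own DDL for persistent tables, so the real statement will differ.

```sql
-- Hypothetical schema: created_at, id, field_a and field_b are illustrative
-- names, not the real columns of the table in question.
CREATE TABLE `server.data.prefix_my_table_persist`
(
  id         STRING    NOT NULL,  -- assumed to be filled with GENERATE_UUID() at load time
  created_at TIMESTAMP NOT NULL,  -- the time column the table is partitioned on
  field_a    STRING,              -- first field users search on
  field_b    STRING               -- second field users search on
)
PARTITION BY DATE(created_at)     -- one partition per day
CLUSTER BY id, field_a, field_b;  -- BigQuery allows at most four clustering columns
```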

I then have a CTE that the rest of the queries pull data from. This CTE selects a subset of the persistent table (the one with the partition and cluster keys), but the CTE is not itself persistent, so I don't think I can include partition and cluster keys in it. It looks like this:

WITH my_table_pre_exclusion AS (
    SELECT
        *
    FROM
        `server.data.prefix_my_table_persist`
    WHERE
        (
            ( -- Some filter here
            ) -- AND ... some filter here
        )
)

My question is: does pulling from this CTE (which pre-applies a bunch of filters) hurt performance when I later do a ton of joins involving fields that ARE in the partition or cluster keys?

Would it be more performant to skip this CTE, pull directly from the persistent table in all my downstream joins, and then re-apply the filters (which apply to everything downstream)? It would mean a lot more code bloat. I did some benchmarking, and I think the CTE is hurting performance, but I'm not really sure.

Is there a "best of both worlds" approach where I don't have to apply the same filters to a ton of downstream tables, but I still get optimal performance? Maybe inner join my_table_pre_exclusion to all the downstream tables after the fact?

Posting my own answer to this, though I'd be happy for anybody else to elaborate, as I could only find very sparse documentation on this.

I was able to get some info from a helpful BigQuery expert: what I'm asking about is something called "Predicate Pushdown", which BigQuery recently added support for.

I'm still trying to read up on the details of the support, but this does not appear to be unique to BigQuery (although I'm sure its optimizers play a huge role in overall performance). You can read a little about it here: https://modern-sql.com/feature/with/performance#predicate-pushdown
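To make the idea concrete, here is a minimal sketch of what predicate pushdown would mean for a CTE like mine. The columns created_at (assumed to be the partition column), id, and field_a are hypothetical, and the literal filter stands in for the real, elided filters. Whether BigQuery actually pushes a given predicate down is something to verify per query rather than assume.

```sql
-- Hypothetical columns: created_at (partition column), id, field_a.
-- The filter on the partition column sits OUTSIDE the CTE.
WITH my_table_pre_exclusion AS (
  SELECT *
  FROM `server.data.prefix_my_table_persist`
  WHERE field_a = 'some_value'   -- stands in for the shared filters
)
SELECT t.id, t.field_a
FROM my_table_pre_exclusion AS t
WHERE t.created_at >= TIMESTAMP '2024-01-01'
  AND t.created_at <  TIMESTAMP '2024-02-01';

-- If the outer predicate is pushed down into the CTE, the query above should
-- scan the same partitions as this hand-inlined form:
SELECT id, field_a
FROM `server.data.prefix_my_table_persist`
WHERE field_a = 'some_value'
  AND created_at >= TIMESTAMP '2024-01-01'
  AND created_at <  TIMESTAMP '2024-02-01';
```

If both forms report the same bytes processed, the CTE is not costing anything in partition pruning for that query shape.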

The bottom line is that if BigQuery's support is sufficient for the query I'm running, then queries on subqueries will be executed efficiently using the partition/cluster keys. I read some docs from the initial release saying that it might only work with the date-based partition key, but support may have expanded since then. The general topic of "
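One way to check this rather than guess is to run both variants of a query (through the CTE, and directly against the persistent table) and compare what each job actually scanned. The sketch below reads recent jobs from the INFORMATION_SCHEMA.JOBS_BY_USER view; the `region-us` qualifier is an assumption and should be replaced with the region the dataset lives in.

```sql
-- Compare bytes scanned and slot time for my most recent queries.
-- `region-us` is an assumption: use the region of your own dataset.
SELECT
  job_id,
  creation_time,
  cache_hit,               -- TRUE means results came from cache, so bytes are not comparable
  total_bytes_processed,   -- lower is better; reflects partition and cluster pruning
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC
LIMIT 10;
```

For a fair comparison, disable cached results when running the two variants; otherwise the second run may report zero bytes regardless of how the query is written.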


 