Refactor SQL Query, using CTE vs SubQuery

I am creating a dataset from an S3 bucket and am trying to improve the performance of my query. The two approaches I have so far both work, but I would like to see a better query and improve my SQL skills. Sorry that there is no sample dataset to work with; I have not figured out a practical way to provide mock data when pulling from .json files in S3.

QUERY #1

WITH block_1 AS (
    SELECT
        VALUE:COL1 AS COL1,
        VALUE:COL2 AS COL2,
        VALUE:COL3 AS COL3,
        VALUE:COL4 AS COL4
    FROM '@S3_BUCKET/',
         LATERAL FLATTEN(input => $1:value)
),
block_2 AS (
    -- Highest COL4 per COL1
    SELECT
        VALUE:COL1 AS COL1,
        MAX(VALUE:COL4) AS MaxCOL4
    FROM '@S3_BUCKET/',
         LATERAL FLATTEN(input => $1:value)
    GROUP BY COL1
)
-- Keep only the rows of block_1 that carry the maximum COL4 for their COL1
SELECT a.COL1, a.COL2, a.COL3, a.COL4
FROM block_1 AS a
JOIN block_2 AS b
    ON a.COL1 = b.COL1 AND a.COL4 = b.MaxCOL4;

I felt QUERY #2 was an improvement, especially because you do not need to list the columns you want in the final SELECT statement (as I did above):

SELECT a.*
FROM (
    SELECT
        VALUE:COL1 AS COL1,
        VALUE:COL2 AS COL2,
        VALUE:COL3 AS COL3,
        VALUE:COL4 AS COL4
    FROM '@S3_BUCKET/',
         LATERAL FLATTEN(input => $1:value)
) a
JOIN (
    SELECT COL1, MAX(COL4) AS COL4
    FROM (
        SELECT
            VALUE:COL1 AS COL1,
            VALUE:COL2 AS COL2,
            VALUE:COL3 AS COL3,
            VALUE:COL4 AS COL4
        FROM '@S3_BUCKET/',
             LATERAL FLATTEN(input => $1:value)
    )
    GROUP BY COL1
) b
    ON a.COL1 = b.COL1 AND a.COL4 = b.COL4;

The two queries above are my current attempts. Is there a way to make this query better? The other route I considered was using WHERE ... IN with the list of COL1 values, but then I would still have to hit S3 twice, as in the queries above.

You should be able to use window functions, specifically RANK(), to simplify this query:

WITH block_1 AS (
    SELECT 
    VALUE:COL1 AS COL1, 
    VALUE:COL2 AS COL2, 
    VALUE:COL3 AS COL3,
    VALUE:COL4 AS COL4,
    RANK() OVER (PARTITION BY VALUE:COL1 ORDER BY VALUE:COL4 DESC) AS rk
    FROM '@S3_BUCKET/', 
     lateral flatten( input => $1:value)
)
SELECT COL1, COL2, COL3, COL4
FROM block_1
WHERE rk = 1

This can be simplified further with Snowflake's QUALIFY clause, which lets you filter on a window function by its alias, in what is effectively a HAVING clause for window functions:

SELECT 
    VALUE:COL1 AS COL1, 
    VALUE:COL2 AS COL2, 
    VALUE:COL3 AS COL3,
    VALUE:COL4 AS COL4,
    RANK() OVER (PARTITION BY VALUE:COL1 ORDER BY VALUE:COL4 DESC) AS rk
FROM '@S3_BUCKET/', 
     lateral flatten( input => $1:value)
QUALIFY rk = 1

@nick: Use QUALIFY; it acts as a WHERE filter on the window function result, so set it = 1. Also replace RANK with ROW_NUMBER if you want exactly one row per COL1, since RANK returns every row tied for the maximum. Does that make sense?
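The suggestion in the comment could be sketched roughly as follows. This is only an illustration that reuses the stage and column names from the question; it has not been run against the asker's data. It assumes one row per COL1 is wanted. If rows tied on the maximum COL4 should all be returned, as the original join queries do, keep RANK() instead.

```sql
-- Sketch only: same stage ('@S3_BUCKET/') and columns as the queries above.
SELECT
    VALUE:COL1 AS COL1,
    VALUE:COL2 AS COL2,
    VALUE:COL3 AS COL3,
    VALUE:COL4 AS COL4
FROM '@S3_BUCKET/',
     LATERAL FLATTEN(input => $1:value)
-- The window function is evaluated inside QUALIFY, so no helper column (rk)
-- appears in the result. ROW_NUMBER() gives 1 to exactly one row per COL1;
-- when several rows tie on COL4, which one is kept is not deterministic
-- unless a tie-breaking column is added to the ORDER BY.
QUALIFY ROW_NUMBER() OVER (PARTITION BY VALUE:COL1 ORDER BY VALUE:COL4 DESC) = 1;
```

Either way, the stage is referenced only once in the query text, which is the main difference from QUERY #1 and QUERY #2.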
