Understanding explain while using CTEs - trying to get a query to compute
I've been wrestling with a query and have experimented with several variations to get my desired result, but without success. I'm hoping that if I share the variations I have tried, along with the explain output for each, someone might have a pointer.
Postgres 11.6.
For the code blocks below, dimension1 is a field that exists on all the tables I am referencing. date only appears in the sessions table, so to pull data for a particular date, I create a CTE filter_sessions to get only the dimension1 values that appear on a given date, then join it to my other tables. This lets my query select data for a particular date, in this case February 6th.
Here was my initial attempt. It uses CTEs, which I prefer for readability, and I could get away with writing less code if it would just run, which it does not:
with
filter_sessions as (
select
dimension1,
dimension2,
date,
channel_grouping,
device_category,
user_type
from ga_flagship_ecom.sessions
where date >= '2020-02-06'
and date <= '2020-02-06'
),
ee as (
select
e.dimension1,
e.dimension3,
case when sum(case when e.metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product, -- roll up to event level
-- approximation for inferring if the product is a download and hence sees all the checkout steps
case when sum(case when lower(product_name) ~ 'digital|download|file' then 1 else 0 end) > 0 then 1 else 0 end as download
from ga_flagship_ecom.ecom e
join filter_sessions f on f.dimension1 = e.dimension1
group by 1,2
),
ecom_events as (
select
ev.dimension1,
ev.dimension3,
ev.event_action,
ev.event_label,
ee.zero_val_product,
ee.download
from ga_flagship_ecom.events ev
join ee on ee.dimension1 = ev.dimension1 and ee.dimension3 = ev.dimension3
where ev.event_category = 'ecom'
)
select
s.date,
lower(s.channel_grouping) as channel_grouping,
lower(s.device_category) as device_category,
lower(s.user_type) as user_type,
lower(ev.event_action) as event_action,
lower(coalesce(ev.event_label, 'na')) as event_label,
ev.zero_val_product,
ev.download,
count(distinct s.dimension1) as sessions,
count(distinct s.dimension2) as daily_users
from filter_sessions s
join ecom_events ev on ev.dimension1 = s.dimension1
group by 1,2,3,4,5,6,7,8;
Here is what the explain output of this query looks like:
GroupAggregate (cost=222818.83..222818.88 rows=1 width=188)
Group Key: s.date, (lower((s.channel_grouping)::text)), (lower((s.device_category)::text)), (lower((s.user_type)::text)), (lower((ev.event_action)::text)), (lower((COALESCE(ev.event_label, 'na'::character varying))::text)), ev.zero_val_product, ev.download
CTE filter_sessions
-> Index Scan using sessions_date_idx on sessions (cost=0.56..2.78 rows=1 width=76)
Index Cond: ((date >= '2020-02-06'::date) AND (date <= '2020-02-06'::date))
CTE ee
-> GroupAggregate (cost=47604.61..47606.29 rows=48 width=38)
Group Key: e.dimension1, e.dimension3
-> Sort (cost=47604.61..47604.73 rows=48 width=51)
Sort Key: e.dimension1, e.dimension3
-> Nested Loop (cost=0.56..47603.27 rows=48 width=51)
-> CTE Scan on filter_sessions f (cost=0.00..0.02 rows=1 width=32)
-> Index Scan using ecom_dimension1_idx on ecom e (cost=0.56..47602.77 rows=48 width=51)
Index Cond: ((dimension1)::text = (f.dimension1)::text)
CTE ecom_events
-> Hash Join (cost=1.68..175209.67 rows=1 width=60)
Hash Cond: (((ev_1.dimension1)::text = (ee.dimension1)::text) AND (ev_1.dimension3 = ee.dimension3))
-> Seq Scan on events ev_1 (cost=0.00..150210.69 rows=3332973 width=52)
Filter: ((event_category)::text = 'ecom'::text)
-> Hash (cost=0.96..0.96 rows=48 width=48)
-> CTE Scan on ee (cost=0.00..0.96 rows=48 width=48)
-> Sort (cost=0.08..0.08 rows=1 width=236)
Sort Key: s.date, (lower((s.channel_grouping)::text)), (lower((s.device_category)::text)), (lower((s.user_type)::text)), (lower((ev.event_action)::text)), (lower((COALESCE(ev.event_label, 'na'::character varying))::text)), ev.zero_val_product, ev.download
-> Nested Loop (cost=0.00..0.07 rows=1 width=236)
Join Filter: ((s.dimension1)::text = (ev.dimension1)::text)
-> CTE Scan on filter_sessions s (cost=0.00..0.02 rows=1 width=164)
-> CTE Scan on ecom_events ev (cost=0.00..0.02 rows=1 width=104)
Someone suggested that CTE ee was my bottleneck and that I should focus on that. I tried a subquery in CTE ee rather than referencing CTE filter_sessions. So the change is:
ee as (
select
e.dimension1,
e.dimension3,
case when sum(case when e.metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product, -- roll up to event level
-- approximation for inferring if the product is a download and hence sees all the checkout steps
case when sum(case when lower(product_name) ~ 'digital|download|file' then 1 else 0 end) > 0 then 1 else 0 end as download
from ga_flagship_ecom.ecom e
--join filter_sessions f on f.dimension1 = e.dimension1
join (select dimension1 from ga_flagship_ecom.sessions where date >= '2020-02-06' and date <= '2020-02-06') f
on f.dimension1 = e.dimension1
group by 1,2
),
Here's the explain output with that small change:
GroupAggregate (cost=107619.19..107619.24 rows=1 width=188)
Group Key: s.date, (lower((s.channel_grouping)::text)), (lower((s.device_category)::text)), (lower((s.user_type)::text)), (lower((ev.event_action)::text)), (lower((COALESCE(ev.event_label, 'na'::character varying))::text)), ev.zero_val_product, ev.download
CTE filter_sessions
-> Index Scan using sessions_date_idx on sessions (cost=0.56..2.78 rows=1 width=76)
Index Cond: ((date >= '2020-02-06'::date) AND (date <= '2020-02-06'::date))
CTE ee
-> GroupAggregate (cost=47606.05..47606.08 rows=1 width=38)
Group Key: e.dimension1, e.dimension3
-> Sort (cost=47606.05..47606.05 rows=1 width=51)
Sort Key: e.dimension1, e.dimension3
-> Nested Loop (cost=1.12..47606.04 rows=1 width=51)
-> Index Only Scan using sessions_date_idx on sessions sessions_1 (cost=0.56..2.78 rows=1 width=22)
Index Cond: ((date >= '2020-02-06'::date) AND (date <= '2020-02-06'::date))
-> Index Scan using ecom_dimension1_idx on ecom e (cost=0.56..47602.77 rows=48 width=51)
Index Cond: ((dimension1)::text = (sessions_1.dimension1)::text)
CTE ecom_events
-> Nested Loop (cost=0.56..60010.25 rows=1 width=60)
-> CTE Scan on ee (cost=0.00..0.02 rows=1 width=48)
-> Index Scan using events_pk on events ev_1 (cost=0.56..60010.22 rows=1 width=52)
Index Cond: (((dimension1)::text = (ee.dimension1)::text) AND (dimension3 = ee.dimension3))
Filter: ((event_category)::text = 'ecom'::text)
-> Sort (cost=0.08..0.08 rows=1 width=236)
Sort Key: s.date, (lower((s.channel_grouping)::text)), (lower((s.device_category)::text)), (lower((s.user_type)::text)), (lower((ev.event_action)::text)), (lower((COALESCE(ev.event_label, 'na'::character varying))::text)), ev.zero_val_product, ev.download
-> Nested Loop (cost=0.00..0.07 rows=1 width=236)
Join Filter: ((s.dimension1)::text = (ev.dimension1)::text)
-> CTE Scan on filter_sessions s (cost=0.00..0.02 rows=1 width=164)
-> CTE Scan on ecom_events ev (cost=0.00..0.02 rows=1 width=104)
I am unsure how to interpret the numbers in the explain output, but for CTE ee those numbers are practically the same, so I don't think that change made much difference?
CTE ee -> GroupAggregate (cost=47606.05..47606.08 rows=1 width=38)
Either way, the query still does not complete. Other things that I have tried (all failed; the query just runs indefinitely):
Instead of an inner join, a where filter like so:
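A note on reading these numbers: in each plan node, cost=A..B is the planner's estimated startup cost and total cost in arbitrary units, rows is the estimated number of rows the node returns, and width is the estimated average row size in bytes. None of them are measured values. The sketch below, which assumes the same table and date filter as above, compares the rows=1 estimate for the sessions scan with the real count. The analyze step is only a suggestion; nothing in the post confirms that stale statistics are the cause.

```sql
-- Planner's estimate for the date filter (explain alone does not run the query)
explain
select dimension1
from ga_flagship_ecom.sessions
where date >= '2020-02-06'
  and date <= '2020-02-06';

-- Actual number of matching rows, to compare against the rows=1 estimate
select count(*)
from ga_flagship_ecom.sessions
where date >= '2020-02-06'
  and date <= '2020-02-06';

-- If the two differ widely, refreshing the planner statistics may help
analyze ga_flagship_ecom.sessions;
```

If the real count is in the thousands while the estimate is 1, the nested loops chosen in the plans above would be repeated far more often than the planner expects.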
ee as (
select
e.dimension1,
e.dimension3,
case when sum(case when e.metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product, -- roll up to event level
-- approximation for inferring if the product is a download and hence sees all the checkout steps
case when sum(case when lower(product_name) ~ 'digital|download|file' then 1 else 0 end) > 0 then 1 else 0 end as download
from ga_flagship_ecom.ecom e
--join filter_sessions f on f.dimension1 = e.dimension1
where e.dimension1 in (select dimension1 from filter_sessions)
group by 1,2
),
Here is the explain output when using a where filter instead of an inner join:
GroupAggregate (cost=222818.84..222818.89 rows=1 width=188)
Group Key: s.date, (lower((s.channel_grouping)::text)), (lower((s.device_category)::text)), (lower((s.user_type)::text)), (lower((ev.event_action)::text)), (lower((COALESCE(ev.event_label, 'na'::character varying))::text)), ev.zero_val_product, ev.download
CTE filter_sessions
-> Index Scan using sessions_date_idx on sessions (cost=0.56..2.78 rows=1 width=76)
Index Cond: ((date >= '2020-02-06'::date) AND (date <= '2020-02-06'::date))
CTE ee
-> GroupAggregate (cost=47604.63..47606.31 rows=48 width=38)
Group Key: e.dimension1, e.dimension3
-> Sort (cost=47604.63..47604.75 rows=48 width=51)
Sort Key: e.dimension1, e.dimension3
-> Nested Loop (cost=0.58..47603.29 rows=48 width=51)
-> HashAggregate (cost=0.02..0.03 rows=1 width=32)
Group Key: (filter_sessions.dimension1)::text
-> CTE Scan on filter_sessions (cost=0.00..0.02 rows=1 width=32)
-> Index Scan using ecom_dimension1_idx on ecom e (cost=0.56..47602.77 rows=48 width=51)
Index Cond: ((dimension1)::text = (filter_sessions.dimension1)::text)
CTE ecom_events
-> Hash Join (cost=1.68..175209.67 rows=1 width=60)
Hash Cond: (((ev_1.dimension1)::text = (ee.dimension1)::text) AND (ev_1.dimension3 = ee.dimension3))
-> Seq Scan on events ev_1 (cost=0.00..150210.69 rows=3332973 width=52)
Filter: ((event_category)::text = 'ecom'::text)
-> Hash (cost=0.96..0.96 rows=48 width=48)
-> CTE Scan on ee (cost=0.00..0.96 rows=48 width=48)
-> Sort (cost=0.08..0.08 rows=1 width=236)
Sort Key: s.date, (lower((s.channel_grouping)::text)), (lower((s.device_category)::text)), (lower((s.user_type)::text)), (lower((ev.event_action)::text)), (lower((COALESCE(ev.event_label, 'na'::character varying))::text)), ev.zero_val_product, ev.download
-> Nested Loop (cost=0.00..0.07 rows=1 width=236)
Join Filter: ((s.dimension1)::text = (ev.dimension1)::text)
-> CTE Scan on filter_sessions s (cost=0.00..0.02 rows=1 width=164)
-> CTE Scan on ecom_events ev (cost=0.00..0.02 rows=1 width=104)
I then tried to split CTE ee into two parts like so:
ee_base as (
select
e.dimension1,
e.dimension3,
e.metric1,
lower(product_name) as product_name
from ga_flagship_ecom.ecom e
join filter_sessions f on f.dimension1 = e.dimension1
),
ee as (
select
dimension1,
dimension3,
case when sum(case when metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product, -- roll up to event level
-- approximation for inferring if the product is a download and hence sees all the checkout steps
case when sum(case when product_name ~ 'digital|download|file' then 1 else 0 end) > 0 then 1 else 0 end as download
from ee_base
group by 1,2
),
This also failed (I was really optimistic this was going to work). Here is the explain output of this attempt:
GroupAggregate (cost=222818.33..222818.38 rows=1 width=188)
Group Key: s.date, (lower((s.channel_grouping)::text)), (lower((s.device_category)::text)), (lower((s.user_type)::text)), (lower((ev.event_action)::text)), (lower((COALESCE(ev.event_label, 'na'::character varying))::text)), ev.zero_val_product, ev.download
CTE filter_sessions
-> Index Scan using sessions_date_idx on sessions (cost=0.56..2.78 rows=1 width=76)
Index Cond: ((date >= '2020-02-06'::date) AND (date <= '2020-02-06'::date))
CTE ee_base
-> Nested Loop (cost=0.56..47603.39 rows=48 width=66)
-> CTE Scan on filter_sessions f (cost=0.00..0.02 rows=1 width=32)
-> Index Scan using ecom_dimension1_idx on ecom e (cost=0.56..47602.77 rows=48 width=51)
Index Cond: ((dimension1)::text = (f.dimension1)::text)
CTE ee
-> HashAggregate (cost=1.68..2.40 rows=48 width=48)
Group Key: ee_base.dimension1, ee_base.dimension3
-> CTE Scan on ee_base (cost=0.00..0.96 rows=48 width=76)
CTE ecom_events
-> Hash Join (cost=1.68..175209.67 rows=1 width=60)
Hash Cond: (((ev_1.dimension1)::text = (ee.dimension1)::text) AND (ev_1.dimension3 = ee.dimension3))
-> Seq Scan on events ev_1 (cost=0.00..150210.69 rows=3332973 width=52)
Filter: ((event_category)::text = 'ecom'::text)
-> Hash (cost=0.96..0.96 rows=48 width=48)
-> CTE Scan on ee (cost=0.00..0.96 rows=48 width=48)
-> Sort (cost=0.08..0.08 rows=1 width=236)
Sort Key: s.date, (lower((s.channel_grouping)::text)), (lower((s.device_category)::text)), (lower((s.user_type)::text)), (lower((ev.event_action)::text)), (lower((COALESCE(ev.event_label, 'na'::character varying))::text)), ev.zero_val_product, ev.download
-> Nested Loop (cost=0.00..0.07 rows=1 width=236)
Join Filter: ((s.dimension1)::text = (ev.dimension1)::text)
-> CTE Scan on filter_sessions s (cost=0.00..0.02 rows=1 width=164)
-> CTE Scan on ecom_events ev (cost=0.00..0.02 rows=1 width=104)
Something that does work is creating a temp table. But I really want to find a way around that and figure this out, in order of preference:
Are there any other things that I can do here?
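For context, the temp-table workaround mentioned above might look roughly like this. The table name filter_sessions_tmp and the index are hypothetical; the post does not show the actual statements that were used.

```sql
-- Hypothetical temp table holding only the sessions for the chosen date
create temp table filter_sessions_tmp as
select
  dimension1,
  dimension2,
  date,
  channel_grouping,
  device_category,
  user_type
from ga_flagship_ecom.sessions
where date >= '2020-02-06'
  and date <= '2020-02-06';

-- Optional index on the join key (an assumption, not from the post)
create index on filter_sessions_tmp (dimension1);

-- Collect statistics so the planner sees the real row count
analyze filter_sessions_tmp;
```

The rest of the query would then join to filter_sessions_tmp instead of the CTE. Autovacuum does not analyze temp tables, so the explicit analyze is what gives the planner real row counts.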
You can simply rewrite the CTEs as temp views, which are inlined into the main query plan. (In Postgres 11, each CTE is planned and materialized separately, so the planner cannot push conditions into it.)
CREATE TEMP VIEW filter_sessions as
select
dimension1,
dimension2,
date as zdate, -- the question's column is named date; aliased to match s.zdate in the final select
channel_grouping,
device_category,
user_type
from ga_flagship_ecom.sessions
where date >= '2020-02-06'
and date <= '2020-02-06'
;
CREATE TEMP VIEW ee as
select
e.dimension1,
e.dimension3,
case when sum(case when e.metric1 = 0 then 1 else 0 end) > 0 then 1 else 0 end as zero_val_product, -- roll up to event level
-- approximation for inferring if the product is a download and hence sees all the checkout steps
case when sum(case when lower(product_name) ~ 'digital|download|file' then 1 else 0 end) > 0 then 1 else 0 end as download
from ga_flagship_ecom.ecom e
join filter_sessions f on f.dimension1 = e.dimension1
group by 1,2
;
CREATE TEMP VIEW ecom_events as
select
ev.dimension1,
ev.dimension3,
ev.event_action,
ev.event_label,
ee.zero_val_product,
ee.download
from ga_flagship_ecom.events ev
join ee on ee.dimension1 = ev.dimension1 and ee.dimension3 = ev.dimension3
where ev.event_category = 'ecom'
;
select
s.zdate,
lower(s.channel_grouping) as channel_grouping,
lower(s.device_category) as device_category,
lower(s.user_type) as user_type,
lower(ev.event_action) as event_action,
lower(coalesce(ev.event_label, 'na')) as event_label,
ev.zero_val_product,
ev.download,
count(distinct s.dimension1) as sessions,
count(distinct s.dimension2) as daily_users
from filter_sessions s
join ecom_events ev on ev.dimension1 = s.dimension1
group by 1,2,3,4,5,6,7,8;
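One way to check that the views are planned together with the main query is to run explain on a select against them: the plan should no longer contain "CTE" sections or "CTE Scan" nodes. A minimal sketch, assuming the three temp views above were created in the current session (the shortened select list is only for illustration):

```sql
-- Inspect the plan without executing the query
explain
select
  s.zdate,
  count(distinct s.dimension1) as sessions
from filter_sessions s
join ecom_events ev on ev.dimension1 = s.dimension1
group by 1;

-- Temp views disappear at the end of the session; to drop them earlier:
drop view if exists ecom_events, ee, filter_sessions;
```

On Postgres 12 or later, a CTE can instead be declared with `as not materialized (...)` to get similar inlining without views; that syntax is not available on the 11.6 server used here.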