What is meant by STAT FUNCTION in the Teradata EXPLAIN?
When I use TOP in a SELECT to get sample data from a Teradata table, it uses a LOT more spool (and hence I often run out of spool) than using SAMPLE, for instance.
Looking at the EXPLAINs to see how the processing differs between SAMPLE and TOP, there seems to be a lot more copying of spool tables going on for TOP; but the bit I'm confused about is where it says it performs a "STAT FUNCTION" step. Can anyone explain what this step is? Below are the two EXPLAINs. The Primary Index of the table is UNIQUE PRIMARY INDEX (Customer_ID). The Teradata version is 16.10.05.03.
Explain
SELECT TOP 2000
M.Customer_ID
, M.customer_type
from ESRE.MEAS_CUST_TBL as M
WHERE M.Customer_ID is not null;
1) First, we lock ESRE.M for read on a reserved RowHash to
prevent global deadlock.
2) Next, we lock ESRE.M for read.
3) We do an all-AMPs RETRIEVE step from ESRE.M by way of an
all-rows scan with a condition of ("NOT (ESRE.M.Customer_ID IS
NULL)") into Spool 2 (all_amps), which is built locally on the
AMPs. The size of Spool 2 is estimated with high confidence to be
43,384,684 rows (1,778,772,044 bytes). The estimated time for
this step is 1.28 seconds.
4) We do an all-AMPs STAT FUNCTION step from Spool 2 by way of an
all-rows scan into Spool 5, which is built locally on the AMPs.
The result rows are put into Spool 1 (group_amps), which is built
locally on the AMPs. This step is used to retrieve the TOP 2000
rows. One AMP is randomly selected to retrieve 2000 rows.
If this step retrieves less than 2000 rows, then execute step 5.
The size is estimated with high confidence to be 2,000 rows (
94,000 bytes).
5) We do an all-AMPs STAT FUNCTION step from Spool 2 (Last Use) by
way of an all-rows scan into Spool 5 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 1 (group_amps), which is built locally on the AMPs.
This step is used to retrieve the TOP 2000 rows. The size is
estimated with high confidence to be 2,000 rows (94,000 bytes).
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
Explain
SELECT M.Customer_ID
, M.customer_type
from ESRE.MEAS_CUST_TBL as M
WHERE M.Customer_ID is not null
sample 2000;
1) First, we lock ESRE.M for read on a reserved RowHash to
prevent global deadlock.
2) Next, we lock ESRE.M for read.
3) We do an all-AMPs RETRIEVE step from ESRE.M by way of an
all-rows scan with a condition of ("NOT (ESRE.M.Customer_ID IS
NULL)") into Spool 2 (all_amps), which is built locally on the
AMPs. The size of Spool 2 is estimated with high confidence to be
43,384,684 rows (2,039,080,148 bytes). The estimated time for
this step is 1.28 seconds.
4) We do an all-AMPs SAMPLING step from Spool 2 (Last Use) by way of
an all-rows scan into Spool 1 (group_amps), which is built locally
on the AMPs. Samples are specified as a number of rows. The size
of Spool 1 is estimated with high confidence to be 2,000 rows (
94,000 bytes).
5) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
There's a common characteristic of both SAMPLE and TOP: they are executed as the last step in the EXPLAIN. Thus your WHERE condition is applied first, spooling almost all rows.
A simple workaround to avoid spooling large numbers of rows is nested sampling (a similar approach works for TOP):
select *
from
(
SELECT M.Customer_ID
, M.customer_type
from ESRE.MEAS_CUST_TBL as M
-- must be large enough to still return 2000 rows in the next step
sample 3000
) as M
WHERE M.Customer_ID is not null
sample 2000;
Now you get a fast sample step first, returning a small subset of the rows, followed by the filter on NOT NULL and the second sample. Of course, you need some knowledge of the actual data to decide on an appropriate sample size; otherwise the outer sample might not return enough rows. But it seems like you just want to examine some rows, in which case you probably don't care whether the query returns exactly 2000 rows.
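The same nesting trick can be sketched for TOP. This is an adaptation of the SAMPLE workaround above, not something from the original EXPLAINs; the inner limit of 3000 is an assumed over-fetch that, as with the SAMPLE version, must be large enough to still leave 2000 rows after the NOT NULL filter:

```sql
SELECT TOP 2000
       M.Customer_ID
     , M.customer_type
from
(
    -- assumed over-fetch: must still return 2000 rows after the outer filter
    SELECT TOP 3000
           Customer_ID
         , customer_type
    from ESRE.MEAS_CUST_TBL
) as M
WHERE M.Customer_ID is not null;
```

The inner TOP cuts the retrieve step short instead of spooling the whole table, so the expensive all-rows scan into spool only has to carry 3000 rows into the outer step.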