What is meant by STAT FUNCTION in the Teradata EXPLAIN?
When I use TOP in a SELECT to get sample data from a Teradata table, it uses a LOT more spool (and hence I often run out of spool) than using SAMPLE, for instance.
Looking at the EXPLAINs to see how the processing differs between SAMPLE and TOP, there seems to be a lot more copying of spool tables going on for TOP; but the bit I'm confused about is where it says it performs a "STAT FUNCTION" step. Can anyone explain what this step is? Below are the two EXPLAINs. The Primary Index of the table is UNIQUE PRIMARY INDEX (Customer_ID). The Teradata version is 16.10.05.03.
Explain
SELECT TOP 2000
M.Customer_ID
, M.customer_type
from ESRE.MEAS_CUST_TBL as M
WHERE M.Customer_ID is not null;
1) First, we lock ESRE.M for read on a reserved RowHash to
prevent global deadlock.
2) Next, we lock ESRE.M for read.
3) We do an all-AMPs RETRIEVE step from ESRE.M by way of an
all-rows scan with a condition of ("NOT (ESRE.M.Customer_ID IS
NULL)") into Spool 2 (all_amps), which is built locally on the
AMPs. The size of Spool 2 is estimated with high confidence to be
43,384,684 rows (1,778,772,044 bytes). The estimated time for
this step is 1.28 seconds.
4) We do an all-AMPs STAT FUNCTION step from Spool 2 by way of an
all-rows scan into Spool 5, which is built locally on the AMPs.
The result rows are put into Spool 1 (group_amps), which is built
locally on the AMPs. This step is used to retrieve the TOP 2000
rows. One AMP is randomly selected to retrieve 2000 rows.
If this step retrieves less than 2000 rows, then execute step 5.
The size is estimated with high confidence to be 2,000 rows (
94,000 bytes).
5) We do an all-AMPs STAT FUNCTION step from Spool 2 (Last Use) by
way of an all-rows scan into Spool 5 (Last Use), which is
redistributed by hash code to all AMPs. The result rows are put
into Spool 1 (group_amps), which is built locally on the AMPs.
This step is used to retrieve the TOP 2000 rows. The size is
estimated with high confidence to be 2,000 rows (94,000 bytes).
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
Explain
SELECT M.Customer_ID
, M.customer_type
from ESRE.MEAS_CUST_TBL as M
WHERE M.Customer_ID is not null
sample 2000;
1) First, we lock ESRE.M for read on a reserved RowHash to
prevent global deadlock.
2) Next, we lock ESRE.M for read.
3) We do an all-AMPs RETRIEVE step from ESRE.M by way of an
all-rows scan with a condition of ("NOT (ESRE.M.Customer_ID IS
NULL)") into Spool 2 (all_amps), which is built locally on the
AMPs. The size of Spool 2 is estimated with high confidence to be
43,384,684 rows (2,039,080,148 bytes). The estimated time for
this step is 1.28 seconds.
4) We do an all-AMPs SAMPLING step from Spool 2 (Last Use) by way of
an all-rows scan into Spool 1 (group_amps), which is built locally
on the AMPs. Samples are specified as a number of rows. The size
of Spool 1 is estimated with high confidence to be 2,000 rows (
94,000 bytes).
5) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
There's a common characteristic of both SAMPLE and TOP: they are executed as the last step in the EXPLAIN. Thus your WHERE condition is applied first, spooling almost all rows.
A simple workaround to avoid spooling large numbers of rows is nested sampling (a similar approach works for TOP):
select *
from
(
SELECT M.Customer_ID
, M.customer_type
from ESRE.MEAS_CUST_TBL as M
-- must be large enough to still return 2000 rows in the next step
sample 3000
) as M
WHERE M.Customer_ID is not null
sample 2000;
Now you get a fast sample step first, returning a small subset of the rows, followed by the filter on NOT NULL and the second sample. Of course, you need some knowledge of the actual data to decide on an appropriate sample size; otherwise the outer sample might not return enough rows. But it seems like you just want to examine some rows, in which case you probably don't care whether the query returns exactly 2000 rows.
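The same nesting trick can be sketched for TOP. This is an adaptation of the SAMPLE workaround above, not something from the original EXPLAINs; the inner limit of 3000 is an assumed over-fetch that, as with the SAMPLE version, must be large enough to still leave 2000 rows after the NOT NULL filter:

```sql
SELECT TOP 2000
       M.Customer_ID
     , M.customer_type
from
(
    -- assumed over-fetch: must still return 2000 rows after the outer filter
    SELECT TOP 3000
           Customer_ID
         , customer_type
    from ESRE.MEAS_CUST_TBL
) as M
WHERE M.Customer_ID is not null;
```

The inner TOP cuts the retrieve step short instead of spooling the whole table, so the expensive all-rows scan into spool only has to carry 3000 rows into the outer step.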