How can I get a random number generated in a CTE not to change in JOIN?

The problem

I'm generating a random number for each row in a table, #Table_1, in a CTE, using this technique. I'm then joining the results of the CTE to another table, #Table_2. Instead of getting one random number for each row in #Table_1, I'm getting a new random number for every resulting row in the join!

CREATE TABLE #Table_1 (Id INT)

CREATE TABLE #Table_2 (MyId INT, ParentId INT)

INSERT INTO #Table_1
VALUES (1), (2), (3)

INSERT INTO #Table_2
VALUES (1, 1), (2, 1), (3, 1), (4, 1), (1, 2), (2, 2), (3, 2), (1, 3)


;WITH RandomCTE AS
(
    SELECT Id, (ABS(CHECKSUM(NewId())) % 5) AS RandomNumber
    FROM #Table_1
)
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId

The results

Id          MyId        RandomNumber
----------- ----------- ------------
1           1           1
1           2           2
1           3           0
1           4           3
2           1           4
2           2           0
2           3           0
3           1           3

The desired results

Id          MyId        RandomNumber
----------- ----------- ------------
1           1           1
1           2           1
1           3           1
1           4           1
2           1           4
2           2           4
2           3           4
3           1           3

What I tried

I tried to hide the logic of the random number generation from the optimizer by casting the random number to VARCHAR, but that did not work.

What I don't want to do

I'd like to avoid using a temporary table to store the results of the CTE.

How can I generate a random number for each row of a table and preserve that random number in a join without using temporary storage?

This seems to do the trick:

WITH CTE AS(
    SELECT Id, (ABS(CHECKSUM(NewId())) % 5) AS RandomNumber
    FROM #Table_1),
RandomCTE AS(
    SELECT Id,
           RandomNumber
    FROM CTE
    GROUP BY Id, RandomNumber)
SELECT *
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId;

It looks like SQL Server treats RandomNumber, when it is referenced outside the CTE, as effectively just the NEWID() expression with some additional functions wrapped around it (DB<>Fiddle), and hence it still generates a new value for each joined row. The GROUP BY clause in the second CTE therefore forces the data engine to assign RandomNumber a value so that it can perform the GROUP BY.

Per the quote in this answer:

The optimizer does not guarantee timing or number of executions of scalar functions. This is a long-established tenet. It's the fundamental 'leeway' that allows the optimizer enough freedom to gain significant improvements in query-plan execution.

If it is important for your application that the random number be evaluated once and only once, you should calculate it up front and store it in a temp table.
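A minimal sketch of that up-front approach, using the question's #Table_1 and #Table_2. The temp table name #RandomNumbers is hypothetical and does not appear in the original posts. Once the values are written to a table, the join only reads stored values, so each Id keeps a single number.

```sql
-- Assumes #Table_1 and #Table_2 from the question already exist.
-- #RandomNumbers is a hypothetical name chosen for this sketch.
SELECT Id,
       ABS(CHECKSUM(NEWID())) % 5 AS RandomNumber
INTO #RandomNumbers
FROM #Table_1;

-- The join reads the stored values; NEWID() is not evaluated again here.
SELECT r.Id, t.MyId, r.RandomNumber
FROM #RandomNumbers r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId;

DROP TABLE #RandomNumbers;
```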

Anything else is not guaranteed, and so is irresponsible to add to your application's code base: even if it works now, it may break as a result of a schema change, execution plan change, version upgrade, or CU install.

For example, Lamu's answer breaks if a unique index is added to #Table_1 (Id).
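To try this out, one could add a unique index and then rerun the GROUP BY query from the earlier answer. The index name UX_Table_1_Id is hypothetical. A plausible explanation, offered here as an assumption rather than something the answer states, is that once Id is known to be unique the optimizer can treat the grouping as redundant and remove it, so nothing forces RandomNumber to be computed once per Id.

```sql
-- Assumes #Table_1 and #Table_2 from the question already exist.
-- UX_Table_1_Id is a hypothetical index name.
CREATE UNIQUE INDEX UX_Table_1_Id ON #Table_1 (Id);

-- Same shape as the GROUP BY answer; compare RandomNumber across rows sharing an Id.
WITH CTE AS (
    SELECT Id, ABS(CHECKSUM(NEWID())) % 5 AS RandomNumber
    FROM #Table_1),
RandomCTE AS (
    SELECT Id, RandomNumber
    FROM CTE
    GROUP BY Id, RandomNumber)
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId;
```

Because the values are random, rows for the same Id can still agree by chance on any single run, so it is worth running the query several times.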

How about not using a real random number at all? Use rand() with a seed:

WITH RandomCTE AS (
      SELECT Id,
             CONVERT(INT, RAND(ROW_NUMBER() OVER (ORDER BY NEWID()) * 999999) * 5) as RandomNumber
      FROM #Table_1
     )
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE r INNER JOIN
     #Table_2 t
     ON r.Id = t.ParentId;

The seed argument to rand() is pretty awful. Values of the seed near each other produce similar initial values, which is the reason for the multiplication.

Here is the db<>fiddle.
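As a quick check of the seed behaviour described above, one could seed RAND() with consecutive integers and compare the results with and without the multiplier. This is only a sketch, and the column aliases are made up for illustration.

```sql
-- Consecutive seeds: per the answer, these results are expected to be close to one another.
SELECT RAND(1) AS Seed1, RAND(2) AS Seed2, RAND(3) AS Seed3;

-- The same seeds spread out by the 999999 multiplier used in the query above.
SELECT RAND(1 * 999999) AS Seed1,
       RAND(2 * 999999) AS Seed2,
       RAND(3 * 999999) AS Seed3;
```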
