How can I get a random number generated in a CTE not to change in JOIN?
The problem
I'm generating a random number for each row in a table #Table_1 in a CTE, using this technique. I'm then joining the results of the CTE on another table, #Table_2. Instead of getting a random number for each row in #Table_1, I'm getting a new random number for every resulting row in the join!
CREATE TABLE #Table_1 (Id INT)
CREATE TABLE #Table_2 (MyId INT, ParentId INT)

INSERT INTO #Table_1
VALUES (1), (2), (3)

INSERT INTO #Table_2
VALUES (1, 1), (2, 1), (3, 1), (4, 1), (1, 2), (2, 2), (3, 2), (1, 3)

;WITH RandomCTE AS
(
    SELECT Id, (ABS(CHECKSUM(NewId())) % 5) AS RandomNumber
    FROM #Table_1
)
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId
The results
Id MyId RandomNumber
----------- ----------- ------------
1 1 1
1 2 2
1 3 0
1 4 3
2 1 4
2 2 0
2 3 0
3 1 3
The desired results
Id MyId RandomNumber
----------- ----------- ------------
1 1 1
1 2 1
1 3 1
1 4 1
2 1 4
2 2 4
2 3 4
3 1 3
What I tried
I tried to obscure the logic of the random number generation from the optimizer by casting the random number to VARCHAR, but that did not work.
What I don't want to do
I'd like to avoid using a temporary table to store the results of the CTE.
How can I generate a random number for a table and preserve that random number in a join without using temporary storage?
This seems to do the trick:
WITH CTE AS (
    SELECT Id, (ABS(CHECKSUM(NewId())) % 5) AS RandomNumber
    FROM #Table_1
),
RandomCTE AS (
    SELECT Id,
           RandomNumber
    FROM CTE
    GROUP BY Id, RandomNumber
)
SELECT *
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId;
It looks like SQL Server is aware that, outside the CTE, RandomNumber is effectively just NEWID() with some additional functions wrapped around it (DB<>Fiddle), and hence it still generates a new value for each row of the join. The GROUP BY clause in the second CTE therefore forces the data engine to assign RandomNumber a value so it can perform the GROUP BY.
Per the quote in this answer
The optimizer does not guarantee timing or number of executions of scalar functions. This is a long-established tenet. It's the fundamental 'leeway' that allows the optimizer enough freedom to gain significant improvements in query-plan execution.
If it is important for your application that the random number be evaluated once and only once, you should calculate it up front and store it into a temp table.
Anything else is not guaranteed, and so is irresponsible to add to your application's code base: even if it works now, it may break as a result of a schema change, an execution plan change, a version upgrade, or a CU install.
For example, Lamu's answer breaks if a unique index is added to #Table_1 (Id).
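A minimal sketch of the temp-table approach recommended above, assuming the #Table_1 and #Table_2 definitions from the question. The name #RandomNumbers is made up for illustration and does not appear in the original post. Because the values are written out before the join runs, the join only reads stored numbers, so each Id keeps a single value regardless of the plan chosen.

```sql
-- Evaluate the random expression once per row of #Table_1 and persist it.
-- #RandomNumbers is a hypothetical name used only for this sketch.
SELECT Id,
       ABS(CHECKSUM(NEWID())) % 5 AS RandomNumber
INTO #RandomNumbers
FROM #Table_1;

-- The join reads the stored values; nothing is re-evaluated per joined row.
SELECT r.Id, t.MyId, r.RandomNumber
FROM #RandomNumbers r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId;

-- Clean up the intermediate table.
DROP TABLE #RandomNumbers;
```

This is the temporary storage the question hoped to avoid, but it is the variant the quoted guidance treats as reliable.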
How about not using a real random number at all? Use rand() with a seed:
WITH RandomCTE AS (
    SELECT Id,
           CONVERT(INT, RAND(ROW_NUMBER() OVER (ORDER BY NEWID()) * 999999) * 5) AS RandomNumber
    FROM #Table_1
)
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId;
The seed argument to rand() is pretty awful. Values of the seed near each other produce similar initial values, which is the reason for the multiplication.
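A quick way to see the seed behaviour described above is to compare consecutive seeds with seeds spread apart by the same multiplier. This is only an illustrative check, not part of the original answer, and the column aliases are arbitrary. Outputs are deliberately not listed here.

```sql
-- Consecutive seeds: per the note above, expect results that sit close together.
SELECT RAND(1) AS Seed1, RAND(2) AS Seed2, RAND(3) AS Seed3;

-- The same seeds scaled by the multiplier used in the query above.
SELECT RAND(1 * 999999) AS Seed1, RAND(2 * 999999) AS Seed2, RAND(3 * 999999) AS Seed3;
```

If the first row of values clusters tightly and the second does not, that is the effect the multiplication is meant to counter.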