Optimizing a T-SQL query with selection probabilities

Question

I have a table of entries from which two entries should be selected. Some entries should have a higher probability of being selected than others.

Currently I solve this with UNION ALL: I select all entries once, and then select again the entries that should have a higher probability. I then shuffle this merged set with ORDER BY NEWID() and take two entries with TOP 2.

SELECT TOP 2 EMail 
FROM (
   SELECT EMail 
   FROM dbo.Benutzer 
   UNION ALL 
   SELECT EMail 
   FROM dbo.Benutzer 
   WHERE param1 = 1 
   UNION ALL 
   SELECT EMail 
   FROM dbo.Benutzer 
   WHERE param2 = 1
) AS EMail 
ORDER BY NEWID();

Example table:

EMail           param1      param2
______________|_________|___________
Test@test.com |0        | 0             -> probability is 1 (normal)
Test1@test.com|1        | 0             -> probability is 2 (higher than 1)
Test2@test.com|1        | 0             -> probability is 2 (higher than 1)
Test3@test.com|1        | 1             -> probability is 3 (higher than 1 and 2)
Test4@test.com|1        | 0             -> probability is 2 (higher than 1)
Test5@test.com|1        | 0             -> probability is 2 (higher than 1)

If I select from this table and shuffle it first, so that each new select returns different data, the probability that a given entry comes up should follow this table.

Two different users should always come out, e.g. Test3 and Test4. However, Test and Test3 should also be possible, so it should only be a probability.

However, this query is not performant. How can this problem be solved in a performant way?

Answer 1

Updated answer:

It seems that you want to randomly choose one row per e-mail, with a higher probability for some of them. In this situation you need to number the rows for each distinct e-mail:

SELECT *
INTO Benutzer
FROM (VALUES
   ('Test@test.com',  0, 0),             
   ('Test1@test.com', 1, 0),             
   ('Test2@test.com', 1, 0),             
   ('Test3@test.com', 1, 1),             
   ('Test4@test.com', 1, 0),             
   ('Test5@test.com', 1, 0)              
) v (EMail, param1, param2)

SELECT TOP 2 Email
FROM (
   SELECT b.*, ROW_NUMBER() OVER (PARTITION BY b.Email ORDER BY a.probability DESC) AS rn
   FROM Benutzer b
   CROSS APPLY (VALUES
      (1), -- normal probability
      (
      1 + 
      CASE WHEN param1 = 1 THEN 1 ELSE 0 END + 
      CASE WHEN param2 = 1 THEN 1 ELSE 0 END
      )    -- calculated probability
   ) a (probability)
) t
WHERE t.rn = 1
ORDER BY NEWID()

Finally, you may try to number the rows randomly:

ROW_NUMBER() OVER (PARTITION BY b.Email ORDER BY NEWID()) AS rn

Original answer:

If I understand the question correctly, you simply need to calculate the probability for each row based on the desired conditions:

SELECT TOP 2 EMail
FROM (VALUES
   ('Test@test.com',  0, 0),             
   ('Test1@test.com', 1, 0),             
   ('Test2@test.com', 1, 0),             
   ('Test3@test.com', 1, 1),             
   ('Test4@test.com', 1, 0),             
   ('Test5@test.com', 1, 0)              
) Benutzer (EMail, param1, param2)
ORDER BY 
   (
   1 +
   CASE WHEN param1 = 1 THEN 1 ELSE 0 END +
   CASE WHEN param2 = 1 THEN 1 ELSE 0 END
   ) DESC,
   NEWID()

An important note (based on @GordonLinoff's comment): if NEWID() is the first ORDER BY expression, the second ORDER BY expression is effectively ignored, because NEWID() values are unique and ties never occur. So I placed NEWID() as the second ORDER BY expression (and this was the first version of the answer). But in this case the rows are not returned randomly: the rows with the highest calculated probability always come first, and NEWID() only shuffles rows that share the same calculated value.

Answer 2

You can try using a tally table and the ROW_NUMBER() function. The tally table is used to create the correct number of duplicate rows for each row, based on the given probability. ROW_NUMBER() is used to create two row numbers: one based on a sort order of NewId() (this is RN), the other based on the ordering within an e-mail address (this is RNK).

We want to select 2 rows ordered by RN, but use RNK = 1 to ensure we don't receive duplicates.
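To make the duplication step concrete, here is a minimal sketch that shows only what the tally join produces, before any row numbers are applied. It uses an inline copy of the sample data (an assumption made so that the snippet runs without the table) and counts how many copies of each e-mail the join generates.

```sql
-- Illustration only: inline copy of the sample data, so no table is required.
WITH Benutzer AS
(
    SELECT *
    FROM (VALUES
        ('Test@test.com',  0, 0),
        ('Test1@test.com', 1, 0),
        ('Test2@test.com', 1, 0),
        ('Test3@test.com', 1, 1),
        ('Test4@test.com', 1, 0),
        ('Test5@test.com', 1, 0)
    ) v (Email, Param1, Param2)
),
Tally AS
(
    SELECT Num FROM (VALUES (1), (2), (3)) t (Num)
)
-- Each e-mail appears once per unit of weight (1 + Param1 + Param2).
SELECT b.Email, COUNT(*) AS Copies
FROM Benutzer AS b
INNER JOIN Tally AS t
    ON t.Num <= 1 + b.Param1 + b.Param2
GROUP BY b.Email
ORDER BY b.Email;
```

With the sample data this yields 1 copy for Test, 3 copies for Test3 and 2 copies for each of the others, 12 rows in total. The tally table only needs as many numbers as the largest possible weight.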

SQL Fiddle

MS SQL Server 2017 Schema Setup:

CREATE TABLE Benutzer
(
    Email VARCHAR(30),
    Param1 Int,
    Param2 Int
);

INSERT INTO Benutzer
VALUES
       ('Test@test.com',  0, 0),       -- probability 1      
       ('Test1@test.com', 1, 0),       -- probability 2      
       ('Test2@test.com', 1, 0),       -- probability 2      
       ('Test3@test.com', 1, 1),       -- probability 3      
       ('Test4@test.com', 1, 0),       -- probability 2      
       ('Test5@test.com', 1, 0)        -- probability 2

Query 1:

WITH Tally AS
(
    SELECT *
    FROM (VALUES
        (1),
        (2),
        (3)
        ) Tally (Num)
),
Results AS
(
    SELECT Email, ROW_NUMBER() OVER (ORDER BY NewId()) AS RN,
           ROW_NUMBER() OVER (PARTITION BY Email ORDER BY Param1) AS Rnk
    FROM Benutzer
    INNER JOIN Tally
        ON Tally.Num <= 1 + Param1 + Param2
)
SELECT TOP 2 Email
FROM Results
WHERE Results.Rnk = 1
ORDER BY RN

Results:

|          Email |
|----------------|
| Test2@test.com |
| Test5@test.com |

Answer 3

This takes the probability into account. It uses recursion because the next step depends on the result of the previous one: a new random index is ignored if it points to the same range.

   with weighted as (
      -- weight = f (param1, param2) , using 1 + param1 + param2 here
      select Email, param1, param2, count(*) * (1 + param1 + param2) n
      from Benutzer
      group by Email, param1, param2
   ), ranges as (
     --  a range width = weight
     select Email, sum(n) over(order by Email)- n + 1 n1, sum(n) over(order by Email) n2, sum(n) over() sn
     from weighted
   ), h as (
     select 1 level, t2.*, ABS(CHECKSUM(NewId())) r
         -- , @rnd  % t2.sn + 1  k
     from ranges t2 
     where (@rnd  % t2.sn + 1) between t2.n1 and t2.n2 

     union all
     -- try next random index till the next row is a new one
     select case when t2.Email = h.Email then level else level + 1 end
         , t2.* , ABS(CHECKSUM(NewId()))
         -- ,  r % t2.sn + 1
     from ranges t2 
     join h on r % t2.sn + 1  between t2.n1 and t2.n2 
     where level <=1
   )
   -- take one row from every level
   select top(1) with ties Email, @cnt
   from h
   order by row_number() over(partition by level order by newid());

See test db<>fiddle

500 runs statistics:

Email            n     p (%)
Test@test.com    68    6.8
Test1@test.com   179   17.9
Test2@test.com   159   15.9
Test3@test.com   246   24.6
Test4@test.com   180   18.0
Test5@test.com   168   16.8

Not ideal but close.
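Statistics of this kind can be collected for any of the candidate queries with a simple loop. The following is only a sketch: #Sample and #Picks are hypothetical names invented here, and the question's UNION ALL query is used as a stand-in for the query under test, so the numbers it produces are not the ones shown above.

```sql
-- Hypothetical test harness: #Sample and #Picks are names made up for this sketch.
CREATE TABLE #Sample (Email varchar(100), Param1 int, Param2 int);

INSERT INTO #Sample (Email, Param1, Param2)
VALUES
    ('Test@test.com',  0, 0),
    ('Test1@test.com', 1, 0),
    ('Test2@test.com', 1, 0),
    ('Test3@test.com', 1, 1),
    ('Test4@test.com', 1, 0),
    ('Test5@test.com', 1, 0);

CREATE TABLE #Picks (Email varchar(100));

DECLARE @i int = 0;

WHILE @i < 500
BEGIN
    -- Stand-in for the query under test: the UNION ALL query from the question.
    INSERT INTO #Picks (Email)
    SELECT TOP (2) q.Email
    FROM (
        SELECT Email FROM #Sample
        UNION ALL
        SELECT Email FROM #Sample WHERE Param1 = 1
        UNION ALL
        SELECT Email FROM #Sample WHERE Param2 = 1
    ) AS q
    ORDER BY NEWID();

    SET @i += 1;
END;

-- n = number of times each e-mail was picked, p = share of all picks in percent.
SELECT Email,
       COUNT(*) AS n,
       CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS decimal(5, 1)) AS p
FROM #Picks
GROUP BY Email
ORDER BY n DESC;

DROP TABLE #Picks;
DROP TABLE #Sample;
```

To compare approaches, replace the SELECT inside the loop with the query being evaluated and keep the final aggregation unchanged.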

Answer 4

What's the main problem?

When using the ORDER BY NEWID() approach, the ORDER BY clause causes all records in the result set to be sorted, which can be a very expensive operation (it can use a lot of disk I/O) when the base table has a large number of records.

My suggested solution:

If we add an ID column to the base table like this:

CREATE TABLE #T
(
  ID int IDENTITY(1, 3) PRIMARY KEY,
  Email varchar(100),
  Param1 int,
  Param2 int
);

Then we can use this query to get the desired result (without duplicate values):

WITH RndRange AS
(
    SELECT MIN(ID) RangeStart, MAX(ID) + 2 RangeEnd FROM #T
)
,RndSelector AS
(
    SELECT NULL AS HitID, 0 AS HitCounter, CAST('' AS varchar(100)) AS HitIDList
    UNION ALL
    SELECT 
        NewHitID,
        HitCounter + IIF(HitID IS NULL, 0, 1), 
        CAST(HitIDList + IIF(NewHitID IS NULL, '', CONCAT(NewHitID, ',')) AS varchar(100))
    FROM 
        RndSelector 
    CROSS APPLY
    (
        SELECT ABS(CHECKSUM(NEWID()) % (RangeEnd - RangeStart + 1)) + RangeStart AS RndNum FROM RndRange
    ) C
    OUTER APPLY
    (
        SELECT 
            ID AS NewHitID 
        FROM 
            #T 
        WHERE 
            (ID BETWEEN RndNum - 2 AND RndNum) --Added just for performance
            AND
            (RndNum BETWEEN ID AND ID + Param1 + Param2) 
            AND 
            ID NOT IN (SELECT Value FROM string_split(HitIDList, ','))
    ) O
    WHERE 
        HitCounter + IIF(HitID IS NULL, 0, 1) < 2
)
SELECT * FROM #T WHERE ID IN (SELECT HitID FROM RndSelector)

How does it work?

Suppose we have these records in the table:

ID  Email   Param1  Param2
1   User1   1       1
4   User2   1       0
7   User3   0       0

In each round of the recursion, one random integer between 1 and 9 (min ID and max ID + 2) is generated. When the random number is between 1 and 3, the ID of User1 is selected (hit). If the random number is 4 or 5, User2 is selected, and when the random number is 6, nothing is selected, so execution goes on to the next round. The recursion continues until 2 different IDs are selected (HitCounter = 2).

With the help of HitIDList, an ID that has already been selected will not be selected again.
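The mapping from random number to hit can be checked directly for the three sample rows. This is an illustrative sketch only: it enumerates every possible value of RndNum from 1 to 9 instead of drawing one at random, and it reuses the range condition from the query above (RndNum BETWEEN ID AND ID + Param1 + Param2). The CTE names T and Numbers are made up for the example.

```sql
-- Illustration only: enumerate all possible random numbers for the 3-row sample.
WITH T AS
(
    SELECT *
    FROM (VALUES
        (1, 'User1', 1, 1),
        (4, 'User2', 1, 0),
        (7, 'User3', 0, 0)
    ) v (ID, Email, Param1, Param2)
),
Numbers AS
(
    SELECT RndNum
    FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9)) n (RndNum)
)
-- A NULL HitID means the number falls into a gap and the round selects nothing.
SELECT n.RndNum, t.ID AS HitID, t.Email
FROM Numbers AS n
LEFT JOIN T AS t
    ON n.RndNum BETWEEN t.ID AND t.ID + t.Param1 + t.Param2
ORDER BY n.RndNum;
```

The numbers 1 to 3 map to User1, 4 and 5 to User2, and 7 to User3, while 6, 8 and 9 hit nothing. The hits are therefore distributed 3 : 2 : 1, matching the weights 1 + Param1 + Param2, and the identity step of 3 leaves room for the largest possible weight.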

As you see, this solution requires no table sort. We just generate a few random numbers and find their related records, for which the DBMS will use index scans. So I expect a considerable performance improvement because of the reduced I/O, especially for large tables.

Answer updated:

I added the condition ID BETWEEN RndNum - 2 AND RndNum to the query and now it is blazing fast. Speed test results on a table with 1,000,000 records:

Applied Query       Test1    Test2    Test3
ORDER BY NEWID()    477ms    400ms    510ms
Random generator    2ms      3ms      2ms
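The answer does not show how the test table was filled or how the timings were taken. One possible setup, given here as an assumption rather than as the author's method, is to generate the rows from a cross join of sys.all_objects and to measure with SET STATISTICS TIME. The e-mail pattern and the random Param1/Param2 values are invented for this sketch.

```sql
-- Hypothetical timing setup; the row generator and the data pattern are assumptions.
CREATE TABLE #T
(
    ID int IDENTITY(1, 3) PRIMARY KEY,
    Email varchar(100),
    Param1 int,
    Param2 int
);

-- Generate 1,000,000 rows with random 0/1 flags.
INSERT INTO #T (Email, Param1, Param2)
SELECT TOP (1000000)
    CONCAT('user', ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), '@test.com'),
    ABS(CHECKSUM(NEWID()) % 2),
    ABS(CHECKSUM(NEWID()) % 2)
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;

-- Elapsed and CPU time are reported in the Messages output.
SET STATISTICS TIME ON;

SELECT TOP (2) Email
FROM #T
ORDER BY NEWID();

SET STATISTICS TIME OFF;

DROP TABLE #T;
```

Running the recursive query from this answer between the same SET STATISTICS TIME statements gives a comparable figure for the random generator approach.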

4 answers

Answer 1
0 accepted 2021-07-20 09:05:11

Answer 2
0 2021-07-20 21:34:26

Answer 3
0 2021-07-21 08:52:07

Answer 4
0 2021-07-21 17:29:44
