How to update each row of a table with a random row from another table

I'm building my first de-identification script, and running into issues with my approach.

I have a table dbo.pseudonyms whose firstname column is populated with 200 rows of data. Every row in this column of 200 rows has a value (none are null). This table also has an id column (int, primary key, not null) with the numbers 1-200.

What I want to do is, in one statement, re-populate my entire USERS table with firstname data randomly selected for each row from my pseudonyms table.

To generate the random number for picking, I'm using ABS(Checksum(NewId())) % 200. Every time I run SELECT ABS(Checksum(NewId())) % 200, I get a numeric value in the range I'm looking for, with no erratic behavior.

HOWEVER, when I use this formula in the following statement:

SELECT pn.firstname 
FROM DeIdentificationData.dbo.pseudonyms pn 
WHERE pn.id = ABS(Checksum(NewId())) % 200

I get VERY intermittent results. I'd say about 30% of the runs return one name picked out of the table (this is the expected result), about 30% come back with more than one result (which is baffling, since there are no duplicate id column values), and about 30% come back with NULL (even though there are no empty rows in the firstname column).

I did look for quite a while for this specific issue, but to no avail so far. I'm assuming the issue has to do with using this formula as a pointer, but I'd be at a loss how to do this otherwise.

Thoughts?

Why your query in the question returns unexpected results

Your original query selects from Pseudonyms. The server scans each row of the table, takes the ID from that row, generates a new random number, and compares the generated number to the ID.

When the number generated for a particular row happens to equal that row's ID, the row is returned in the result set. By chance, the generated number may never equal the ID (no rows are returned), or it may coincide with the ID for several rows (more than one row is returned).

A bit more detailed:

  • The server picks the row with ID=1.
  • It generates a random number, say 25. Why not? A decent random number.
  • Is 1 = 25? No => this row is not returned.
  • The server picks the row with ID=2.
  • It generates a random number, say 125. Why not? A decent random number.
  • Is 2 = 125? No => this row is not returned.
  • And so on...
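
For the single-pick query in the question, one way around the per-row evaluation described above is to generate the number once, store it in a variable, and only then filter on it. The following is a minimal sketch under stated assumptions, not a definitive fix: the table variable and its five names are hypothetical stand-ins for dbo.pseudonyms, and the bit mask replaces ABS() only because ABS() raises an overflow error in the rare case where CHECKSUM returns -2147483648. Note also that % 200 yields 0 to 199 while the ids run from 1 to 200, so the sketch adds 1.

```sql
-- Hypothetical stand-in for dbo.pseudonyms with a handful of sample rows.
DECLARE @pseudonyms TABLE (id int PRIMARY KEY, firstname varchar(50) NOT NULL);

INSERT INTO @pseudonyms (id, firstname)
VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol'), (4, 'Dave'), (5, 'Erin');

DECLARE @count int = (SELECT COUNT(*) FROM @pseudonyms);

-- NEWID() is evaluated exactly once here, before the SELECT runs.
-- The mask clears the sign bit, giving a non-negative int without ABS().
-- "% @count" yields 0..@count-1, so 1 is added to match ids 1..@count.
DECLARE @pick int = 1 + (CHECKSUM(NEWID()) & 2147483647) % @count;

-- Every row is now compared against the same fixed value,
-- so exactly one row matches (ids here are contiguous from 1).
SELECT pn.firstname
FROM @pseudonyms AS pn
WHERE pn.id = @pick;
```

This only covers picking one name. Updating a whole table needs a separate random number per target row, which the rest of this answer deals with.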

Here is a complete solution (also posted on SQL Fiddle).

Sample data

DECLARE @VarPseudonyms TABLE (ID int IDENTITY(1,1), PseudonymName varchar(50) NOT NULL);
DECLARE @VarUsers TABLE (ID int IDENTITY(1,1), UserName varchar(50) NOT NULL);

INSERT INTO @VarUsers (UserName)
SELECT TOP(1000)
    'UserName' AS UserName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;

INSERT INTO @VarPseudonyms (PseudonymName)
SELECT TOP(200)
    'PseudonymName'+CAST(ROW_NUMBER() OVER(ORDER BY sys.all_objects.object_id) AS varchar) AS PseudonymName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;

Table Users has 1000 rows, each with the same UserName. Table Pseudonyms has 200 rows with different PseudonymNames:

SELECT * FROM @VarUsers;
ID   UserName
--   --------
1    UserName
2    UserName
...
999  UserName
1000 UserName

SELECT * FROM @VarPseudonyms;
ID   PseudonymName
--   -------------
1    PseudonymName1
2    PseudonymName2
...
199  PseudonymName199
200  PseudonymName200

First attempt

At first I tried a direct approach. For each row in Users I want to get one random row from Pseudonyms:

SELECT
    U.ID
    ,U.UserName
    ,CA.PseudonymName
FROM
    @VarUsers AS U
    CROSS APPLY
    (
        SELECT TOP(1)
            P.PseudonymName
        FROM @VarPseudonyms AS P
        ORDER BY CRYPT_GEN_RANDOM(4)
    ) AS CA
;

It turns out the optimizer is too smart: this produced a random PseudonymName, but the same one for every User, which is not what I expected. Presumably the subquery is evaluated only once because it does not reference the outer row:

ID   UserName   PseudonymName
1    UserName   PseudonymName181
2    UserName   PseudonymName181
...
999  UserName   PseudonymName181
1000 UserName   PseudonymName181

So I tweaked the approach and generated a random number for each row in Users first. Then I used that number to find the Pseudonym with the matching ID for each row in Users, using CROSS APPLY.

CTE_Users adds a column holding a random number from 1 to 200. CTE_Joined picks the matching row from Pseudonyms for each User. Finally, we UPDATE the original Users table through CTE_Joined.

Final solution

WITH
CTE_Users
AS
(
    SELECT
        U.ID
        ,U.UserName
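        -- CAST(... AS int) spans -2^31..2^31-1; dividing by 2^32-1 and adding 0.5 maps it to roughly [0, 1),
        -- so rnd truncates to an integer from 1 to 200 (200 is hard-coded to match the Pseudonyms row count)
        ,1 + 200 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) AS rnd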
    FROM @VarUsers AS U
)
,CTE_Joined
AS
(
    SELECT
        CTE_Users.ID
        ,CTE_Users.UserName
        ,CA.PseudonymName
    FROM
        CTE_Users
        CROSS APPLY
        (
            SELECT P.PseudonymName
            FROM @VarPseudonyms AS P
            WHERE P.ID = CAST(CTE_Users.rnd AS int)
        ) AS CA
)
UPDATE CTE_Joined
SET UserName = PseudonymName;

Results

SELECT * FROM @VarUsers;
ID   UserName
1    PseudonymName41
2    PseudonymName132
3    PseudonymName177
...
998  PseudonymName60
999  PseudonymName141
1000 PseudonymName157
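
The solution above hard-codes 200 and relies on the Pseudonyms IDs running from 1 to 200 without gaps. If that cannot be guaranteed, a hedged variation is to number the pseudonyms with ROW_NUMBER() and to store one random position per user in a table before joining, so that the join only compares already fixed integers. This is a sketch under assumptions rather than part of the original answer: the small tables, the names @Picks and Position, and the CHECKSUM(NEWID()) formula with a sign-bit mask are all illustrative choices.

```sql
-- Hypothetical small stand-ins for the sample tables above.
DECLARE @Pseudonyms TABLE (ID int PRIMARY KEY, PseudonymName varchar(50) NOT NULL);
DECLARE @Users TABLE (ID int IDENTITY(1,1) PRIMARY KEY, UserName varchar(50) NOT NULL);

-- The IDs deliberately contain gaps.
INSERT INTO @Pseudonyms (ID, PseudonymName)
VALUES (3, 'Name3'), (7, 'Name7'), (10, 'Name10'), (42, 'Name42');

INSERT INTO @Users (UserName)
VALUES ('UserName'), ('UserName'), ('UserName'), ('UserName'), ('UserName'), ('UserName');

DECLARE @PseudonymCount int = (SELECT COUNT(*) FROM @Pseudonyms);

-- Step 1: store one random position (1..@PseudonymCount) per user.
-- Once written to a table, the value is fixed for the join below.
DECLARE @Picks TABLE (UserID int PRIMARY KEY, Position int NOT NULL);

INSERT INTO @Picks (UserID, Position)
SELECT
    U.ID
    ,1 + (CHECKSUM(NEWID()) & 2147483647) % @PseudonymCount
FROM @Users AS U;

-- Step 2: number the pseudonyms 1..N regardless of their actual IDs,
-- then join each user to the pseudonym at its stored position.
WITH CTE_Pseudonyms
AS
(
    SELECT
        P.PseudonymName
        ,ROW_NUMBER() OVER (ORDER BY P.ID) AS Position
    FROM @Pseudonyms AS P
)
UPDATE U
SET UserName = CP.PseudonymName
FROM
    @Users AS U
    INNER JOIN @Picks AS K ON K.UserID = U.ID
    INNER JOIN CTE_Pseudonyms AS CP ON CP.Position = K.Position
;

SELECT * FROM @Users;
```

The extra table costs one more write, but it keeps the random values out of the join predicate, which is where the query in the question went wrong.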

A simpler approach:

UPDATE u
SET u.FirstName = p.Name
FROM Users u
CROSS APPLY (
    SELECT TOP(1) p.Name
    FROM pseudonyms p
    WHERE u.Id IS NOT NULL -- must reference a unique identifier on Users; this ties the subquery to the outer row so a new random name is picked for each user
    ORDER BY NEWID()
) p

Full example from: https://stackoverflow.com/a/36185100/6620329

Assign a random Users id to the UpdatedBy column of each row in Table01, excluding any Users row whose id equals the Table01 row's id:

UPDATE a 
SET a.UpdatedBy=b.id
FROM [dbo].[Table01] a
CROSS APPLY (
    SELECT
      id,
      ROW_NUMBER() over(partition by 1 order by NEWID()) RN
    FROM Users b
    WHERE a.id != b.id
) b 
WHERE RN = 1
