I'm building my first de-identification script, and running into issues with my approach.
I have a table dbo.pseudonyms whose firstname column is populated with 200 rows of data. Every row in this column has a value (none are null). This table also has an id column (int, primary key, not null) with the numbers 1-200.
What I want to do is, in one statement, re-populate my entire USERS table with firstname data randomly selected for each row from my pseudonyms table.
To generate the random number for picking I'm using ABS(Checksum(NewId())) % 200. Every time I do SELECT ABS(Checksum(NewId())) % 200 I get a numeric value in the range I'm looking for just fine, no intermittently erratic behavior.
HOWEVER, when I use this formula in the following statement:
SELECT pn.firstname
FROM DeIdentificationData.dbo.pseudonyms pn
WHERE pn.id = ABS(Checksum(NewId())) % 200
I get VERY intermittent results. I'd say about 30% of the results return one name picked out of the table (this is the expected result), about 30% come back with more than one result (which is baffling, there are no duplicate id column values), and about 30% come back with NULL (even though there are no empty rows in the firstname column).
I did look for quite a while for this specific issue, but to no avail so far. I'm assuming the issue has to do with using this formula as a pointer, but I'd be at a loss how to do this otherwise.
Thoughts?
Why your query in the question returns unexpected results
Your original query selects from Pseudonyms. The server scans through each row of the table, takes the ID from that row, generates a new random number, and compares the generated number to the ID.
When by chance the number generated for a particular row happens to equal the ID of that row, that row is returned in the result set. It is quite possible that the generated number never equals the ID for any row, so nothing is returned, or that it coincides with the ID for several rows, so several rows are returned.
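The row-by-row behaviour described above can be reproduced with a small sketch. The table variable @Demo and its generated names are hypothetical stand-ins for dbo.pseudonyms, and the sketch assumes the predicate is evaluated once per scanned row, as described above. If each of the 199 reachable IDs matches independently with probability 1/200, roughly 37% of runs match no row, 37% match exactly one, and 26% match several, which is close to the split reported in the question.

```sql
-- Hypothetical stand-in for dbo.pseudonyms: 200 rows with ids 1..200.
DECLARE @Demo TABLE (id int NOT NULL, firstname varchar(50) NOT NULL);

INSERT INTO @Demo (id, firstname)
SELECT TOP (200)
     ROW_NUMBER() OVER (ORDER BY o.object_id)
    ,'Name' + CAST(ROW_NUMBER() OVER (ORDER BY o.object_id) AS varchar(10))
FROM sys.all_objects AS o
ORDER BY o.object_id;

-- NEWID() is called again for every row that is tested, so each row is
-- compared against its own random number, not against one shared value.
-- Run this SELECT several times: the count varies between 0, 1 and more.
SELECT COUNT(*) AS MatchingRows
FROM @Demo AS d
WHERE d.id = ABS(CHECKSUM(NEWID())) % 200;
```

Note also that % 200 yields values from 0 to 199, so the row with ID 200 can never match and a generated 0 matches nothing.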
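A bit more detailed:
The server picks the row with ID=1 and generates a random number, say 25. Why not? A decent random number. Is 1 = 25? No => This row is not returned.
The server picks the row with ID=2 and generates a random number, say 125. Why not? A decent random number. Is 2 = 125? No => This row is not returned.
And so on for every row in the table.
Here is a complete solution on SQL Fiddle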
Sample data
DECLARE @VarPseudonyms TABLE (ID int IDENTITY(1,1), PseudonymName varchar(50) NOT NULL);
DECLARE @VarUsers TABLE (ID int IDENTITY(1,1), UserName varchar(50) NOT NULL);

INSERT INTO @VarUsers (UserName)
SELECT TOP(1000)
    'UserName' AS UserName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;

INSERT INTO @VarPseudonyms (PseudonymName)
SELECT TOP(200)
    'PseudonymName'+CAST(ROW_NUMBER() OVER(ORDER BY sys.all_objects.object_id) AS varchar) AS PseudonymName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;
Table Users has 1000 rows with the same UserName for each row. Table Pseudonyms has 200 rows with different PseudonymNames:
SELECT * FROM @VarUsers;
ID UserName
-- --------
1 UserName
2 UserName
...
999 UserName
1000 UserName
SELECT * FROM @VarPseudonyms;
ID PseudonymName
-- -------------
1 PseudonymName1
2 PseudonymName2
...
199 PseudonymName199
200 PseudonymName200
First attempt
At first I tried a direct approach. For each row in Users I want to get one random row from Pseudonyms:
SELECT
    U.ID
    ,U.UserName
    ,CA.PseudonymName
FROM
    @VarUsers AS U
    CROSS APPLY
    (
        SELECT TOP(1)
            P.PseudonymName
        FROM @VarPseudonyms AS P
        ORDER BY CRYPT_GEN_RANDOM(4)
    ) AS CA
;
It turns out that the optimizer is too smart: the subquery does not depend on the outer row, so this produced a random PseudonymName, but the same one for every User, which is not what I expected:
ID UserName PseudonymName
1 UserName PseudonymName181
2 UserName PseudonymName181
...
999 UserName PseudonymName181
1000 UserName PseudonymName181
So I tweaked this approach a bit and first generated a random number for each row in Users. Then I used the generated number to find the Pseudonym with this ID for each row in Users using CROSS APPLY.
CTE_Users has an extra column with a random number from 1 to 200. In CTE_Joined we pick a row from Pseudonyms for each User. Finally we UPDATE the original Users table.
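As a side note, a per-row key in the range 1 to 200 can also be produced with integer arithmetic alone. The sketch below is an alternative for illustration, not the expression used in the solution that follows; the derived table k and the column names MinKey, MaxKey and DistinctKeys are hypothetical. It draws 1000 keys and reports their range so the bounds can be checked by eye.

```sql
-- Sketch: draw 1000 random keys and confirm they stay within 1..200.
-- % binds tighter than +, so this is 1 + (ABS(...) % 200).
SELECT
     MIN(k.rnd) AS MinKey
    ,MAX(k.rnd) AS MaxKey
    ,COUNT(DISTINCT k.rnd) AS DistinctKeys
FROM
(
    SELECT TOP (1000)
        1 + ABS(CHECKSUM(NEWID())) % 200 AS rnd
    FROM sys.all_objects
) AS k;
```

One caveat with this variant: if CHECKSUM ever returned the smallest int value, ABS would fail with an arithmetic overflow. That is extremely unlikely, but it is a reason some people prefer an expression that avoids ABS.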
Final solution
WITH
CTE_Users
AS
(
    SELECT
        U.ID
        ,U.UserName
        ,1 + 200 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) AS rnd
    FROM @VarUsers AS U
)
,CTE_Joined
AS
(
    SELECT
        CTE_Users.ID
        ,CTE_Users.UserName
        ,CA.PseudonymName
    FROM
        CTE_Users
        CROSS APPLY
        (
            SELECT P.PseudonymName
            FROM @VarPseudonyms AS P
            WHERE P.ID = CAST(CTE_Users.rnd AS int)
        ) AS CA
)
UPDATE CTE_Joined
SET UserName = PseudonymName;
Results
SELECT * FROM @VarUsers;
ID UserName
1 PseudonymName41
2 PseudonymName132
3 PseudonymName177
...
998 PseudonymName60
999 PseudonymName141
1000 PseudonymName157
A simpler approach:
UPDATE u
SET u.FirstName = p.Name
FROM Users u
CROSS APPLY (
    SELECT TOP(1) p.Name
    FROM pseudonyms p
    -- u.Id must be some unique identifier on Users; referencing the outer row
    -- here is what makes the subquery run again for each user
    WHERE u.Id IS NOT NULL
    ORDER BY NEWID()
) p
Full example from: https://stackoverflow.com/a/36185100/6620329
Assign a random Users id to the UpdatedBy column of Table01:
UPDATE a
SET a.UpdatedBy=b.id
FROM [dbo].[Table01] a
CROSS APPLY (
    SELECT
        id,
        ROW_NUMBER() over(partition by 1 order by NEWID()) RN
    FROM Users b
    WHERE a.id != b.id
) b
WHERE RN = 1