PROBLEM
When Dynamic Data Masking is not an option you can go for Data Shuffle.
Data Shuffle is when you keep the data of your columns but you mix them together like in a card deck.
This will allow you to have as result real data that cannot be really backtracked to the original ones.
In the example below you can see how the order of the columns name, phone and country where mixed up. Who can tell what’s Martin’s phone and country?
This obfuscation is perfect for User Acceptance Testing (UAT) environments because testers can still play with close-to-reality data.
It would be a blessing to have a query that could do that in bulk right?
SOLUTION
If you have already setup Data Classification on SQL Server 2017 or older you can now apply data shuffle in bulk thanks to this query:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
SELECT schema_name(O.schema_id) AS schema_name, O.NAME AS table_name, C.NAME AS column_name, information_type, label, rank, rank_desc ,';WITH TCE_Shuffled AS ( SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT TOP 1 COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = '''+ schema_name(O.schema_id) +''' and TABLE_NAME = '''+O.NAME+''' ) ) AS ORIG_ROWNUM ,ROW_NUMBER() OVER ( ORDER BY NEWID() ) AS NEW_ROWNUM ,* FROM ' + schema_name(O.schema_id) + '.' + O.NAME + ' ) UPDATE t1 SET t1.' + C.NAME + ' = t2.' + C.NAME + ' FROM TCE_Shuffled t1 JOIN TCE_Shuffled t2 ON t1.ORIG_ROWNUM = t2.NEW_ROWNUM;' FROM sys.sensitivity_classifications sc JOIN sys.objects O ON sc.major_id = O.object_id JOIN sys.columns C ON sc.major_id = C.object_id AND sc.minor_id = C.column_id |
And if you are on SQL Server 2016 or below you can execute this:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 |
SELECT schema_name(O.schema_id) AS schema_name ,O.NAME AS table_name ,C.NAME AS column_name ,information_type ,sensitivity_label ,';WITH TCE_Shuffled AS ( SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT TOP 1 COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = '''+ schema_name(O.schema_id) +''' and TABLE_NAME = '''+O.NAME+''' ) ) AS ORIG_ROWNUM ,ROW_NUMBER() OVER ( ORDER BY NEWID() ) AS NEW_ROWNUM ,* FROM ' + schema_name(O.schema_id) + '.' + O.NAME + ' ) UPDATE t1 SET t1.' + C.NAME + ' = t2.' + C.NAME + ' FROM TCE_Shuffled t1 JOIN TCE_Shuffled t2 ON t1.ORIG_ROWNUM = t2.NEW_ROWNUM;' FROM ( SELECT IT.major_id ,IT.minor_id ,IT.information_type ,L.sensitivity_label FROM ( SELECT major_id ,minor_id ,value AS information_type FROM sys.extended_properties WHERE NAME = 'sys_information_type_name' ) IT FULL OUTER JOIN ( SELECT major_id ,minor_id ,value AS sensitivity_label FROM sys.extended_properties WHERE NAME = 'sys_sensitivity_label_name' ) L ON IT.major_id = L.major_id AND IT.minor_id = L.minor_id ) EP JOIN sys.objects O ON EP.major_id = O.object_id JOIN sys.columns C ON EP.major_id = C.object_id AND EP.minor_id = C.column_id order by 1,2,3 |
The last column scripts out the code for you for each column you have classified.
Keep in mind that you might receive error like Cannot update identity column and in that case you need to :
set identity_insert YourTable ON- Paste your query
set identity_insert YourTable OFF
I preferred not to add this to the bulk scripts so you can choose what to do in those cases.
CONCLUSION
Is good if you need to have realistic data but keep in mind that if you apply data shuffle to columns like phone or e-mail those fields are still visible and they represent a data leak.

