Obfuscating data through Data Shuffle in Bulk

PROBLEM

When Dynamic Data Masking is not an option you can go for Data Shuffle.

Data Shuffle is when you keep the data of your columns but you mix them together like in a card deck.

This will allow you to have as result real data that cannot be really backtracked to the original ones.

In the example below you can see how the order of the columns name, phone and country where mixed up. Who can tell what’s Martin’s phone and country?

SQl Server Data Shuffle

This obfuscation is perfect for User Acceptance Testing (UAT) environments because testers can still play with close-to-reality data.

It would be a blessing to have a query that could do that in bulk right?

SOLUTION

If you have already setup Data Classification on SQL Server 2017 or older you can now apply data shuffle in bulk thanks to this query:

SELECT 
    schema_name(O.schema_id) AS schema_name,
    O.NAME AS table_name,
    C.NAME AS column_name,
    information_type,
	label,
	rank,
	rank_desc
	,';WITH TCE_Shuffled
AS (
	SELECT ROW_NUMBER() OVER (
			ORDER BY (
					SELECT TOP 1 COLUMN_NAME
					FROM INFORMATION_SCHEMA.COLUMNS
					WHERE 
					TABLE_SCHEMA = '''+ schema_name(O.schema_id) +''' and TABLE_NAME = '''+O.NAME+'''
	)
			) AS ORIG_ROWNUM
		,ROW_NUMBER() OVER (
			ORDER BY NEWID()
			) AS NEW_ROWNUM
		,*
	FROM ' + schema_name(O.schema_id) + '.' +  O.NAME + '
	)
UPDATE t1
SET t1.' + C.NAME + ' = t2.' + C.NAME + '
FROM TCE_Shuffled t1
JOIN TCE_Shuffled t2 ON t1.ORIG_ROWNUM = t2.NEW_ROWNUM;'
FROM sys.sensitivity_classifications sc
    JOIN sys.objects O
    ON  sc.major_id = O.object_id
	JOIN sys.columns C 
    ON  sc.major_id = C.object_id  AND sc.minor_id = C.column_id

And if you are on SQL Server 2016 or below you can execute this:

SELECT schema_name(O.schema_id) AS schema_name
	,O.NAME AS table_name
	,C.NAME AS column_name
	,information_type
	,sensitivity_label
	,';WITH TCE_Shuffled
AS (
	SELECT ROW_NUMBER() OVER (
			ORDER BY (
					SELECT TOP 1 COLUMN_NAME
					FROM INFORMATION_SCHEMA.COLUMNS
					WHERE 
					TABLE_SCHEMA = '''+ schema_name(O.schema_id) +''' and TABLE_NAME = '''+O.NAME+'''
	)
			) AS ORIG_ROWNUM
		,ROW_NUMBER() OVER (
			ORDER BY NEWID()
			) AS NEW_ROWNUM
		,*
	FROM ' + schema_name(O.schema_id) + '.' +  O.NAME + '
	)
UPDATE t1
SET t1.' + C.NAME + ' = t2.' + C.NAME + '
FROM TCE_Shuffled t1
JOIN TCE_Shuffled t2 ON t1.ORIG_ROWNUM = t2.NEW_ROWNUM;'
FROM (
	SELECT IT.major_id
		,IT.minor_id
		,IT.information_type
		,L.sensitivity_label
	FROM (
		SELECT major_id
			,minor_id
			,value AS information_type
		FROM sys.extended_properties
		WHERE NAME = 'sys_information_type_name'
		) IT
	FULL OUTER JOIN (
		SELECT major_id
			,minor_id
			,value AS sensitivity_label
		FROM sys.extended_properties
		WHERE NAME = 'sys_sensitivity_label_name'
		) L ON IT.major_id = L.major_id
		AND IT.minor_id = L.minor_id
	) EP
JOIN sys.objects O ON EP.major_id = O.object_id
JOIN sys.columns C ON EP.major_id = C.object_id
	AND EP.minor_id = C.column_id
order by 1,2,3

The last column scripts out the code for you for each column you have classified.

Keep in mind that you might receive error like Cannot update identity column and in that case you need to :

  1. set identity_insert YourTable ON
  2. Paste your query
  3. set identity_insert YourTable OFF

I preferred not to add this to the bulk scripts so you can choose what to do in those cases.

CONCLUSION

Is good if you need to have realistic data but keep in mind that if you apply data shuffle to columns like phone or e-mail those fields are still visible and they represent a data leak.

 

 

Related

Leave a Reply

You have to agree to the comment policy.