I’m currently working on a project that is a sort of spaghetti monster. On all sides.
The databases are 20-30 years old and have probably passed in the hands of multiple teams, they have been attached to multiple technologies; permissions were probably setup buy drunk cow-boys and as result every database developer passing by was free to create a table, drop a column and piss in the corner.
Do you know those TV realities were a cameraman visits people buried alive in a messy home. Think about these databases as something like that.
Because these databases are obviously full of sensible data I was assigned to the tedious task of apply Data Discovery and Classification to it.
For those who are not familiar with it, Data Discovery and Classification is a nice feature that was introduced on SSMS a few years ago. Microsoft for once made a quite nice and easy-to-use tool. You can fire it by doing
right click on the database > Tasks > Data Discovery and Classification > Classify Data...
On AdventureWorks2019 this is the result:
That is impressive! How could have SSMS figured out what columns need to be classified?
By the column name.
If you fire up SQL Server Profiler and run the tool again you can catch the query that generated that classification and if you have a look at it you will see that the words it’s seeking are hardcoded:
You can download a full copy of the query here: MatchingColumns
As you can see Microsoft was wise enough to add multiple languages but don’t think not even for a minute that this tool will make the work at your place. You have to work!
You have to keep in mind that this tool might have caught 50% of the columns but you still need to:
- double check if the classification is correct
- find the other 50%.
In my case I had even more problems:
- The database columns are in multiple languages and they don’t really represent the data they contain
- Each database has more than 900 tables and because there is no data preview I had to run a
SELECT *nearly a thousand time
I’ll rather drink bleach while poking my eye balls with a rusty fork than do that.
This query will help you:
- List Schemas, Tables and Columns along with a preview of the data.
- Search for a specific Schemas, Table or Column name, or for a specific word mentioned inside the data.
DECLARE @WordMatch AS NVARCHAR(50) SET @WordMatch = 'last' DROP TABLE IF EXISTS #ColumnsToDisplay SELECT ROW_NUMBER () OVER (ORDER BY tab.name) AS Iteration, s.name AS SchemaName, tab.name AS table_name, col.column_id, col.name AS column_name, t.name AS data_type, col.max_length, col.precision AS PrecisionNumber, CAST(NULL AS VARCHAR(MAX)) AS DataSample INTO #ColumnsToDisplay FROM sys.tables AS tab JOIN sys.schemas AS s ON s.schema_id = tab.schema_id JOIN sys.columns AS col ON col.object_id = tab.object_id LEFT JOIN sys.types AS t ON col.user_type_id = t.user_type_id DECLARE @Iterations INT = 0, @CurrentIteration INT = 1; SELECT @Iterations = MAX (Iteration) FROM #ColumnsToDisplay WHILE @CurrentIteration <= @Iterations BEGIN DECLARE @CurrentTableName VARCHAR(100) = '', @CurrentColumnName VARCHAR(100) = '', @DynamicQuery NVARCHAR(1000) = N'' DECLARE @Sample VARCHAR(MAX) SET @CurrentTableName = ''; SET @DynamicQuery = N''; SELECT @CurrentTableName = CONCAT (ttq.SchemaName, '.', ttq.table_name), @CurrentColumnName = ttq.column_name FROM #ColumnsToDisplay AS ttq WHERE ttq.Iteration = @CurrentIteration IF (@CurrentTableName = '') BEGIN SET @CurrentIteration += 1 CONTINUE END -- SQL Server 2019 -- SET @DynamicQuery = CONCAT (N' -- SELECT @Sample = STRING_AGG(t.ColumnData,'', '') -- FROM ( -- SELECT TOP 5 CAST(x.[', @CurrentColumnName, '] AS VARCHAR(MAX)) AS ColumnData -- FROM ', @CurrentTableName, ' AS x -- WHERE x.[', @CurrentColumnName, '] IS NOT NULL -- )t') -- SQL Server 2016 and lower where Stuff is supported SET @DynamicQuery = CONCAT (N' SELECT @Sample = STUFF((SELECT '', ''+ t.ColumnData FROM ( SELECT TOP 5 CAST(x.[', @CurrentColumnName, '] AS VARCHAR(MAX)) AS ColumnData FROM ', @CurrentTableName, ' AS x WHERE x.[', @CurrentColumnName, '] IS NOT NULL ) AS t FOR XML PATH('''')),1,1,'''')') EXECUTE sys.sp_executesql @DynamicQuery, N'@Sample VARCHAR(MAX) OUTPUT', @Sample = @Sample OUTPUT UPDATE #ColumnsToDisplay SET DataSample = @Sample WHERE Iteration = @CurrentIteration SET @CurrentIteration += 1 END SELECT ctd.Iteration, ctd.SchemaName, ctd.table_name, ctd.column_id, ctd.column_name, ctd.data_type, ctd.max_length, ctd.PrecisionNumber, ctd.DataSample FROM #ColumnsToDisplay AS ctd WHERE ctd.column_name like '%' + @WordMatch + '%'
Here an example of the query returning all Column name matching words like “last”, you can see at a glance what are the fields that you need to classify and what are false positive:
Data Discovery and Classification is a nice tool but is half done.
Microsoft is probably not going to add a preview feature for security reason (with the wrong permissions reports like this can trig data disclosure).
And is not going to add a word search because…. well, that’s why vendors charge the big bucks for that.
So now take this query and go to make money.
I hope you find the data preview useful. My mind is more relaxed now and my soul is in peace.
Some people say that data are the new oil: easy to store and easy to make money with.
Some other reply that data are the new uranium: with plenty of energy, difficult to handle, highly dangerous and hard to get rid of.
The project I’m on is more like digging for asbestos: from the moment you find it it can only get worst.