Data Discovery and Classification with Data Preview

PROBLEM

Dirty Database
Make Marie Kondo explode at 3… 2… 1…

I’m currently working on a project that is a sort of spaghetti monster. On all sides.

The databases are 20-30 years old and have probably passed in the hands of multiple teams, they have been attached to multiple technologies; permissions were probably setup buy drunk cow-boys and as result every database developer passing by was free to create a table, drop a column and piss in the corner.

Do you know those TV realities were a cameraman visits people buried alive in a messy home. Think about these databases as something like that.

Because these databases are obviously full of sensible data I was assigned to the tedious task of apply Data Discovery and Classification to it.

For those who are not familiar with it, Data Discovery and Classification is a nice feature that was introduced on SSMS a few years ago. Microsoft for once made a quite nice and easy-to-use tool. You can fire it by doing right click on the database > Tasks > Data Discovery and Classification > Classify Data...

On AdventureWorks2019 this is the result:

Data-classification-wizard

That is impressive! How could have SSMS figured out what columns need to be classified?

By the column name.

If you fire up SQL Server Profiler and run the tool again you can catch the query that generated that classification and if you have a look at it you will see that the words it’s seeking are hardcoded:

Data Classification Match Columns

You can download a full copy of the query here: MatchingColumns

As you can see Microsoft was wise enough to add multiple languages but don’t think not even for a minute that this tool will make the work at your place. You have to work!

You have to keep in mind that this tool might have caught 50% of the columns but you still need to:

  1. double check if the classification is correct
  2. find the other 50%.

In my case I had even more problems:

  • The database columns are in multiple languages and they don’t really represent the data they contain
  • Each database has more than 900 tables and because there is no data preview I had to run a SELECT * nearly a thousand time

I’ll rather drink bleach while poking my eye balls with a rusty fork than do that.

SOLUTION

This query will help you:

  • List Schemas, Tables and Columns along with a preview of the data.
  • Search for a specific Schemas, Table or Column name, or for a specific word mentioned inside the data.

 

DECLARE @WordMatch AS NVARCHAR(50) 
SET @WordMatch = 'last'

DROP TABLE IF EXISTS #ColumnsToDisplay

SELECT    ROW_NUMBER () OVER (ORDER BY tab.name) AS Iteration,
          s.name AS SchemaName,
          tab.name AS table_name,
          col.column_id,
          col.name AS column_name,
          t.name AS data_type,
          col.max_length,
          col.precision AS PrecisionNumber,
          CAST(NULL AS VARCHAR(MAX)) AS DataSample
INTO      #ColumnsToDisplay
FROM      sys.tables AS tab
JOIN      sys.schemas AS s
    ON    s.schema_id = tab.schema_id
JOIN      sys.columns AS col
    ON    col.object_id = tab.object_id
LEFT JOIN sys.types AS t
    ON    col.user_type_id = t.user_type_id
     
DECLARE @Iterations       INT = 0,
        @CurrentIteration INT = 1;

SELECT @Iterations = MAX (Iteration)
FROM   #ColumnsToDisplay

WHILE @CurrentIteration <= @Iterations
BEGIN
    DECLARE @CurrentTableName  VARCHAR(100)   = '',
            @CurrentColumnName VARCHAR(100)   = '',
            @DynamicQuery      NVARCHAR(1000) = N''
    DECLARE @Sample VARCHAR(MAX)

    SET @CurrentTableName = '';
    SET @DynamicQuery = N'';
    SELECT @CurrentTableName = CONCAT (ttq.SchemaName, '.', ttq.table_name),
           @CurrentColumnName = ttq.column_name
    FROM   #ColumnsToDisplay AS ttq
    WHERE  ttq.Iteration = @CurrentIteration

    IF (@CurrentTableName = '')
    BEGIN
        SET @CurrentIteration += 1

        CONTINUE
    END

 -- SQL Server 2019
 -- SET @DynamicQuery = CONCAT (N'
 --       SELECT @Sample = STRING_AGG(t.ColumnData,'', '') 
 --       FROM (
 --               SELECT TOP 5  CAST(x.[', @CurrentColumnName, '] AS VARCHAR(MAX)) AS ColumnData 
 --               FROM ', @CurrentTableName, ' AS x 
 --               WHERE x.[', @CurrentColumnName, '] IS NOT NULL
 --       )t')

 -- SQL Server 2016 and lower where Stuff is supported   
    SET @DynamicQuery = CONCAT (N'
    SELECT @Sample =  STUFF((SELECT '', ''+ t.ColumnData  
    FROM (
            SELECT TOP 5 CAST(x.[', @CurrentColumnName, '] AS VARCHAR(MAX)) AS ColumnData 
            FROM ', @CurrentTableName, ' AS x 
            WHERE x.[', @CurrentColumnName, '] IS NOT NULL 
    ) AS t
       FOR XML PATH('''')),1,1,'''')')
        
    EXECUTE sys.sp_executesql @DynamicQuery,
                              N'@Sample VARCHAR(MAX) OUTPUT',
                              @Sample = @Sample OUTPUT

    UPDATE #ColumnsToDisplay
    SET    DataSample = @Sample
    WHERE  Iteration = @CurrentIteration

    SET @CurrentIteration += 1
END

SELECT ctd.Iteration,
       ctd.SchemaName,
       ctd.table_name,
       ctd.column_id,
       ctd.column_name,
       ctd.data_type,
       ctd.max_length,
       ctd.PrecisionNumber,
       ctd.DataSample
FROM   #ColumnsToDisplay AS ctd 
WHERE ctd.column_name like '%' + @WordMatch + '%'

Here an example of the query returning all Column name matching words like “last”, you can see at a glance what are the fields that you need to classify and what are false positive:

Data Classification Preview

CONCLUSIONS

Data Discovery and Classification is a nice tool but is half done.

Microsoft is probably not going to add a preview feature for security reason (with the wrong permissions reports like this can trig data disclosure).

And is not going to add a word search because…. well, that’s why vendors charge the big bucks for that.

So now take this query and go to make money.

I hope you find the data preview useful. My mind is more relaxed now and my soul is in peace.

Some people say that data are the new oil: easy to store and easy to make money with.

Some other reply that data are the new uranium: with plenty of energy, difficult to handle, highly dangerous and hard to get rid of.

The project I’m on is more like digging for asbestos: from the moment you find it it can only get worst.

 

Related

One Reply to “Data Discovery and Classification with Data Preview”

Leave a Reply

You have to agree to the comment policy.