Skip to content

Pre data migration analysis : dangerous assumptions about lists

The success or failure of a data migration is invariably in the details. One of those details is how items from a drop down list (for example ‘country’) are mapped from the old to the new system. The consequences of a fault in the mapping can range from nothing at all, to migration runs crashing and a lot of people hours being spent trying to find out what went wrong before redoing the migration run.

Two common assumptions about the data to map are often at fault.

  • Assumption 1: the items in the list item drop down list on a data entry form are all the possible values.
    This is often invalid because:

    • options in lists on data entry forms are sometimes filtered according to the task being done
    • list items are sometimes “retired” so they no longer appear (common in any system used for some years)
    • there are issues with the underlying table design – see the next assumption
  • Assumption 2: the table of list items behind the scenes contains all the possible values.
    This can be invalid when there is no requirement in the table design (relationship) that the list item ID in the main table must match a value in the table of list items. This happens surprisingly often, especially when custom metadata is added to a commercial product such as a CMS. This can lead to:

    • A lot of null values – often because the list item was not in use from day one, and when it was introduced the effort of fixing existing data was “too hard” so they were left blank or null
    • list items already in use accidentally deleted

How to determine what you actually have with SQL

Assume you have tables tListItem (with fields ListItemID and ListItemName), and tMaster (with a field ListItemID, and a bunch of others that I’m ignoring for this).

tListItem        tMaster
ListItemID ListItemName   id ListItemID
1 Australia 1 1
2 New Zealand 2 2
3 England 3 22
4 Hong Kong 4 1
5 2
6 42
7 42
8 NULL
9 NULL

For all the values in use:

Select distinct ListItemID from dbo.tMaster

For all the values in use and their match in table tListItem (if existing – will display ‘null’ if not):

SELECT tInUse.ListItemID, ISNULL(dbo.tListItem.ListItemName, 'null') AS name
 FROM (SELECT DISTINCT ListItemID
 FROM dbo.tMaster) AS tInUse LEFT OUTER JOIN
 dbo.tListItem ON tInUse.ListItemID = dbo.tListItem.ListItemID
ListItemID name
NULL null
1 Australia
2 New Zealand
22 null
42 null

Much more useful is to include the number of records using each of the values. As well as giving you an idea on the practicality of manually versus automatically cleansing the data, you may also decide to retire some values if they are rarely used. The catch with this one is getting the count on null values working, hence converting to a string type (nvarchar)

For a count of each item

SELECT DISTINCT ListItemID, COUNT(ListItemID) AS NumRecords

FROM dbo.tMaster GROUP BY ListItemID

but this excludes the null values, so

SELECT DISTINCT

ISNULL(CONVERT(nvarchar(50), ListItemID), N'null') AS ListItem,

COUNT(ISNULL(CONVERT(nvarchar(50), ListItemID), N'null')) AS NumRecords
 FROM dbo.tMaster
 GROUP BY ISNULL(CONVERT(nvarchar(50), ListItemID), N'null')

But for a nice listing …

SELECT tInUse.ListItem,

ISNULL(tInUse.NumRecords, 0) AS NumRecords,

ISNULL(dbo.tListItem.ListItemName, N'') AS ListItemName
 FROM

(SELECT DISTINCT
 ISNULL(CONVERT(nvarchar(50), ListItemID), N'null')

AS ListItem,

COUNT(ISNULL(CONVERT(nvarchar(50), ListItemID), N'null'))

AS NumRecords,
 ListItemID
 FROM dbo.tMaster
 GROUP BY ListItemID,

ISNULL(CONVERT(nvarchar(50), ListItemID), N'null'))

AS tInUse
 LEFT OUTER JOIN
 dbo.tListItem ON tInUse.ListItemID

= dbo.tListItem.ListItemID
ListItem NumRecords ListItemName
null 2
1 2 Australia
2 2 New Zealand
22 1
42 2

I like to set these up in my Metadata Viewer so I can easily recheck after data cleansing.

Leave a Comment





This site uses Akismet to reduce spam. Learn how your comment data is processed.