Pre-data-migration analysis: dangerous assumptions about lists
The success or failure of a data migration invariably comes down to the details. One of those details is how items from a drop-down list (for example ‘country’) are mapped from the old system to the new. The consequences of a fault in the mapping range from nothing at all to migration runs crashing, with many person-hours spent working out what went wrong before the run can be repeated.
Two common assumptions about the data to map are often at fault.
- Assumption 1: the items in the drop-down list on a data entry form are all the possible values.
  This is often invalid because:
  - options in lists on data entry forms are sometimes filtered according to the task being done
  - list items are sometimes “retired” so they no longer appear (common in any system used for some years)
  - there are issues with the underlying table design – see the next assumption
- Assumption 2: the table of list items behind the scenes contains all the possible values.
  This can be invalid when there is no requirement in the table design (a relationship, or foreign key) that the list item ID in the main table must match a value in the table of list items; a sketch of that missing constraint follows this list. This happens surprisingly often, especially when custom metadata is added to a commercial product such as a CMS. It can lead to:
  - a lot of null values – often because the list item was not in use from day one, and when it was introduced the effort of fixing existing data was “too hard”, so existing records were left blank or null
  - list items already in use being accidentally deleted
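To make the missing relationship concrete, here is a minimal sketch of the kind of foreign key constraint that is often absent, written against the example tListItem and tMaster tables introduced below (the constraint name is just for illustration):

    -- A sketch of the relationship that Assumption 2 takes for granted.
    -- With this in place, tMaster cannot reference a list item that does not
    -- exist, and a list item that is in use cannot be deleted.
    -- (Adding it to a table that already holds orphaned values will fail
    --  until those values are cleaned up.)
    ALTER TABLE dbo.tMaster
        ADD CONSTRAINT FK_tMaster_tListItem
        FOREIGN KEY (ListItemID) REFERENCES dbo.tListItem (ListItemID);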
How to determine what you actually have with SQL
Assume you have tables tListItem (with fields ListItemID and ListItemName) and tMaster (with a field ListItemID, and a bunch of other fields that I’m ignoring here). The sample data used below is:
tListItem:

| ListItemID | ListItemName |
| --- | --- |
| 1 | Australia |
| 2 | New Zealand |
| 3 | England |
| 4 | Hong Kong |

tMaster:

| id | ListItemID |
| --- | --- |
| 1 | 1 |
| 2 | 2 |
| 3 | 22 |
| 4 | 1 |
| 5 | 2 |
| 6 | 42 |
| 7 | 42 |
| 8 | NULL |
| 9 | NULL |
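If you want to run the queries below yourself, a setup script along these lines reproduces the sample data (a sketch: the data types, lengths and primary keys are assumptions, not taken from any particular system):

    -- Sample data for the examples below (types and keys are assumed)
    CREATE TABLE dbo.tListItem (
        ListItemID   int          NOT NULL PRIMARY KEY,
        ListItemName nvarchar(50) NOT NULL
    );

    CREATE TABLE dbo.tMaster (
        id         int NOT NULL PRIMARY KEY,
        ListItemID int NULL      -- deliberately no foreign key to tListItem
    );

    INSERT INTO dbo.tListItem (ListItemID, ListItemName)
    VALUES (1, N'Australia'), (2, N'New Zealand'), (3, N'England'), (4, N'Hong Kong');

    INSERT INTO dbo.tMaster (id, ListItemID)
    VALUES (1, 1), (2, 2), (3, 22), (4, 1), (5, 2), (6, 42), (7, 42), (8, NULL), (9, NULL);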
For all the values in use:
    SELECT DISTINCT ListItemID FROM dbo.tMaster
For all the values in use and their match in tListItem (where one exists; ‘null’ is displayed where it doesn’t):

    SELECT tInUse.ListItemID, ISNULL(dbo.tListItem.ListItemName, 'null') AS name
    FROM (SELECT DISTINCT ListItemID FROM dbo.tMaster) AS tInUse
    LEFT OUTER JOIN dbo.tListItem ON tInUse.ListItemID = dbo.tListItem.ListItemID
| ListItemID | name |
| --- | --- |
| NULL | null |
| 1 | Australia |
| 2 | New Zealand |
| 22 | null |
| 42 | null |
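If you only want the problem values (the ones in use that have no match in tListItem), one way, sketched with the same kind of join, is to keep just the rows with no match:

    -- Only the values in use with no matching list item.
    -- Note: records where ListItemID is NULL will also appear here.
    SELECT DISTINCT m.ListItemID
    FROM dbo.tMaster AS m
    LEFT OUTER JOIN dbo.tListItem AS li ON m.ListItemID = li.ListItemID
    WHERE li.ListItemID IS NULL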
Much more useful is to include the number of records using each of the values. As well as giving you an idea of the practicality of manual versus automatic data cleansing, it may prompt you to retire values that are rarely used. The catch is getting the count of null values to work, hence the conversion to a string type (nvarchar) below.
For a count of each item:

    SELECT DISTINCT ListItemID, COUNT(ListItemID) AS NumRecords
    FROM dbo.tMaster
    GROUP BY ListItemID
but COUNT(ListItemID) skips nulls, so the null group comes back with a count of zero. Converting ListItemID to a string first gets the nulls counted:

    SELECT DISTINCT ISNULL(CONVERT(nvarchar(50), ListItemID), N'null') AS ListItem,
           COUNT(ISNULL(CONVERT(nvarchar(50), ListItemID), N'null')) AS NumRecords
    FROM dbo.tMaster
    GROUP BY ISNULL(CONVERT(nvarchar(50), ListItemID), N'null')
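An alternative worth knowing about (a sketch, not what the listing query below uses): COUNT(*) counts rows rather than non-null values, so it counts the null group correctly without any string conversion:

    -- Alternative: COUNT(*) counts rows, so the NULL group is counted too
    SELECT ListItemID, COUNT(*) AS NumRecords
    FROM dbo.tMaster
    GROUP BY ListItemID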
But for a nice listing …
    SELECT tInUse.ListItem,
           ISNULL(tInUse.NumRecords, 0) AS NumRecords,
           ISNULL(dbo.tListItem.ListItemName, N'') AS ListItemName
    FROM (SELECT DISTINCT ISNULL(CONVERT(nvarchar(50), ListItemID), N'null') AS ListItem,
                 COUNT(ISNULL(CONVERT(nvarchar(50), ListItemID), N'null')) AS NumRecords,
                 ListItemID
          FROM dbo.tMaster
          GROUP BY ListItemID, ISNULL(CONVERT(nvarchar(50), ListItemID), N'null')) AS tInUse
    LEFT OUTER JOIN dbo.tListItem ON tInUse.ListItemID = dbo.tListItem.ListItemID
| ListItem | NumRecords | ListItemName |
| --- | --- | --- |
| null | 2 | |
| 1 | 2 | Australia |
| 2 | 2 | New Zealand |
| 22 | 1 | |
| 42 | 2 | |
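You may also want to see list items that are defined but never used, since they can be candidates for retiring as part of the migration. A variation on the same query with a full outer join (again only a sketch, reusing the table names above) would be:

    -- Also show list items with no usage at all (NumRecords = 0)
    SELECT COALESCE(CONVERT(nvarchar(50), dbo.tListItem.ListItemID), tInUse.ListItem) AS ListItem,
           ISNULL(tInUse.NumRecords, 0) AS NumRecords,
           ISNULL(dbo.tListItem.ListItemName, N'') AS ListItemName
    FROM (SELECT ISNULL(CONVERT(nvarchar(50), ListItemID), N'null') AS ListItem,
                 COUNT(*) AS NumRecords,
                 ListItemID
          FROM dbo.tMaster
          GROUP BY ListItemID, ISNULL(CONVERT(nvarchar(50), ListItemID), N'null')) AS tInUse
    FULL OUTER JOIN dbo.tListItem ON tInUse.ListItemID = dbo.tListItem.ListItemID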
I like to set these up in my Metadata Viewer so I can easily recheck after data cleansing.