An Overview of Source Key Pathologies

I previously made the point that source systems cannot be trusted to generate good keys. In this post, I will explore the pollution you may encounter as you dip your feet into the oil-slicked ocean of source data quality.

Let us look at what can go wrong with keys:

Composite Keys

This is perhaps the most common pathology. It occurs when a programmer believes it is better to use a composite key than a single column key. This is rarely motivated by thoughts of sharding or of worries that the 64 bit integer space may run out. Normally, it is driven by nothing but laziness.

Composite keys have several problems:

  • They are difficult for junior SQL coders to use correctly, especially in join conditions and in IN / NOT IN filter clauses. Wrong usage may lead to wrong results that go unnoticed (see the sketch at the end of this section).
  • They typically perform much worse than single column keys. We care about this in a data warehouse.
  • It is difficult to create good statistics on composite keys to serve queries well. This can lead to poor query plans and unpredictable performance.

There ARE cases where composite keys make sense in a source system, but given the trouble they cause, the burden of proof rests on the designer of the composite key to show why it is superior to a single column key. As the ETL developer, you typically just have to accept the status quo.
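
To illustrate the first bullet, here is a minimal sketch of the join hazard. The table and column names (an order_lines table keyed on order_id plus line_no, and a staging extract stg_order_lines) are invented for the example; the point is simply that every column of the composite key must appear in the join condition, or the join silently fans out:

    -- Hypothetical tables: order_lines keyed on (order_id, line_no) and a
    -- staging extract stg_order_lines carrying the same composite key.

    -- Wrong: only part of the composite key is joined, so rows fan out
    SELECT s.order_id, s.line_no, o.amount
    FROM   stg_order_lines AS s
    JOIN   order_lines     AS o
           ON o.order_id = s.order_id;        -- line_no forgotten: duplicate rows

    -- Right: every column of the composite key appears in the join condition
    SELECT s.order_id, s.line_no, o.amount
    FROM   stg_order_lines AS s
    JOIN   order_lines     AS o
           ON  o.order_id = s.order_id
           AND o.line_no  = s.line_no;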

Duplicate Keys

Mentioned for completeness. Even though good algorithms (both distributed and single machine) for ensuring key uniqueness have existed since the early 70s, there are still source systems which have “keys” that are not real keys. If your key is not unique, it is not a key!

However, these “keys” are more common than you might think, especially when the source system delivers extracts that are the result of joins (some programmers may throw in a DISTINCT for good measure).
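
A simple way to call the bluff is to count rows per claimed key before loading the extract. A minimal sketch, with a hypothetical staging table stg_customer and its claimed key customer_id:

    -- Any row returned here proves that customer_id is not really a key
    SELECT customer_id, COUNT(*) AS row_count
    FROM   stg_customer
    GROUP  BY customer_id
    HAVING COUNT(*) > 1;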

Zombie Keys

These keys die and then rise from the grave (Christians may prefer another term). This is perhaps best illustrated with an example:

Assume an inventory system that uses EAN as a key. When a product is in stock, the product table of the inventory system contains the EAN key. If a product goes out of stock, a cleanup job may run that removes all unused EAN keys. From the perspective of the ETL developer, it looks like the key just died, and we are left with the question of what to do about it. Later, the product comes back in stock and the EAN number is inserted into the inventory product table again: the key has risen from the grave.
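
One way to make the deaths and resurrections visible is to compare each extract against a map of every key ever received. The sketch below is only an illustration of the detection, not a prescribed design; it assumes a hypothetical key_map table with an is_active flag and the current extract in stg_product_today:

    -- Keys that just died: active in the map, but missing from today's extract
    SELECT m.ean
    FROM   key_map AS m
    WHERE  m.is_active = 1
      AND  NOT EXISTS (SELECT 1 FROM stg_product_today AS s WHERE s.ean = m.ean);

    -- Keys rising from the grave: in today's extract, but previously marked dead
    SELECT s.ean
    FROM   stg_product_today AS s
    JOIN   key_map           AS m ON m.ean = s.ean
    WHERE  m.is_active = 0;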

Unfortunately, zombie keys can be hard to tell apart from…

Dolly Keys

Named after the cloned sheep, because sheep are stupid and clones are hard to tell apart, even though they are two different entities. Dolly keys happen when a source deletes a key and later re-uses it to name a different entity.

Take Social Security numbers as an example. In Denmark, where I have a social security number, this key has the format YYMMDD-XXXX, with the first six digits being my birthday and the last four making the key unique (this, by the way, includes a checksum on the last digit). The quick reader may notice that this leaves only 10,000 keys for people born on a given day (ignoring the shift of centuries). This is fast becoming a problem in Denmark, where we are now considering the use of letters in the keys, or re-using keys. This amounts to a lot of trouble for systems storing social security numbers. When it comes to keys, just use a value that is big enough to hold all instances – the 32 or 64 bit integer space should suffice for all but the most extreme cases. 32-bit values were good enough for running the Internet; they are good enough for you!
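
Continuing the EAN example from the zombie section, one possible heuristic (far from foolproof) for telling the two pathologies apart is to compare the descriptive attributes of a resurrected key with the last version we stored. The tables and columns (stg_product_today, product_history, product_name, manufacturer) are again invented for the example:

    -- A resurrected key with unchanged attributes is probably a zombie;
    -- one with different attributes is probably a dolly (a re-used key).
    SELECT s.ean,
           CASE WHEN s.product_name = h.product_name
                 AND s.manufacturer = h.manufacturer
                THEN 'zombie: same entity back in stock'
                ELSE 'dolly: key re-used for a new entity'
           END AS verdict
    FROM   stg_product_today AS s
    JOIN   product_history   AS h ON h.ean = s.ean
    WHERE  h.is_active = 0;    -- only keys we had declared dead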

Multi Keys

It happens that a source system has more than one key for the same entity. This can be the result of historical corrections and may represent a perfectly valid scenario in the source.

For example, a source may have two rows for the same customer because the duplicate entry was not discovered at the point of data entry. When the error is discovered, both keys are set to point to the same entity in the source. When we extract the data, we are faced with correcting this error in the customer dimension.
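
A common way to absorb this in the ETL layer is a key map that resolves every source key to a single surrogate, so both duplicate keys end up pointing at the same member of the customer dimension. The sketch below is a minimal, hypothetical version of such a map, not a finished design:

    -- Hypothetical map table: many source keys may resolve to one surrogate
    CREATE TABLE customer_key_map (
        source_system VARCHAR(50)  NOT NULL,
        source_key    VARCHAR(100) NOT NULL,
        surrogate_key INT          NOT NULL,
        PRIMARY KEY (source_system, source_key)
    );

    -- Both of the duplicate customer keys point at the same dimension member
    INSERT INTO customer_key_map VALUES ('CRM', 'CUST-1001', 42);
    INSERT INTO customer_key_map VALUES ('CRM', 'CUST-2417', 42);

(Ironically, the map itself carries a composite primary key; since it never leaves the ETL layer, this is arguably one of the few places where that trade-off is easy to accept.)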

Summary

In this post I have provided an overview of typical source system key pathologies. It now rests on my shoulders to describe how to handle these pathologies in the ETL code.


  1. August 12, 2011 at 09:56

    Hi Thomas

    Thanks for your blog posts, “Source Key Pathologies” and “Transforming Source Keys to Real Keys”. They are a good kicker, setting one’s mindset regarding the mapping of keys. Ever since my first Hub and Spoke implementation, key mapping has been a recurring pest, which is often neglected on every project until it springs up and bites you.

    My personal key aversion must be “Zombie Keys”. Validity oscillating over time is causing issues in most of our data warehouse and master data projects.
    The grief is caused by a number of reasons:
    • When time validity is supplied explicitly by the source and members are reactivated, periods of inactivity end up residing within the currently valid scope.
    • When we have to deduce the time interval from the data received, while sources often aren’t behaving responsibly, sequence-wise, when supplying data.
    • When data is active in one system but inactive in another.

    I guess this leads up to the conclusion that validity must be lifted away from the T1 payload of an entity and moved to a structure able to hold it as T2 information, as an intersection between one or more source systems and the entity.

    Another point to make when addressing key mapping between systems is realizing that it is not just about mapping keys. Looking at key mapping as the only tangible benefit isn’t pursuing the full benefit of addressing this issue. For instance, it is important to start addressing “Timeliness”, which is an important KPI when looking at data quality. It is not sufficient just to be able to MAP data; we must also know how fast the information was consolidated throughout our systems.

    Guldmann

  2. Thomas Kejser
    August 14, 2011 at 19:45

    Hi Jørgen

    Good to see you on my blog. Knowing your extensive skills in master data management, I look forward to more discussion here.

    You make a great point about time validity of the keys – in practice I have found that it is very hard to get that out of the source; people simply don’t know. Using the min/max validity of keys across systems doesn’t fully solve the problem – but it is one of the reasons you need a map table per source.

    I agree that validity needs to be lifted out of the “T1 payload” (good term, by the way). One of the arguments I will make later is that the T1 and T2 attributes must be separated into two tables. You could say this is “inspired” by hub and spoke – but I think that is getting history wrong. The notion of separating the “current” version of an entity from its history is an old OLTP trick that can be adapted to warehousing. I will look forward to discussing this more as the blog series moves forward.

    About the timeliness of the data: I think this would make for a great blog post of your own. Let me know if you write it and I will link to it from here.

  3. August 16, 2011 at 09:51

    Thanks for replying to my previous comment; as always, your work and blog are a key inspiration in my professional life. I couldn’t resist sharing a few remarks on mapping.

    At first glance, mapping of keys might seem trivial, but for those who have implemented a repository to hold keys, it is painfully clear that it is a particularly difficult task.

    I recall my very first implementation of a hub and spoke solution, where one of its key features was to act as a key translator. Quite quickly we came to the realization that there are almost no natural keys. It is difficult to find an attribute or set of attributes that holds the minimum of characteristics (Mandatory, Unique, Stable, and Minimum) to be labelled a valid natural key. The few we have encountered are state names, animal names, stage names (actors, films), drug names, the periodic table and, of course, dates. So where purely natural keys do not exist, artificial keys are often used; to be considered natural, they have to assume a business meaning (e.g. Customer Number: “What’s your customer number?”).

    The problem with these artificial keys is that their level isn’t clearly defined. What is a product in one system might equal a product group in another. The granularity of artificial keys and what they represent is particularly difficult to bring to a permanent and frozen understanding.

    Then there is the issue of “Stable”: how stable are your business keys? If only I had a dime for every time I have encountered a claim of “stable keys”, only to have to listen to backpedalling explanations of which conditions and under which circumstances “stable” has to be understood.

    Regarding “timeliness”, I do hope I’ll find time to blog on the KPIs addressing data quality in the near future. One remark is easily tied to these KPIs: “The easier they are to implement, the more worthless they are.” Timeliness is the most difficult KPI I have ever implemented on a repository, yet also the most rewarding.
