Archive

Posts Tagged ‘ETL’

The Information Staircase

July 1, 2012 1 comment

With the Big Data wave rolling over us these days, it seems everyone is trying to wrap their heads around how these new components fit into the overall information architecture of the enterprise.

Not only that, there are also organisational challenges in staffing the systems drinking from the big data stream. We are hearing new job titles such as “Data Scientist” being coined (the banks have had them for a long time; they call them Quants) and old names being brought back, like “Data Steward”.

While thinking of these issues, I have tried to put together a visual representation of the different architecture layers and the roles interacting with them:

[Figure: the Information Staircase – architecture layers and the roles interacting with them]

Read more…

Don’t Become a One-trick Architect

December 8, 2011 12 comments

We are near the dawn of a new workload: BigData. Some people say that “it is always darkest just before the dawn”; I beg to differ: I think it is darkest just before it goes pitch black. Have a cup of wake-up coffee and get your eyes adjusted to the new light, and to flying blind a bit, because the next couple of years are going to be really interesting.

In this post, I will share my views on where we have been, and a bit about where we are heading, in the enterprise architecture space. I will say in advance that my opinions on BigData are only just crystallising, and it is quite likely that I will adjust them and change my mind.

Read more…

Why Surrogate Keys are not Good Keys

October 22, 2011 33 comments

History tracking in warehouses is a controversial discipline. In this post, I will begin to unravel some of the apparent complexities by taking the history tracking problem apart, piece by piece.

Read more…

Guest Posts Coming up

September 20, 2011 2 comments

Lately, I have been discussing data modeling with a lot of people and there is simply so much ground to cover. Lots of exciting developments – it seems I was on to something when I started a blog about this topic.

To make faster progress, I have teamed up with Marcel Franke and Cardory Van Rij. I am very excited about this development. We will be doing some blogging together in the near future, both here on blog.kejser.org and on their individual blogs.

Based on feedback on my previous post, I will be writing about Type 2 dimensions and history tracking in the next post in the modeling series. It will be a little delayed, since I am traveling a lot over the next month.

Stay tuned…

The Big Picture – EDW/DW architecture

August 30, 2011 36 comments

Now that the cat is out of the bag on the Kimball forum, I figured it would be a good idea to present the full architecture that I will be arguing for. I was hoping to build up to it slowly, by establishing each premise on its own before moving on to the conclusion.

But perhaps it is better to start from the conclusion and then work my way down to the premises and show each one in turn.

Read more…

Physically Placing the Maps in the architecture

August 19, 2011 2 comments

Before we leave the maps behind, I need to live up to my promise of describing the storage characteristics of the tables visited during the journey through the warehouse architecture. This must include the physical location of maps. Since I believe form must follow function in a DW, let us first recall their function:

From a functional perspective, I have shown you how map tables can be used to both track and correct source system keys. Maps are not visible to the end user, but they are a necessary part of the data’s journey from the source to the final data model. Maps also provide the abstraction of, or interface to, master data sources. In the absence of such sources, the maps can even serve as a makeshift master data repository.
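
To make this concrete, here is a minimal sketch of what such a map table might look like; the table and column names are my own illustration, not taken from the post:

CREATE TABLE Map_Customer (
    SourceSystemID int         NOT NULL,  -- which source system delivered the key
    SourceKey      varchar(50) NOT NULL,  -- the key exactly as the source delivered it
    EntityKey      bigint      NOT NULL,  -- the warehouse's own key for the entity
    ValidFrom      datetime    NOT NULL,  -- when this mapping took effect
    ValidTo        datetime    NULL,      -- NULL means the mapping is still current
    CONSTRAINT PK_Map_Customer
        PRIMARY KEY (SourceSystemID, SourceKey, ValidFrom)
);

During the load, every incoming source row is joined to the map to resolve its EntityKey before it touches the final model; rows whose keys are missing from the map can be routed to an exception path instead of polluting the dimension.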

Read more…

Transforming Source Keys to Real Keys – Part 2: Using Maps To Fix Key Problems

August 15, 2011 11 comments

In part 1 of this post, I introduced the idea of map tables. These tables serve as an abstraction between the source systems and the entities in the data warehouse. In this post, I will describe how you can use the maps to correct source key errors.

Using the taxonomy of key pathologies described earlier, I will walk you through some examples of map usage.

Read more…

Transforming Source Keys to Real Keys – Part 1: Introducing Map tables

August 4, 2011 17 comments

I have ranted enough about the quality of keys, or lack thereof, in source systems. It is time to look at how we practically go about transforming the dirty source into the final DW model. What I shall propose is a de-coupling of the source entities from the data warehouse entities. As will be shown, this leads to an interesting ETL pattern that can either address, or properly encapsulate, a variety of source problems.

Read more…

An Overview of Source Key Pathologies

July 30, 2011 10 comments

I previously made the point that source systems cannot be trusted to generate good keys. In this post, I will explore the pollution you may encounter as you dip your feet into the oil-slicked ocean of data quality.

Let us look at what can go wrong with keys:

Composite Keys

This is perhaps the most common pathology. It occurs when a programmer believes it is better to use a composite key than a single column key. This is rarely motivated by thoughts of sharding, or by worries that the 64 bit integer space may run out; normally, it is driven by nothing but laziness.

Composite keys have several problems:

  • They are difficult for junior SQL coders to use, especially in join conditions and in IN / NOT IN filter clauses. Wrong usage may lead to wrong results that go unnoticed (see the sketch below).
  • They typically perform much worse than single column keys. We care about this in a data warehouse.
  • It is difficult to create good statistics on composite keys to serve queries well. This can lead to poor query plans and unpredictable performance.

There ARE cases where composite keys make sense in a source system, but given the trouble they cause, the burden of proof rests on the designer of the composite key to show why it is superior to a single column key. As the ETL developer, you typically just have to accept the status quo.
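
To illustrate the first bullet with a hypothetical pair of tables, both keyed by the composite (OrderID, LineNo): a coder who forgets one key column gets a join that silently multiplies rows, and nothing forces the mistake to surface:

-- WRONG: joining on only part of the composite key cross-matches
-- every line of an order; the query still runs and returns rows.
SELECT a.OrderID, b.Amount
FROM   OrderLines a
JOIN   OrderLineAmounts b ON b.OrderID = a.OrderID;   -- LineNo forgotten

-- Correct: every column of the composite key must appear in the join.
SELECT a.OrderID, a.LineNo, b.Amount
FROM   OrderLines a
JOIN   OrderLineAmounts b ON b.OrderID = a.OrderID
                         AND b.LineNo  = a.LineNo;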

Duplicate Keys

Mentioned for completeness. Even though good algorithms (both distributed and single machine) for ensuring key uniqueness have existed since the early 1970s, there are still source systems which have “keys” that are not real keys. If your key is not unique, it is not a key!

However, these “keys” are more common than you might think, especially when the source system delivers extracts that are the result of joins (some programmers may throw a DISTINCT in there for good measure).
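
A cheap guard against this pathology is to count before loading; a sketch against a hypothetical staging table:

-- Find "keys" in the extract that are not actually keys.
-- Table and column names are illustrative.
SELECT SourceKey, COUNT(*) AS Occurrences
FROM   Staging_Customer
GROUP BY SourceKey
HAVING COUNT(*) > 1;

Run on every extract, this turns a silent pathology into a loud one.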

Zombie Keys

These keys die and then rise from the grave (Christians may prefer another term). This is perhaps best illustrated with an example:

Assume an inventory system that uses EAN as a key. When a product is in stock, the product table of the inventory system contains the EAN key. If a product goes out of stock, a cleanup job may run that removes all unused EAN keys. From the perspective of the ETL developer, it looks like the key just died, and we are left wondering what to do about it. Later, when the product comes back in stock, the EAN number is inserted into the inventory product table again: the key has risen from the grave.
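
One way the ETL might notice a zombie is to keep a small bookkeeping table of every key ever seen, with a status flag; the sketch below uses illustrative names and is only one possible approach:

-- Keys we have seen before, but which are missing from today's extract,
-- are marked as dead...
UPDATE k
SET    k.KeyStatus = 'Dead'
FROM   KnownKeys k
WHERE  k.KeyStatus = 'Alive'
  AND  NOT EXISTS (SELECT 1 FROM Staging_Product s WHERE s.EAN = k.EAN);

-- ...and keys that show up again after being marked dead are zombies.
SELECT s.EAN
FROM   Staging_Product s
JOIN   KnownKeys k ON k.EAN = s.EAN
WHERE  k.KeyStatus = 'Dead';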

Unfortunately, zombie keys can be hard to tell apart from…

Dolly Keys

Named after the cloned sheep, because sheep are stupid and clones are hard to tell apart, even though they are two different entities. Dolly keys happen when a source deletes a key and later re-uses it to name a different entity.

Take social security numbers as an example. In Denmark, where I have a social security number, the key has the format DDMMYY-XXXX, with the first six digits being my birthday and the last four making the key unique (this, by the way, includes a checksum on the last digit). The quick reader may notice that this leaves only 10000 keys for people born on a given day (never mind the shift of centuries). This is fast becoming a problem in Denmark, where we are now considering using letters in the keys, or re-using keys, which amounts to a lot of trouble for systems storing social security numbers. When it comes to keys, just use a value that is big enough to hold all instances: the 32 or 64 bit integer space should suffice for all but the most extreme cases. 32-bit values were good enough for running the Internet; they are good enough for you!
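
Telling a Dolly apart from a zombie requires looking past the key to the attributes: if a “dead” key reappears with attributes that no longer match what we stored for it, it is probably a clone. A sketch with illustrative names:

-- A reappearing key whose attributes have changed is likely a Dolly,
-- not a zombie, and must be given a new entity in the warehouse.
SELECT s.SourceKey
FROM   Staging_Customer s
JOIN   KnownKeys k ON k.SourceKey = s.SourceKey
WHERE  k.KeyStatus = 'Dead'
  AND (s.CustomerName <> k.CustomerName
       OR s.BirthDate <> k.BirthDate);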

Multi Keys

It happens that a source system has more than one key for the same entity. This can be the result of historical corrections, and it may represent a perfectly valid scenario in the source.

For example, a source may have two rows for the same customer because the duplicate entry was not discovered at the point of data entry. When the error is discovered, both keys are set to point to the same entity in the source. When we extract the data, we are faced with correcting this error in the customer dimension.
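
In map terms, the fix is to re-point the duplicate source key at the surviving entity. A sketch against the hypothetical Map_Customer table from earlier, keeping history by closing the old row rather than deleting it:

-- Close the current mapping for the duplicate key...
UPDATE Map_Customer
SET    ValidTo = GETDATE()
WHERE  SourceSystemID = 1
  AND  SourceKey = 'CUST-2593'    -- illustrative duplicate key
  AND  ValidTo IS NULL;

-- ...and open a new mapping that points it at the surviving entity.
INSERT Map_Customer (SourceSystemID, SourceKey, EntityKey, ValidFrom, ValidTo)
VALUES (1, 'CUST-2593', 42, GETDATE(), NULL);   -- 42 = surviving EntityKey (illustrative)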

Summary

In this post I have provided an overview of typical source system key pathologies. It now rests on my shoulders to describe how to handle these pathologies in the ETL code.


The SSIS Balanced Data Distributor is released

May 26, 2011 7 comments

Those of you who have seen my data loading presentations at SQLBits and PASS may remember that I recommend explicitly multi-threading data loads to achieve higher load speeds.

This job has now become a lot easier in SSIS with the release of the Balanced Data Distributor (BDD). It is available here:

http://www.microsoft.com/downloads/en/details.aspx?FamilyID=ea0a1544-5ee4-4ad4-9d76-296d0632f162&displaylang=en

For maximum scale you will still want to do some form of manual threading, but the BDD should help you achieve better scale in many common scenarios.