Archive

Posts Tagged ‘Keys’

Clustered Indexes vs. Heaps

January 12, 2014 21 comments

At Stack Overflow the other day, I once again found myself trying to debunk a lot of the “revealed wisdom” in the SQL Server community. You can read the discussion here: Indexing a PK GUID in SQL Server 2012. However, this post is not about GUID or sequential keys, which I have written about elsewhere; it is about clustered indexes and the love affair that SQL Server DBAs seem to have with them.
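
For readers who want a mental picture of the two alternatives being compared, here is a minimal sketch; the table and constraint names are hypothetical and not taken from the post:

```sql
-- Two physical layouts for the same data: a B-tree ordered by the key vs. a heap.
-- Hypothetical names; a real test would also add whatever nonclustered indexes you need.
CREATE TABLE dbo.OrdersClustered
(
    OrderKey INT       NOT NULL,
    Payload  CHAR(200) NOT NULL,
    CONSTRAINT PK_OrdersClustered PRIMARY KEY CLUSTERED (OrderKey)
);

CREATE TABLE dbo.OrdersHeap
(
    OrderKey INT       NOT NULL,
    Payload  CHAR(200) NOT NULL,
    CONSTRAINT PK_OrdersHeap PRIMARY KEY NONCLUSTERED (OrderKey)  -- table itself stays a heap
);
```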

Read more…

Why You Need to Stop Worrying about UPDATE Statements

April 27, 2012 4 comments

There seems to be a myth perpetuated out there in the database community that UPDATE statements are somehow “bad” and should be avoided in data warehouses.

Let us have a look at the facts for a moment and weigh up if this myth has any merit.

Read more…

Exploring Hash Functions in SQL Server

November 6, 2011 16 comments

Hash distributing rows is a wonderful trick that I often apply. It forms one of the foundations for most scale-out architectures. It is therefore natural to ask which hash functions are most efficient, so we may choose intelligently between them.

In this blog post, I will benchmark the built-in hash functions in SQL Server. I will focus on answering two questions:

  • How fast is the hash function?
  • How well does the hash function spread data over a 32-bit integer space?
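
As a rough illustration of the second question (this is only a sketch, with the hypothetical names dbo.FactSale and SaleKey, not the benchmark harness used in the post), you can bucket the hashed keys and count rows per bucket:

```sql
-- Hash each (assumed integer) key with a built-in function, fold the result into
-- 16 buckets and count rows per bucket; an even spread is what we are hoping for.
WITH Hashed AS
(
    SELECT CAST(HASHBYTES('SHA1', CAST(SaleKey AS BINARY(8))) AS INT) AS HashValue
    FROM dbo.FactSale
)
SELECT (HashValue % 16 + 16) % 16 AS Bucket,    -- fold, handling negative hash values
       COUNT(*)                   AS RowsInBucket
FROM Hashed
GROUP BY (HashValue % 16 + 16) % 16
ORDER BY Bucket;
```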

Read more…

Why Surrogate Keys are not Good Keys

October 22, 2011 33 comments

History tracking in warehouses is a controversial discipline. In this post, I will begin to unravel some of the apparent complexities by taking apart the history tracking problem, piece by piece.

Read more…

Physically Placing the Maps in the architecture

August 19, 2011 2 comments

Before we leave the maps behind, I need to live up to my promise of describing the storage characteristics of the tables visited during the journey through the warehouse architecture. This must include the physical location of maps. Since I believe form must follow function in a DW, let us just recall their function:

From a functional perspective, I have shown you how map tables can be used to both track and correct source system keys. Maps are not visible to the end user, but they are a necessary part of the data’s journey from the source to the final data model. Maps also provide the abstraction of, or interface to, master data sources. In the absence of such sources, the maps can even serve as a makeshift master data repository.

Read more…

Transforming Source Keys to Real Keys – Part 2: Using Maps To Fix Key Problems

August 15, 2011 11 comments

In part 1 of this post, I introduced the idea of map tables. These tables serve as an abstraction between the source systems and the entities in the data warehouse. In this post, I will describe how you can use the maps to correct source key errors.

Using the taxonomy of key pathologies described earlier, I will walk you through some examples of map usage.
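
To give a flavour of the pattern before you read on: the basic move is to resolve every incoming source key to a warehouse key through the map. A hedged sketch, using the hypothetical names Maps.Customer, stage.CustomerExtract and EDW_Key:

```sql
-- Resolve dirty source keys to clean warehouse keys through the map.
SELECT m.EDW_Key,                        -- the warehouse key the source row resolves to
       e.*                               -- the extracted source columns
FROM stage.CustomerExtract AS e
JOIN Maps.Customer         AS m
    ON  m.SourceSystem = e.SourceSystem
    AND m.SourceKey    = e.CustomerNo;   -- the source system's own (possibly broken) key
```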

Read more…

Transforming Source Keys to Real Keys – Part 1: Introducing Map tables

August 4, 2011 17 comments

I have ranted enough about the key quality, or lack thereof, in source systems. It is time to look at how we will practically go about transforming the dirty source to the final DW model. What I shall propose is a de-coupling of the source entities from the data warehouse entities. As will be shown, this leads to an interesting ETL pattern that can either address, or properly encapsulate, a variety of source problems.
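
As a taste of what such a de-coupling table might look like, here is a minimal sketch; the schema, names and data types are illustrative assumptions, not the exact design from the post:

```sql
-- A map table: one row per source key, translating it to the warehouse key.
CREATE TABLE Maps.Customer
(
    SourceSystem TINYINT     NOT NULL,   -- which source system the key came from
    SourceKey    VARCHAR(50) NOT NULL,   -- the raw source key, warts and all
    EDW_Key      INT         NOT NULL,   -- the clean integer key used in the warehouse
    CONSTRAINT PK_Maps_Customer PRIMARY KEY (SourceSystem, SourceKey)
);
```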

Read more…

An Overview of Source Key Pathologies

July 30, 2011 10 comments

I previously made the point that source systems cannot be trusted to generate good keys. In this post, I will explore the pollution you may encounter as you dip your feet into the crude-oil-slicked ocean of data quality.

Let us look at what can go wrong with keys:

Composite Keys

This is perhaps the most common pathology. It occurs when a programmer believes it is better to use a composite key than a single column key. This is rarely motivated by thoughts of sharding or of worries that the 64 bit integer space may run out. Normally, it is driven by nothing but laziness.

Composite keys have several problems:

  • They are difficult for junior SQL coders to use correctly, especially in join conditions and IN / NOT IN filter clauses. Wrong usage may lead to wrong results that go unnoticed, as sketched below.
  • They typically perform much worse than single column keys. We care about this in a data warehouse.
  • It is difficult to create good statistics on composite keys to serve queries well. This can lead to poor query plans and unpredictable performance.

There ARE cases where composite keys make sense in a source system, but given the trouble they cause, the burden of proof rests on the designer of the composite key to show why it is superior to a single column key. As the ETL developer, you typically just have to accept the status quo.
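
To illustrate the first bullet, here is the classic mistake with a hypothetical two-column key (OrderID, LineNo): join on only part of the key and the rows fan out, silently inflating the totals.

```sql
-- Wrong: only half of the composite key is used, so each order line matches every
-- shipment row for the order and the quantities are multiplied.
SELECT o.OrderID, o.LineNo, SUM(s.ShippedQty) AS ShippedQty
FROM dbo.OrderLines AS o
JOIN dbo.Shipments  AS s
    ON s.OrderID = o.OrderID
GROUP BY o.OrderID, o.LineNo;

-- Right: every column of the composite key takes part in the join.
SELECT o.OrderID, o.LineNo, SUM(s.ShippedQty) AS ShippedQty
FROM dbo.OrderLines AS o
JOIN dbo.Shipments  AS s
    ON  s.OrderID = o.OrderID
    AND s.LineNo  = o.LineNo
GROUP BY o.OrderID, o.LineNo;
```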

Duplicate Keys

Mentioned for completeness. Even though good algorithms (both distributed and single machine) for ensuring key uniqueness have existed since the early 1970s, there are still source systems which have “keys” that are not real keys. If your key is not unique, it is not a key!

However, these “keys” are more common than you might think, especially when the source system delivers extracts that are the result of joins (some programmers may throw a DISTINCT in there for good measure).
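
A quick sanity check catches these before they reach the warehouse; a sketch with a hypothetical staging table and key column:

```sql
-- List "keys" that occur more than once in the extract; the result should be empty.
SELECT CustomerNo, COUNT(*) AS Occurrences
FROM stage.CustomerExtract
GROUP BY CustomerNo
HAVING COUNT(*) > 1;
```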

Zombie Keys

These keys die and then rise from the grave (Christians may prefer another term). This is perhaps best illustrated with an example:

Assume an inventory system that uses EAN as a key. When a product is in stock, the product table of the inventory system contains the EAN key. If a product goes out of stock, a cleanup job may run that removes all unused EAN keys. From the perspective of the ETL developer, it looks like the key just died, and we are left with the question of what to do about it. When the product later comes back in stock, the EAN number is inserted into the inventory product table again: the key has risen from the grave.
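
One way to spot the resurrection is to keep deleted keys in the map instead of removing them, and flag any that reappear; a sketch assuming a hypothetical DeletedDate column on the map:

```sql
-- A zombie key: marked deleted in the map, yet present again in today's extract.
SELECT m.SourceKey, m.DeletedDate
FROM Maps.Product         AS m
JOIN stage.ProductExtract AS e
    ON e.EAN = m.SourceKey
WHERE m.DeletedDate IS NOT NULL;
```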

Unfortunately, zombie keys can be hard to tell apart from…

Dolly Keys

Named after the cloned sheep, because sheep are stupid and clones are hard to tell apart, even though they are two different entities. Dolly keys happen when a source deletes a key and later re-uses it for a different purpose, to name a different entity.

Take social security numbers as an example. In Denmark, where I have a social security number, this key has the format YYMMDD-XXXX, with the first 6 digits being my birthday and the last four making the key unique (which, by the way, include a checksum in the last digit). The quick reader may notice that this leaves only 10,000 keys for people born on a given day (ignoring the shift of centuries). This is fast becoming a problem in Denmark, where we are now considering the use of letters in the keys, or re-using keys, which amounts to a lot of trouble for systems storing social security numbers. When it comes to keys, just use a value that is big enough to hold all instances: the 32 or 64 bit integer space should suffice for all but the most extreme cases. 32-bit values were good enough for running the Internet; they are good enough for you!

Multi Keys

It happens that a source system has more than one key for the same entity. This can be the result of historical corrections, and it may represent a perfectly valid scenario in the source.

For example, a source may have two rows for the same customer because the duplicate entry was not discovered at the point of data entry. When the error is discovered, both keys are set to point to the same entity in the source. When we extract the data, we are faced with correcting this error in the customer dimension.
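
With a map table in place, the correction is mechanical: point both source keys at the surviving warehouse key. A hedged sketch with made-up names and key values:

```sql
-- Collapse two source keys onto one warehouse key after the source merges them.
UPDATE Maps.Customer
SET EDW_Key = 1001                        -- the surviving warehouse key
WHERE SourceKey IN ('C-042', 'C-913');    -- both source keys now resolve to the same entity
```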

Summary

In this post I have provided an overview of typical source system key pathologies. It now rests on my shoulders to describe how to handle these pathologies in the ETL code.


Good keys, what are they like?

May 13, 2011 28 comments

A central value-add of data warehouses is their ability to restore the sanity that comes from using good keys. Taking a model-agnostic view of keys, they refer to “something” that is uniquely identifiable. Depending on what you are modeling, those “somethings” have different names, for example: entities, rows, tuples, cells, members, documents, attributes, object instances, and relations between any of these. Because this article is about relational databases, and because I don’t want to move up to the “logical” level with you (as I promised before), I will use the term “row” for the “something” that a key refers to, and the term “column” for the data structure that implements the key.

Moving on, the definition of a “good key” is a column that has the following properties:

  1. It is forced to be unique
  2. It is small
  3. It is an integer
  4. Once assigned to a row, it never changes
  5. Even if deleted, it will never be re-used to refer to a new row
  6. It is a single column
  7. It is stupid (it carries no meaning in itself)
  8. It is not intended to be remembered by users
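
To make the list concrete, here is a minimal sketch of a column that ticks those boxes: a system-generated integer surrogate with no meaning of its own. The table and column names are hypothetical.

```sql
CREATE TABLE Dim.Customer
(
    CustomerKey  INT IDENTITY(1, 1) NOT NULL,  -- small, integer, single column, stupid
    SourceKey    VARCHAR(50)        NOT NULL,  -- the source's own key, kept for traceability
    CustomerName NVARCHAR(200)      NOT NULL,
    CONSTRAINT PK_Dim_Customer PRIMARY KEY (CustomerKey)  -- forced to be unique
);
-- Properties 4 and 5 (never changes, never re-used) are conventions the ETL must
-- enforce; IDENTITY helps, but nothing prevents a careless reseed or update.
```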

Read more…

Why Integer Keys are the Right Choice!

April 26, 2011 4 comments

This blog subject has come up a lot during my years of consulting. Source systems, traditionally composed of shoddy code and poor data modeling, will often use “natural keys”. Such keys can be of any data type: strings (CHAR/VARCHAR), DECIMAL, or other inefficient types. It is my claim that keys in a warehouse should always be integers, and that it is not that difficult to implement proper keys. In this post, I will show you the price of using non-integer keys and let you be the judge of how much hardware you are willing to waste on lazy key choices.
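
To preview the kind of arithmetic involved, compare the raw size of an integer key with a typical string key; the numbers below are my own back-of-the-envelope illustration, not measurements from the post:

```sql
-- 4 bytes vs. 20 bytes per key value; over 10 million fact rows that is roughly
-- 38 MB vs. 191 MB of key data, repeated in every index that carries the key.
SELECT DATALENGTH(CAST(123456 AS INT))                    AS IntKeyBytes,   -- 4
       DATALENGTH(CAST('ORD-2011-00123456' AS CHAR(20)))  AS CharKeyBytes;  -- 20
```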

Read more…