Why Integer Keys are the Right Choice!

This subject has come up a lot during my years of consulting. Source systems, traditionally composed of shoddy code and poor data modeling, will often use “natural keys”. Such keys can be of almost any data type: strings (CHAR/VARCHAR), DECIMAL, or other inefficient types. My claim is that keys in a warehouse should always be integers, and that it is not that difficult to implement proper keys. In this post, I will show you the price of using non-integer keys and let you be the judge of how much hardware you are willing to waste on lazy key choices.

There are several reasons quoted for using non-integer keys in source systems:

One is the use of “intelligent keys”, keys containing some form of information about the entity they represent. I will give this choice the proper beating it deserves in a future post.

Another reason I hear is that character keys are “good enough”. A convenient choice has been made that makes the key easy to render. For example, the key may be stored as a zero-padded string of digits instead of a proper integer type. Putting rendering choices into the database has its own problems, of course – but if you read data off most ERP systems, chances are you are reading zero-padded integers turned into strings.

Yet another reason quoted is that the keys arrive at the warehouse from a master data management system. Why a master data system would ever spit out a non-integer key is a mystery to me – for reasons that I hope will become obvious in my next post. But I have seen it happen: master data systems generating character keys.

Believe it or not, sometimes the reason quoted for using non-integer keys is that the key is generated as a string concatenation of several columns to make it unique.
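
To make the zero-padded and concatenated patterns above concrete, here is a hypothetical sketch of what such keys tend to look like (the names and values are made up for illustration):

    /* Hypothetical examples only - names and values are made up */

    /* Pattern 1: an integer key stored as a zero-padded string */
    DECLARE @ItemNo CHAR(10) = '0000001234'
    SELECT CAST(@ItemNo AS INT) AS ProperIntegerKey              -- 1234

    /* Pattern 2: a "unique" key built by concatenating several columns */
    DECLARE @CompanyCode INT = 42, @AccountNo INT = 1001700
    SELECT CAST(@CompanyCode AS VARCHAR(10)) + '-'
         + CAST(@AccountNo   AS VARCHAR(10)) AS ConcatenatedKey  -- '42-1001700'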

From a best practice perspective, Kimball recommends using integer surrogate keys for many of the same reasons I give here. But this guidance is often ignored, and source system keys are loaded directly into fact tables or other warehouse models to “save time” on development.

The basic fact, which I will now proceed to prove to you, is that non-integer keys are a waste of CPU cycles! CPUs deal better with integers than with string types. But just how bad is it? Let me illustrate with an example. Let us consider three competing design choices for a fact and dimension table:

  • Using a 4-byte integer key, as per Kimball/old-style data modeling traditions
  • Using a character key with some basic case insensitive intelligence built in. To make a fair comparison, we shall use a CHAR(4), which takes up the same space as the integer key.
  • Using a character key as above, but being a bit more cunning about the collation choice. To improve join speed, we will use a binary collation (which makes string comparisons faster).
    The following code generates the 64K-row (65,536-row) dimension tables representing these three choices. It also generates three matching fact tables of roughly 6.5M rows each (100 fact rows per dimension row):
      USE tempdb
      GO
      SET NOCOUNT ON
      GO
      /* Create dimension tables with different key types */ 
      CREATE TABLE Dim_Int
      ( 
        SK INT NOT NULL PRIMARY KEY
        , A1 INT NOT NULL
      )
      CREATE TABLE Dim_Char
      ( 
        SK CHAR(4) COLLATE Latin1_General_CI_AS NOT NULL PRIMARY KEY
        , A1 INT NOT NULL
      )
      CREATE TABLE Dim_Char_BIN
      ( 
        SK CHAR(4) COLLATE Latin1_General_100_BIN2 NOT NULL PRIMARY KEY
        , A1 INT NOT NULL
      )
      
      /* Populate the dimension tables: 65,536 rows, with A1 taking one of 100 distinct attribute values to group by */
      DECLARE @i INT
      SET @i = 0
      WHILE @i < 65536 BEGIN
        INSERT Dim_Int (SK, A1) VALUES (@i, @i % 100)
        SET @i = @i + 1
      END
      
      /* Same keys, rendered as zero-padded 4-character hex strings */
      INSERT Dim_Char
      SELECT RIGHT('0000' + dbo.fn_convert_to_base(SK, 16), 4), A1
      FROM Dim_Int
      
      INSERT Dim_Char_BIN
      SELECT * FROM Dim_Char
      
      /* Create fact tables */
      CREATE TABLE Fact_Int
      (
        SK INT NOT NULL
       ,M1 SmallMoney NOT NULL
      )
        
      CREATE TABLE Fact_Char
      (
        SK CHAR(4) COLLATE Latin1_General_CI_AS NOT NULL
       ,M1 SmallMoney NOT NULL
      )
      CREATE TABLE Fact_Char_BIN
      (
        SK CHAR(4) COLLATE Latin1_General_100_BIN2  NOT NULL
       ,M1 SmallMoney NOT NULL
      )
      /* Validate that we are indeed using the same space for dimensions */
      EXEC sp_spaceused 'Dim_Int'
      EXEC sp_spaceused 'Dim_Char'
      EXEC sp_spaceused 'Dim_Char_BIN'
      
      
      
      /* 100 fact rows per dimension row gives ~6.5M rows per fact table */
      INSERT Fact_Int WITH (TABLOCK)
      SELECT D.SK, BINARY_CHECKSUM(D.SK, D.A1) % 1000 AS M1
      FROM Dim_Int D
      CROSS JOIN dbo.fn_nums(100)
      
      /* This might take a little while */
      INSERT Fact_Char WITH (TABLOCK)
      SELECT RIGHT('0000'+dbo.fn_convert_to_base(D.SK, 16),4)
      , F.M1 
      FROM Fact_Int F JOIN Dim_Int D ON D.SK = F.SK
      
      INSERT Fact_Char_BIN WITH (TABLOCK)
      SELECT * FROM Fact_Char
      
      /* Validate that we are indeed using the same space for facts */
      EXEC sp_spaceused 'Fact_Int'
      EXEC sp_spaceused 'Fact_Char'
      EXEC sp_spaceused 'Fact_Char_BIN'
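
    Note that the script above calls two helper functions that are not defined in this post: dbo.fn_nums(@n), which returns a table of @n rows used to fan out the fact rows, and dbo.fn_convert_to_base(@value, @base), which renders an integer as a string in the given base (here, hex). If you want to reproduce the test, here is a minimal sketch of what they could look like (this is my assumption; your own versions may well differ):

      CREATE FUNCTION dbo.fn_nums(@n INT)
      RETURNS TABLE
      AS RETURN
      (
        /* Returns @n rows, numbered 1..@n */
        SELECT TOP (@n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM sys.all_objects a CROSS JOIN sys.all_objects b
      )
      GO

      CREATE FUNCTION dbo.fn_convert_to_base(@value BIGINT, @base INT)
      RETURNS VARCHAR(64)
      AS
      BEGIN
        /* Converts a non-negative integer to a string in bases 2 through 16 */
        DECLARE @digits CHAR(16) = '0123456789ABCDEF'
        DECLARE @result VARCHAR(64) = ''
        IF @value = 0 RETURN '0'
        WHILE @value > 0
        BEGIN
          SET @result = SUBSTRING(@digits, CAST(@value % @base AS INT) + 1, 1) + @result
          SET @value = @value / @base
        END
        RETURN @result
      END
      GO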

    Now we can run our typical data warehouse query: a join of the dimension with the fact table, grouping on a dimension attribute:

    SET NOCOUNT ON
    SET STATISTICS TIME ON
    SET STATISTICS IO ON
    GO
    
    
    /* Int Join */
    SELECT D.A1, SUM(F.M1)
    FROM Fact_Int F
    JOIN Dim_Int D ON D.SK = F.SK
    GROUP BY D.A1
    OPTION (MAXDOP 1)
    GO
    
    
    /* CHAR(4) join */
    SELECT D.A1, SUM(F.M1)
    FROM Fact_Char F
    JOIN Dim_Char D ON D.SK = F.SK
    GROUP BY D.A1
    OPTION (MAXDOP 1)
    GO
    
    
    /* CHAR(4) join with Binary collation*/
    SELECT D.A1, SUM(F.M1)
    FROM Fact_Char_BIN F
    JOIN Dim_Char_BIN D ON D.SK = F.SK
    GROUP BY D.A1
    OPTION (MAXDOP 1)
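
    If you want to confirm which join operator the optimizer chooses for these queries (a big/small join like this typically gets a hash join), one way is to capture the actual plan for a run; a sketch, using the integer variant as the example:

    SET STATISTICS XML ON
    GO
    SELECT D.A1, SUM(F.M1)
    FROM Fact_Int F
    JOIN Dim_Int D ON D.SK = F.SK
    GROUP BY D.A1
    OPTION (MAXDOP 1)
    GO
    SET STATISTICS XML OFF
    GO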

    Running on my laptop with a warm cache (taking I/O out of the equation with zero physical reads) gives me this result:

    Key Choice                                     CPU Time    Clock Time    Logical Reads (Fact / Dim)
    4-byte integer                                 ~7 sec      ~7 sec        13770 / 140
    CHAR(4), case insensitive, accent sensitive    ~12 sec     ~12 sec       13770 / 140
    CHAR(4), binary collation                      ~8.5 sec    ~8.5 sec      13770 / 140
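
    To reproduce the warm-cache numbers, run each query a couple of times and take the later timings. If you would rather measure cold-cache behaviour, you can flush the buffer pool first (a sketch; only do this on a test server):

    /* Test servers only: flush dirty pages, then drop all clean pages from the buffer pool */
    CHECKPOINT
    DBCC DROPCLEANBUFFERS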

    This should make it pretty clear that it is worth using an integer key to save yourself some money on hardware and to spare your users the frustration of waiting for joins to return. Also note that the above effect will become even more pronounced as you add more dimensions to the fact table.

    If you are stuck with character keys, the results also show that you will most likely benefit from using a binary collation (if you can live with the semantics implied by that choice).
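
    If you cannot change the column collation itself, you can also apply a binary collation in the join predicate at query time. Here is a sketch against the tables above; note that overriding a column's collation like this can prevent an index seek on that column:

    /* Force a binary comparison at query time (overrides the CI_AS collation of the columns) */
    SELECT D.A1, SUM(F.M1)
    FROM Fact_Char F
    JOIN Dim_Char D
      ON D.SK = F.SK COLLATE Latin1_General_100_BIN2
    GROUP BY D.A1
    OPTION (MAXDOP 1)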

    1. October 24, 2011 at 03:11

      Nice research.

    2. October 24, 2011 at 09:17

      I totally agree with you in terms of a surrogate key approach, but as we move into an era of SSDs, with the random access seeks and throughput they give us, coupled with the need to scale out rather than have one big monolithic database, GUIDs should be considered. Yes, there is fragmentation, yes, more cache is used, and yes, possibly more latching, but in a BI scenario (big queries with lots of IO) those problems are negated.

      PS – great article.

      • Thomas Kejser
        October 24, 2011 at 10:54

        Hi Tony

        I agree that we will have to worry much less about sequential IOPS in the future.

        HOWEVER, I don't think this makes the GUID point a valid one. GUIDs suffer from several problems (fragmentation being the smallest). They do not fit in CPU registers (which makes them expensive for the CPU), they occupy a lot of space in the L2 caches, and they generally make everything larger. While GUIDs may make more sense in OLTP, I don't think they have a place in data warehouses.

        The INSERT scale test I performed in a previous blog post is not that relevant for warehouses; dimensions typically have a MUCH lower INSERT rate, so the latching is less of a concern there. For fact tables, you are in bulk mode anyway, and you are not generating unique keys for those tables.

        With regards to loop joins: Random IOPS are not the only problem with loops. They consume a LOT more CPU than hash joins when you have the big/small join that is typical for warehouses. One of the reasons for this is that loop joins have to do “pointer walking” to crawl from the top of the B-tree to the leaf. If the next page needed is not in L2 cache, such a pointer walk is really expensive (it requires over 100 CPU cycles to do a memory fetch). Hash tables are much more efficient, because they allow you to jump directly to the data you need.

        So even though we will get super fast random IOPS in the future, I don't think we will see a new rise of loop join strategies 🙂
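
        If you want to see the difference for yourself on the tables from the post, you can force each strategy with a join hint (a sketch for experimentation only; never ship join hints like this in production code):

          /* Force a hash join (the plan the optimizer normally picks here) */
          SELECT D.A1, SUM(F.M1)
          FROM Fact_Int F
          INNER HASH JOIN Dim_Int D ON D.SK = F.SK
          GROUP BY D.A1
          OPTION (MAXDOP 1)

          /* Force a loop join: one seek into the dimension B-tree per fact row */
          SELECT D.A1, SUM(F.M1)
          FROM Fact_Int F
          INNER LOOP JOIN Dim_Int D ON D.SK = F.SK
          GROUP BY D.A1
          OPTION (MAXDOP 1)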
