
Intermezzo–Data Modeling

From the very nice WordPress Dashboard, I can see that I now have over 100 regular readers. Comments are flowing in too.

Thanks to everyone who is listening; it is my hope that this blog can be a great place for debates about data modeling and high scale performance tuning. I was surprised that it was hard to find concrete guidance about data modeling for the warehouse on the web. Perhaps I am missing something out there?

UPDATE (6th July 2012): This post is kept here for historical reasons. For the latest updates, please see my DW and Big Data page.

My branch predictor tells me that it is time to beg for your patience in advance, for three reasons:

1) I have a very combative personality – I love arguing with people. I also have a super low tolerance for bullshit/bollocks (depending on your dialect of English). My reason for behaving like this is to discover the truth, which I am not only convinced exists (I am very far from a relativist), but which I also believe can only be found through rigorous testing of hypotheses, rational inquiry and expressing yourself with clarity. This means that I may disagree vehemently with your opinion, but please don’t take it personally – disrespect and disagreement are two different things. I can almost guarantee that I will be stepping, battle axe swinging, into the religious battle of Inmon vs. Kimball during this argument.

2) English is not my native language. I make spelling mistakes and sometimes my typing goes out of sync with my thinking. I will read over my own posts and make minor corrections a few times AFTER publishing (having had a cup of coffee in the intervening time). I prefer to get something out there fast as a brain dump, to get the debate going. That being said, I understand the need for semi-permanent content: if I make a change in the argument or a major update, I will point out the changes with a bit of manual history tracking.

3) My Data Modeling argument will be rather long. There are a lot of assumptions that I need to pin down in order to proceed with the argument – and I want to establish those first before moving on to the final conclusions. I want to take away most of the “it depends” that is so prevalent in this area. There is also a lot of common ground between Kimball and Inmon that I want to acknowledge.

To give you an idea of the trajectory of the argument, it goes something like this:

  • Integer keys are the right choice
  • Keys in a data warehouse must have certain properties
  • Source Systems cannot be relied on to provide good keys – ever!
  • Surrogate Keys are not good keys!
  • It is not worth modeling the source system
  • Smart reloads of the warehouse (built to scale and to restrict impact) are preferable to modeling for flexibility
  • Business users should nearly always be presented with a pure star schema (not a snowflake)
  • Normalization techniques should be severely restricted in the relational warehouse
  • Update statements are NOT (that) expensive if done right!
  • Tracking history must be done with care and targeted to the end user – modeling for flexibility here leads to disaster
  • Guidance on how to handle changing business requirements in history tracking
  • Dimensional Modeling (with some modifications to Kimball’s model) is the right model choice for the relational data warehouse
  • Surrogate keys should still be used for certain scenarios (contradicting part of my own argument above 🙂)
  • With proper data modeling, you can build a very large data warehouse that is flexible, high performance and cheap.

You can think of each bullet above as mapping approximately to a blog entry – so there is a bit of a way to go. You may already disagree with me, so please chime in as I move the argument forward.

 

Changes since publication:

  • Added bullet about my bad English
  • Added information about common ground between Inmon/Kimball
  • 2011-07-30 – Added links to new entries
  1. Gavin
    May 12, 2011 at 05:48

    Hi Thomas,

    So, now you have 101 readers 🙂

    Interesting note in here on keys. So, if source system keys are bad keys, and surrogate keys are bad keys… I’d love to hear the end of the argument on what does make a good key. Possibly a good blog would be unpacking what you feel are the “certain properties” that keys need to have?

    Data Modelling is top of mind for me at the moment, as we’re busy modelling a corporate DW, trying to incorporate data from everywhere (the usual consolidation exercise), and we are aiming to use surrogate keys in the DW (even though we are not fully star schema’d).

    I’m also trying to avoid UPDATE statements wherever possible, so I’m keen to hear how we can speed up those expensive statements 🙂

    All in all, good thoughts, and looking forward to discussing them.

    • April 27, 2012 at 18:41

      Hi Gavin

      Glad to have you joining the fray here 🙂 Welcome.

      With regards to what constitutes a good key, I have already blogged about this here: http://blog.kejser.org/2011/05/13/good-keys-what-are-they-like/

      Basically, what I am advocating is to SOMETIMES split the generation of the entity key (the “Type 1 dimension” if you like) from the “history tracking key”. Incidentally, this is the one place I agree with the Data Vault technique. The argument for “splitting history” boils down to a tradeoff between doing BETWEEN style non-equi joins and a pure integer join.
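
      To make that tradeoff concrete, here is a minimal sketch of the two join styles; the table and column names are illustrative assumptions of mine, not taken from the post:

        -- Option A: history is tracked against the entity key with validity
        -- dates, so facts are resolved with a BETWEEN-style non-equi join.
        SELECT f.SalesAmount, c.CustomerSegment
        FROM   dbo.FactSales          AS f
        JOIN   dbo.DimCustomerHistory AS c
               ON  c.CustomerKey = f.CustomerKey
               AND f.TransactionDate BETWEEN c.ValidFrom AND c.ValidTo;

        -- Option B: a separate history-tracking key is stamped onto the fact
        -- row at load time, so the query collapses to a pure integer equi-join.
        SELECT f.SalesAmount, c.CustomerSegment
        FROM   dbo.FactSales          AS f
        JOIN   dbo.DimCustomerHistory AS c
               ON c.CustomerHistoryKey = f.CustomerHistoryKey;

      The price of Option B is that the loader has to resolve the history key once, during ETL, instead of every query repeating the range lookup at read time.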

      Two big questions back to you:

      1) Why not use a star schema?
      2) Why are you afraid of UPDATE statements?

  2. JC
    April 26, 2012 at 07:25

    Hi Thomas,

    The link to [Integer keys are right choice] is broken. It should point to http://blog.kejser.org/2011/04/26/why-integer-keys-are-the-right-choice/

    What is your recommendation for generating the integer keys? Do you use an auto-increment id, or, like Oracle, source them from a sequence?

  3. April 27, 2012 at 18:35

    Hi JC (John C?)

    For generating keys in SQL Server, you basically have two choices (both of them good):

    1) Use the IDENTITY column type on the Entity tables (since this is where the entity is maintained)
    2) Use a SEQUENCE object (only available in SQL Server 2012 and later)

    It is also possible to generate the keys outside the database, for example in the ETL flow. However, this technique raises some concurrency concerns: you need a central point where keys are generated so that multiple ETL servers can synchronize their key generation. A database is a pretty good place for such a central point.
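
    To make the two options concrete, here is a minimal T-SQL sketch; the table, column and sequence names are illustrative assumptions of mine, not taken from the post:

        -- 1) IDENTITY column on the entity table: the key is generated
        --    automatically when the row is inserted.
        CREATE TABLE dbo.DimCustomer
        (
            CustomerKey     INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
            SourceSystemKey NVARCHAR(50)      NOT NULL
        );
        INSERT INTO dbo.DimCustomer (SourceSystemKey) VALUES (N'CRM-00042');

        -- 2) SEQUENCE object (SQL Server 2012 and later): the key is drawn
        --    explicitly from a central sequence at insert time.
        CREATE SEQUENCE dbo.ProductKeySeq AS INT START WITH 1 INCREMENT BY 1;
        CREATE TABLE dbo.DimProduct
        (
            ProductKey      INT          NOT NULL PRIMARY KEY,
            SourceSystemKey NVARCHAR(50) NOT NULL
        );
        INSERT INTO dbo.DimProduct (ProductKey, SourceSystemKey)
        VALUES (NEXT VALUE FOR dbo.ProductKeySeq, N'ERP-00099');

    Either way, the database remains the central point of key generation, which sidesteps the concurrency concerns of handing out keys from multiple ETL servers.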

