
Why Surrogate Keys are not Good Keys

October 22, 2011

History tracking in warehouses is a controversial discipline. In this post, I will begin to unravel some of the apparent complexities by taking the history tracking problem apart, piece by piece.

Observation: History tracking in an EDW is not the same as archiving the source.

I have already blogged about using Hadoop or "dumb databases" for archival of source system data. The purpose of these archives is to avoid the discussion with the business user about losing data: we can always bring back the source if we do it right. I also voiced the opinion that you should not bother with modeling the source, and that it is trivial to automate this archival process in such a way that the source can be restored to any point in time.

Observation: History tracking is part of the business requirements for a warehouse.

We will only track history on data that the end user needs history about. We will not model history just for the sake of modeling it.

Observation: Fact table history is very different from dimension table history.

I hope it is clear that tables containing fact data typically don't act like dimension/reference tables. I will assert that OLTP designers (the guys we get our source data from) have known and agreed upon this simple observation for years: tables behave differently and fall into certain patterns. For example: most OLTP systems have tables that are "add only", which means we don't need to track history on them (they are already historized). Another example: by following the well laid out arguments in Kimball's books on how to model inventory systems, it is also clear that we need new modeling tricks, for example "snapshots", which are quite distinct from the way we model dimensions.

In other words: just because a fact table has from/to dates does not mean we should use a type 2 dimension pattern to model it.
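A rough illustration, with hypothetical table and column names: a periodic snapshot fact captures the state of each item at a fixed grain, so history lives in the grain itself rather than in type 2 style versioning.

-- Hypothetical periodic snapshot fact: one row per product, per warehouse, per day.
-- History is implicit in the grain; no from/to dates or surrogate key versioning.
CREATE TABLE FactInventorySnapshot (
    DateKey        int NOT NULL,   -- the snapshot date
    ProductKey     int NOT NULL,
    WarehouseKey   int NOT NULL,
    QuantityOnHand int NOT NULL,
    CONSTRAINT PK_FactInventorySnapshot
        PRIMARY KEY (DateKey, ProductKey, WarehouseKey)
);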

Dimensions, on the other hand, have a property that facts rarely have: the history we model on these tables depends a lot on what we are trying to achieve for the business user, and their behavior is not dictated by the behavior of the source.

Modeling Facts

Kimball has written extensively about modeling fact data, and I mostly agree with his observations and techniques. I am not even sure this statement counts as disagreement, but I would like to point out that all the talk about "choosing the grain" generally doesn't make any sense to me: just pick the lowest possible grain you can get from the source.

Modeling Dimensions

After this initial tap-dance, let me turn my attention to the subject I really want to talk about: modeling dimensions and the problem with surrogate keys.

Let me start from a set of commonly perceived problems for Kimball style warehouses. Imagine you have a model like the below:

[image: example dimensional model with a Customer dimension]

Now extend your mental image to include a situation where we change our mind about the history tracking in the customer dimension. We obtain knowledge (maybe through data cleansing) about the education of our customers. There are two ways to handle this:

1) In the simplest model, we may simply add this information "going forward". Our old, historical records remain unchanged, but we can analyze the future data by education. This is not the problem I am concerned about, and I think this modeling trick has limited use.

2) In the more complex model, we may wish to change history "backwards in time". This means that the data model may change like this:

[image: the revised model, with the Customer dimension transformed into Customer' (orange arrow)]

It is the complex model that seems to worry the opponents of dimensional modeling for Enterprise Warehousing, especially the transformation indicated by the orange arrow above.

Here is the crux of the argument against dimensional modelers, as I have understood it from my many discussions with warehouse specialists:

Problem 1: changing your mind about history means that ALL fact tables that reference the Customer dimension must change to reference the new keys in Customer’. For some reason, this is perceived as a large dependency problem that is insurmountable at scale.

Answer 1: if you don't have a meta model (or naming convention) that allows you to quickly (literally in minutes) analyse the impact on your database of making such a change, you are doing something very wrong with your dimensional model. This problem has been solved several times by builders of dimensional enterprise warehouses – I am sure you can solve it too.
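As a sketch of the kind of impact analysis I mean: on SQL Server, the catalog views alone will get you most of the way, assuming the fact tables actually declare foreign keys to the dimension (a hypothetical dbo.DimCustomer here); if they don't, a naming convention query does the same job.

-- List every table that holds a foreign key into the Customer dimension.
SELECT
    OBJECT_SCHEMA_NAME(fk.parent_object_id) AS FactSchema,
    OBJECT_NAME(fk.parent_object_id)        AS FactTable,
    fk.name                                 AS ForeignKeyName
FROM sys.foreign_keys AS fk
WHERE fk.referenced_object_id = OBJECT_ID('dbo.DimCustomer');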

Problem 2: changing the history tracking of the dimension requires large update statements to all facts that reference it.

Answer 2: this issue has some merit, but it depends on the implementation of the database you work on. It is true that in a multi TB system, such an update can be painful. But even in an SMP system, I would temper this statement with the comment: "but not as painful as you may think". This will be the subject of a future post.
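To make the cost concrete: the change boils down to one large UPDATE per fact table, driven by a mapping from old to new surrogate keys (a minimal sketch with hypothetical table names).

-- CustomerKeyRemap holds one row per old surrogate key, with the key it maps to
-- in the rebuilt Customer' dimension.
UPDATE f
SET    f.CustomerKey = m.NewCustomerKey
FROM   dbo.FactSales AS f
JOIN   dbo.CustomerKeyRemap AS m
       ON m.OldCustomerKey = f.CustomerKey;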

We should also observe that the impact on fact tables means that the surrogate keys we use in Kimball models are not good keys (as per my previous definition). The reason they are not good keys is that Kimball surrogate keys track two things:

  1. The uniqueness of the dimension keys
  2. The history of the entity

When one of these two is volatile (the history), the entire model becomes volatile. This is not a desirable trait of a large warehouse.
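One way to illustrate the split (a sketch only, with hypothetical names – not a prescription from this post): give the entity a durable key and keep the volatile, history-tracking key in a separate structure, so that only the latter has to move when you change your mind about history.

-- Durable entity key: stable, never changes.
CREATE TABLE DimCustomerEntity (
    CustomerEntityKey int NOT NULL PRIMARY KEY,
    CustomerSourceKey nvarchar(50) NOT NULL      -- business key from the source
);

-- Version key: carries the type 2 history and is the part that changes
-- if the history tracking rules change.
CREATE TABLE DimCustomerVersion (
    CustomerVersionKey int NOT NULL PRIMARY KEY,
    CustomerEntityKey  int NOT NULL
        REFERENCES DimCustomerEntity (CustomerEntityKey),
    ValidFrom date NOT NULL,
    ValidTo   date NOT NULL,
    Education nvarchar(50) NULL                  -- a type 2 tracked attribute
);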

Problem 3: not every business user wants the same version of history tracking, so a single type 2 dimension table is likely not going to do it.

Answer 3: this is also true for some scenarios, and it hints at a model where there are multiple versions of each dimension and a new way to link facts and dimensions together. We observe that several warehouses are built where this never happens, so we should be careful not to over-generalize the modeling technique to always take an edge case like this into account.

Here we observe that this additional requirement does not mean that we need to abandon the notion of type 1 columns – we just have an additional requirement which we could call "multiple views of type 2 columns".
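Continuing the hypothetical sketch above, "multiple views of type 2 columns" could be as simple as more than one version table over the same durable entity keys, one per business view of history.

-- A second, independently tracked history over the same customer entities
-- (hypothetical names; one such table per business view of history).
CREATE TABLE DimCustomerVersion_Marketing (
    CustomerVersionKey int NOT NULL PRIMARY KEY,
    CustomerEntityKey  int NOT NULL,   -- same entity keys as the sketch above
    ValidFrom date NOT NULL,
    ValidTo   date NOT NULL,
    Segment   nvarchar(50) NULL
);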

Summary

The three above problems (of which only the last two have any merit) are quoted, in different forms, by the persons I have heard argue that dimensional models are unsuitable for an "Enterprise" Data Warehouse.

The solution, and this baffles me, seems to be: use a 3NF data model instead of Kimball’s model. I think this 3NF approach is a fallacy and I have witnessed the horrors that arise from it. But as I have admitted, some of these arguments DO have merit and should be addressed in the model.


  1. October 23, 2011 at 22:53

    Completely agree.
    The way we handle this in SQLBI Methodology is:
    1) Separate DWH (entity normalized, but not 3NF) from DM (pure Kimball with surrogate keys)
    2) Use a "Mirror" db to create history of data that changes over time but that you have not yet modeled (Hadoop storage is definitely an option – but for dimensions, our MirrorManager is a quick and cheap way to handle that with SQL Server)
    3) Regenerate DM (or create a new one) from DWH whenever you need a new view over a dimension (SCD2 or not)

    We effectively handle the problem you described in this way. Yes, we have yet another copy of the data (maybe two if you consider Mirror + DWH + DM). But this is the cost of flexibility and of not losing any data coming from your data sources.

  2. Thomas Kejser
    October 23, 2011 at 22:57

    Marco: I think the big question is what the DWH needs to look like. What sort of 3NF rules apply there and what guidance is used.

    I will also have a thing or two to say about the way type 2 history is handled and how to scale it. Stay tuned 🙂

    • October 23, 2011 at 23:17

      Our common practice is:
      – two tables for each entity
      – first table has the business key or "DWH key" with entity cardinality. One customer, one row, regardless of the versions. Only attributes that don't change over time (e.g. birthdate) are here
      – second table has an FK to the first table and a PK composed of the DWH key + HistoryCount (a counter for each version of the dimension). Changing attributes are here. If you add changing attributes, you rebuild this table starting from Mirror. The "current HistoryCount" is maintained in the first table in order to obtain an SCD1 view in a very efficient way (a rough sketch of these two tables follows below)
      – "events" tables in the DWH (e.g. Sales) do not contain HistoryCount, just the DWH key for each entity
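      A minimal DDL sketch of what this two-table layout might look like (table and column names are hypothetical, not taken from our paper):

      -- First table: one row per customer, immutable attributes plus a pointer
      -- to the current version.
      CREATE TABLE DwhCustomer (
          CustomerDwhKey      int NOT NULL PRIMARY KEY,
          CustomerBusinessKey nvarchar(50) NOT NULL,
          BirthDate           date NULL,
          CurrentHistoryCount int NOT NULL   -- latest version in the table below
      );

      -- Second table: one row per customer version, changing attributes only.
      CREATE TABLE DwhCustomerHistory (
          CustomerDwhKey int NOT NULL
              REFERENCES DwhCustomer (CustomerDwhKey),
          HistoryCount   int NOT NULL,
          Education      nvarchar(50) NULL,
          CONSTRAINT PK_DwhCustomerHistory PRIMARY KEY (CustomerDwhKey, HistoryCount)
      );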

      We can build incremental ETL for both DWH and DM in this way, but both DWH and DM can be rebuilt at any time without requiring special ETL (the ETL we write automatically adapts to available data).
      Partitioning is mandatory and we usually rebuild partitions and don’t just add rows.

      Possible issues: very large dimensions (over 100 million items – performance for loading the dimension might be critical), and fast-changing attributes (an attribute that changes every day should be moved out of the dimension – it becomes an "event" table in the DWH).

      But there is nothing written in stone – it's nice to continue the conversation on this topic!

      • Thomas Kejser
        October 23, 2011 at 23:39

        Overall, I agree with the approach.

        I think there are some open questions though:

        – Normalize the dimensions or not in the DWH?
        – Reload the mart, or add a "history column" on the main fact table (i.e. don't use marts at all)?
        – If you reload marts, how do you make sure you get the right query plan? (more about this later)

        Agree that VERY fast-changing attributes should be denormalized into the fact table for manageability. A common Kimball trick that works very well.

      • October 24, 2011 at 09:36

        Hi Marco,

        Can you elaborate on this design pattern?
        Which keys are included in the fact tables? The DWH key or the composed history key?

        Would be nice to read a case study on this design approach.

        \Rafi

      • October 24, 2011 at 10:03

        sirpadk :
        Hi Marco, can you elaborate on this design pattern? Which keys are included in the fact tables – the DWH key or the composed history key?

        We use the DWH key in the fact table in the DWH. Finding the history key is delegated to the ETL that feeds the DM (only for those DMs that require it).
        We published a white paper (http://www.sqlbi.com/sqlbimethodology.aspx) that describes the methodology, even if this pattern is probably not so well described in the documentation. We explained it better in a few presentations, but we haven't had time to update the paper – at least until the 2nd half of 2012 it will be hard to do that, as we're working on a BISM Tabular book now.

        Marco

  3. October 23, 2011 at 23:08

    Enjoying very much your views on large scale data warehousing. At my place of employment we suffer from many of the same issues you bring forth (modeling the source, attempting to keep history for history's sake, etc.) and I like your ideas on a sound architecture to solve these issues. There is one thing I am missing. How do you address the need to interface several OLTP systems together and avoid point-to-point interfaces? We solve this by using a 3NF ODS, but I have read that you advocate this is not needed. I see how the dimensional architecture effectively solves analytical reporting needs, but I am struggling to see how I would solve operational interface needs. Any thoughts? Thanks again for continuing to post such great information.

    • Thomas Kejser
      October 23, 2011 at 23:16

      Hi Todd

      Thanks for the compliments and I am happy to hear you enjoy the blogs.

      Let me try to understand your question in more detail before I attempt an answer: When you speak about interfacing OLTP systems together, do you mean transferring data from one OLTP system to another?

      • October 24, 2011 at 01:20

        Sorry Thomas – I realize I posted anew instead of replying. Putting my reply here: Yes, exactly that. We have a good number of distinct OLTP systems that require data from other OLTP systems to function. Instead of developing point-to-point interfaces we have created a "hub" that is an Inmon style ODS. Lots of ETLs to move data from each OLTP into the ODS and then more ETLs to move data from the ODS back into a consuming OLTP. Some of the OLTPs also consume data directly from the ODS. We have designated a single OLTP to be the "system of record" for a given data fact, and the other OLTPs that consume that given data fact to be a "system of reference". Thus changes to the given data fact occur only in the system of record. Our ODS is in SQL Server 2008. Thanks in advance for any thoughts you have, and I'm happy to provide more details to you if you'd like.

      • Thomas Kejser
        October 24, 2011 at 04:32

        Hi Todd

        OK, I understand the scenario now. Thanks for elaborating.

        There are a few things that make OLTP integration different from warehousing. I would venture a guess that you have quite different characteristics than a typical DW, including, but not limited to:

        1) Master data cleaning in the ODS
        2) Near real time integration that is transactionally consistent
        3) Less need for keeping history (once every "subscriber" has a copy, you can discard the data). It follows that the data is MUCH smaller than in a typical DW
        4) A data model targeted for fast, single row broadcasting of changes
        5) If you store old data, it is "dimension" or "reference" data only (I would assume you don't keep terabytes of transactional data around once everyone has the copy they need?)
        6) The goal is integration, not reporting; the system is accessed by machines, not humans

        What you are describing is essentially an implementation of a message bus and a master data repository. I think of such a system as “yet another OLTP” system, with all the modeling tricks you would normally apply there. I would use an event based, asynchronous load system instead of a traditional, bulk based ETL for such a scenario. This is very similar to a SOA vision for integration and has traditionally been the space BizTalk tried to play in.

        From the perspective of the warehouse, I would treat this “ODS” system as either:

        1) Non-existent. Use your "system of reference" as the source for ETL loads into staging/archive/EDW. The reason is that I would want to see data "as it originates", not in the intermediate, transitional form held by your "ODS". Since your "system of reference" is designated as the master copy, I would want the untainted version of it, to avoid taking a dependency on any integration errors introduced later.
        2) If the ODS stores history and is guaranteed to be "error free", I would just make a "shortcut" directly from the ODS to the main DW and simply treat it as part of my staging area and archive. This would of course assume that the ODS could guarantee that no data is ever lost (if not, I have to invoke my archive argument for taking over the responsibility of storing non-modeled source data).

        I hope this makes sense. Note that a major part of my choices here rests on the assumption that the data is relatively small in the ODS and only accessed a few rows at a time (this is where 3NF models shine).

      • October 24, 2011 at 19:22

        Thanks Thomas for your detailed answer. You have guessed much about our architecture. I will do some research along these lines of totally separating the "ODS" from the DW environment. I think we are making some mistakes in our approach and appreciate your fresh perspective, which frames things in a way I hadn't seen before. I'll keep following along :-)…


  5. October 24, 2011 at 11:10

    Hi Marco,

    So let me get this straight: the history is kept for ETL purposes, when building the DM which depends on it, and to be able to reproduce the DM from scratch, in the exact same way it was before, but NOT to allow analysis directly on multiple (historical) versions of a given entity. Is that the case?

    \Rafi

    • October 24, 2011 at 11:24

      Exactly – in the SQLBI Methodology the DWH exists just to feed DMs, and users usually query the DMs (through SQL, OLAP cubes, BO, QlikView, PowerPivot, whatever).
      Thus, the data modeling of the DWH doesn't have to be easy and/or fast for queries generated by the end user.

      • October 24, 2011 at 17:06

        Hi Marco,

        OK, I just finished reading your article, good stuff!

        Would I be wrong to conclude that the Configuration DB that you are referring to in the article is there to store the Map tables (which Thomas has been blogging about before) and staging key data (used to support incremental updates) if needed?

      • October 24, 2011 at 17:20

        The general idea is that the Config database is the storage area for all data that is generated just for ETL/DWH/DM/cube purposes and comes from the outside. Thus, Map tables are good here. But staging key data is probably generated by the ETL process and should be stored in the DWH. Consider that it is common for us to copy some parameters we have in the Config database into corresponding tables in the DWH (depending on their meaning). For example, if you need to query the DWH for a report (without querying a DM – we strongly suggest defining a view in the DWH for that), then the DWH should contain all necessary information to answer the query and should not depend on data in another database (i.e. the Config database). The Config database should be used just by the ETL processes.
        Hope it helps!

        Marco

  6. October 24, 2011 at 17:33

    Hi Marco

    (Couldn’t reply to the right thread, so I’ll start a new one)

    I was told (I haven't yet had the chance to try it myself) that when trying to do an incremental load from a huge staging table of 100+ million rows, it's a good idea to work with staging keys. I think the idea was to keep a history of what is already loaded into the system, so that the ETL will be able to load only the newly added data. Shouldn't this data be stored then in the Config DB?

    I guess that what I'm trying to figure out is how to do minimal loads of only new and changed data, without the need to scan the entire 150+ GB table into memory, when the server's total memory is 128 GB?

    • October 24, 2011 at 17:51

      The Config database should contain only data generated outside of the ETL and should be very small.
      If I understand your scenario correctly, you need to detect changes in a large fact table that doesn't have a timestamp or similar information to detect changed rows. I wouldn't look for a standard way to do that. In order to optimize performance, I would try to keep on the source database some info that helps to extract just the required data, using a lookup on the DWH when you add new rows only to check against duplicates/errors, which shouldn't be the regular case.
      If you have to move all the data into the DWH, then you should use the loaded data as a staging key source – an index would be sufficient to optimize performance, but as I said, I would analyze the scenario on a case by case basis – large data and optimization often require tailored solutions.

      • Thomas Kejser
        October 24, 2011 at 23:25

        Marco,

        Why bother with a config database? Why not store those tables in a schema in the DWH/EDW database?

        Sent from my C64

    • Thomas Kejser
      October 24, 2011 at 23:24

      EDIT: Corrected sentence. Lesson learned: don't blog while jet lagged.

      Scanning a 150GB table should NOT be a big deal on modern hardware, even when it does not fit in memory. It should be done in a matter of minutes (scanning 100GB/min from disk is easy peasy, even on a 2-socket machine).

      As a rule of thumb, I don’t bother with doing incremental load if I expect more than 20% of the destination table to change. But of course, if the source can reliably give me a delta indication I would not throw a gift away 🙂
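      As an example of taking that gift (hypothetical names – assuming the source exposes a reliable ModifiedDate column and we remember the last watermark we loaded):

      -- Watermark-driven delta extract: only rows changed since the last load.
      DECLARE @LastLoaded datetime2 = '2011-10-20';   -- last watermark we recorded

      SELECT s.*
      FROM   src.Sales AS s
      WHERE  s.ModifiedDate > @LastLoaded;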

      Sent from my C64

      • October 25, 2011 at 10:01

        I see that with your Commodore 64 comes a perfect world as well? 😉

        The 150GB table comes from way-too-normalized DB2 and SaaS databases, which are batch copied to a local SQL Server database. The 150GB table is to be joined with multiple other, rather big, tables.

        Today the process of rebuilding the relational Inmon style DWH takes 8 to 12 hours from scratch, and every week we delete everything and start anew. Thus we are unable to track history changes, keep data on customers who left, etc. Nor are we able to support partial updates of only some of the source data – not too fancy if you ask me.

        The current design of the DWH needs to be changed to a model that will support self-service BI and the usage of OLAP cubes and/or Tabular. All respect to Inmon, but in his model one must have some decent SQL query skills to build reports on the fly.

        Now that the first version of the Kimball model will be done within a few days, I believe, we will need to look at doing ETL from the 180 tables down to, say, 10-20 max! That will probably mean that for building the DimMember table we will need to touch many tables in one query.

      • Thomas Kejser
        October 25, 2011 at 18:23

        Commodore 64 – 32K RAM, what more do you need? 🙂

        How large is that warehouse you take 8-12 hours to rebuild? Hopefully a LOT larger than 150GB? Or perhaps you are running on a small server?

        One of my key points is exactly that Inmon is a waste of time, just go directly to Kimball style. That is what the users want anyway, and there really aren't many good reasons to store in 3NF (though I am sure Marco has some input on storing dimensions as 3NF).

      • October 25, 2011 at 18:25

        I store entities in DWH, not dimensions. And entities are normalized, denormalization is just for historical attributes. We don’t follow 3NF in DWH, even if I understand we don’t have a concise definition of the normalization level we use.

  7. October 26, 2011 at 08:12

    Indeed a long discussion; I thank you both for sticking around and taking the time to answer.

    The source database, which is mirrored using batched bulk loads of files with BCP, is 2.27 TB in size. The "Inmon style" DWH consists of 2 databases, which together are 700 GB. So it takes 8-12 hours to produce 700 GB of data out of 2.27 TB.

    The design of the DWH is not very smart, as column types are way too wide for what they need to hold. Not NVARCHAR(max) on all text columns, but e.g. decimal(18) to store a month number with the known range of 1-12, and so on, so I would reckon that the true size of the end DWH is much less, maybe 200 GB. Now that tells me that not all data from the source is needed, and with the HW we have we should be able to do incremental loads with much less effort, not to talk about the other gains we will get by not reproducing the DWH from scratch every week (maybe every day?!).

    Regarding maps, Thomas, why use a separate table for maps?
    -Is it performance? I can create a non-clustered index on the Key and the AlternateKey and get an automated virtual table that will look and perform like a map table, without the overhead of maintaining two tables.

    I know that alternate keys are useful for the user in the cube (as member properties), so aren't maps just another name for normalization?

    \Rafi

    • Thomas Kejser
      October 27, 2011 at 13:00

      Hi Rafi

      Discussions are what this blog is all about, so keep it coming 🙂

      700GB of data produced in that time seems like a more reasonable number. Depends on what you do with the data of course. You highlight a big problem with many Inmon style warehouses, namely that they model and store more than users need, and this directly translates into longer batch windows and more hardware cost. This is one of the big reasons that I advocate against modeling the source just for the sake of it.

      With regards to maps: no, it is not because of performance. You are correct to observe that they are a normalized extension of the entity key. However, the reason I don't keep them in the dimension itself is:

      1) It would not allow me to handle Dolly keys (see my previous blog)
      2) It would not allow me to detect zombie keys
      3) It would not allow me to handle the situation where two keys point to the same entity
      4) It allows me to have multiple source key sets, each mapping to the same entity key set.

      While I am certainly no proponent of databases with too many tables, I do question the actual cost of “maintaining multiple tables”. As long as you have a good structure of the database, having many tables is not necessarily an overhead. Maps are a necessary normalization in this case, since a single table would not be able to handle the above situation while at the same time having only one row per entity.
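      A minimal sketch of such a map (hypothetical names): several source keys can point to the same entity key, which a single dimension table with one row per entity could not express.

      -- Maps every source system key to the entity key used in the warehouse.
      CREATE TABLE Map_Customer (
          SourceSystem      nvarchar(20) NOT NULL,
          SourceCustomerKey nvarchar(50) NOT NULL,
          CustomerEntityKey int          NOT NULL,
          CONSTRAINT PK_Map_Customer
              PRIMARY KEY (SourceSystem, SourceCustomerKey)
      );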

      Hope that clarifies…

  8. JC
    April 21, 2012 at 10:52

    Hi Thomas,
    What is your opinion of the DataVault modeling technique? It seems that it can address the questions in this thread. I would like to hear your thoughts. Love the blog, it's a gem of a find.
    JC

    • Thomas Kejser
      April 23, 2012 at 10:53

      My opinion on Data Vault? 🙂

      Basically, I think it is a poorly founded academic exercise in search of a non-existent problem. While it does indeed address the "key stability" issue I describe, it has several drawbacks.

      First of all, the focus on "super normalization" and the arguments Dan L uses to support it are a recipe for extremely poor performance. The fact of the matter is that optimizers, no matter which engine you use, have a time complexity of O(n!) in finding the "best plan", with n being the number of tables in the join. Highly normalized models make that search space enormous. In addition, joins are VERY expensive, even in the best engines, and getting rid of them BEFORE you land the data in the data model is generally a good idea for data warehouses.

      Second: the optimizer complexity problem also has a human face. A much larger issue is the enormous work required to get data OUT of the Vault. It is indeed very easy to get data INTO the Vault, but that is the simple part; a data model needs to support OUTPUT of data efficiently, as that is where the value is.

      Third: the Vault terminology is highly confusing, inventing new words for data modeling techniques that already have a perfectly good vocabulary to support them. I don't think the strict modeling technique (links, hubs, satellites) shines any light on the problem of history tracking. Standardising the model does NOT solve a problem.

      I keep running into people who make this "Data Vault addresses this problem" claim, but so far, the only thing I have seen Data Vault do is to RENAME the problem of history tracking. The published material I have seen makes no good suggestions on how to actually SOLVE it and the hard tradeoffs it involves.

      The excessive use of time dependent tables (forcing BETWEEN joins) in the Vault is also highly problematic (I am actually working on a blog that describes the problem). There is a very good reason that Kimball suggests using type 2 dimensions and forcing the history tracking to be part of the fact join on a single key.
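      To sketch the difference with hypothetical table names: the type 2 approach pays the history resolution once during ETL and leaves the query with a plain equi-join, whereas time dependent tables drag a range predicate into every join.

      -- Type 2 dimension, history resolved at load time: equi-join on one key.
      SELECT SUM(f.Amount)
      FROM   FactSales AS f
      JOIN   DimCustomer AS d ON d.CustomerKey = f.CustomerKey;

      -- Time dependent ("valid from/to") table: every join carries a BETWEEN.
      SELECT SUM(f.Amount)
      FROM   FactSales AS f
      JOIN   CustomerHistory AS h
             ON  h.CustomerId = f.CustomerId
             AND f.SaleDate BETWEEN h.ValidFrom AND h.ValidTo;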

      All this being said, history tracking IS a real problem and it needs to be dealt with properly. Treating it requires a lot MORE than standardising the modeling techniques and a lot LESS than reinventing new terminology for old terms. The actual, physical database tradeoffs (for example between UPDATE and INSERT statements) have to be made, as the argument CANNOT be kept purely at a "logical" level, whatever this term "logical" model means in the case of warehouses. Of course, this is something I have promised I will set out to do (and I am getting there with these blog entries). Stay tuned, the next major blog entry will be about BETWEEN joins and why they hurt you.

      • JC
        April 24, 2012 at 23:07

        I appreciate your thoughtful and timely response (after realizing the age of the prior threads). Perfect.

        Please keep up the great blog. It’s priceless.

  9. Thomas Kejser
    April 25, 2012 at 22:38

    JC: All threads remain open for discussion and I am monitoring them. So please feel free to ask questions anywhere on my blog.

  10. ligi
    August 8, 2012 at 16:44

    Hi Thomas,

    sorry for the late comment, unrelated to the thread above, but I thought I should share my understanding. The design challenge coming from adding an education value to your customer table is the many-to-many relationship that you are trying to model in one table, not the use of surrogate keys. Trying to add independent multi-valued "facts" about the customer leads to the complexity in updating the table. If we were to minimize the number of fields relating to one surrogate key and snowflake a separate Education dimension, the uniqueness of surrogate keys would not be violated and the update policy would be easier. Thus you would be going to 4th normal form or higher, if needed. A good article about normalization: http://www.bkent.net/Doc/simple5.htm#label3.3. Not even Kimball denies the use of normalization, as one can see from the latest Design Tip #148, Complementing 3NF EDWs with Dimensional Presentation Areas.

    Cheers!
    Li

    • Thomas Kejser
      August 8, 2012 at 17:03

      Hi Li

      Thanks for the comment. I believe you are referring to my blog about the handling of m-m attributes in the example. I agree that a significant complexity is added here. But the question is: what is the alternative? Would you rather pay the performance penalty at query time? And this is my main theme in this blog: pay the price of de-normalisation up front, during ETL; don't punish the users with a normalised model and expensive joins.

      It is indeed true that Kimball does not deny normalisation, and I am familiar with the design tip. But not denying something and thinking it is a good idea are two very different things. I don't deny normalised warehouses are out there, I just think they are a really poor idea in most cases. And to quote the design tip, it looks like Kimball agrees with my observation:

      “The 3NF data structures are difficult to understand, the required queries tend to be very complex and difficult to develop, and the query response times can be abysmal.”

      The above is pretty far removed from endorsing the idea. And as the design tip goes on to show, materialisation is typically required to work around 3NF models and have them deliver data at a sufficient speed to users.

      Admittedly, there are cases where making things physical are hard and where staying in a normal form makes the problem simpler – as my example shows. But will you base your entire modeling technique on those cases?

  11. ligi
    August 10, 2012 at 22:16

    Hi Thomas,

    I could not agree more that you should move the overhead load of the joins to the ETL! If you maintain a 3/4/5NF layer, though, you eliminate the need for complicated ETL loads with an unclear update policy, while maintaining a clear logical relationship among entities. Then you can do the expensive joins over the 3/4/5NF layer in your ETL to the dimensional structure and still maintain much more complex dimensions, which will appeal to the business. As my boss (a very wise applications architect) said today on another occasion – many unnecessary sacrifices have been made in the name of performance, and if I may add – coherence of your data should not be one of them.

    • Thomas Kejser
      August 12, 2012 at 13:23

      Hi Li

      You make a good point about the 3-5NF of the dimension structures. There are clearly cases where it is easier to maintain the source data of dimensions in a normalised form. As you can see in some of my previous posts, I actually say that this is a valid alternative. Master data repositories do exactly this, though with somewhat mixed success. When tables are small (i.e. dimensions), any read performance hit you take from normalisation is pretty much irrelevant. The same is true if what you are trying to do is read only a few rows at a time (i.e. OLTP). All this being said, when it comes to fact tables and the other large structures in the warehouse, normalisation becomes much more challenging. At the scale of fact tables, co-location is suddenly very important.

      With regards to coherence of data: While normalisation can help you here, it is no guarantee of coherent data. In fact, a lot of the data you might consider perfectly “coherent” in a data warehouse may not make any sense in a normalised shape. As an example, let us consider a specific customer record that has poor data quality. Assume that we know which country the customer lives in, but we may not know which city in that country the customer lives in. If you normalised customer/cities/countries into three tables like this:

      Customer -> City -> Country

      You would then have to ask: which city should we assign to the customer, given that we know the country but not the city? How should the foreign keys map? If you denormalise customer into a single dimension table, you do not have this problem; the solution is immediately obvious (assign NULL to the city, write something in the country column).
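      As a sketch (hypothetical names), the flattened dimension states the partial knowledge directly, which the normalised Customer -> City -> Country chain cannot express without a dummy city row or an extra direct link to country:

      -- Denormalised customer dimension: country known, city unknown.
      CREATE TABLE DimCustomer (
          CustomerKey  int NOT NULL PRIMARY KEY,
          CustomerName nvarchar(100) NOT NULL,
          City         nvarchar(100) NULL,       -- NULL: city not known
          Country      nvarchar(100) NOT NULL    -- still populated
      );

      INSERT INTO DimCustomer (CustomerKey, CustomerName, City, Country)
      VALUES (42, 'Example customer', NULL, 'Denmark');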

      Coherence in a warehouse is fundamentally about ETL, not about the data model. When you bring disparate, temporally different data sources together in a single place, normalisation is no longer a guarantee that things will behave the right way.

