Archive

Archive for the ‘Data Warehouse’ Category

The Curse of Self-Service

August 18, 2013 17 comments

These days, we seem to be high on data and data-related trends. My opinion on Big Data should be well known to my readers: it is something that has to be carefully managed, and for all but a select few companies it is largely a fad.

With data being the new black, similar trends grab the attention of modern managers. One of these is Self-Service. It seems like such a logical consequence of our advanced data visualisation: democratise the data.

It’s worth noting that the notion of humans making better decisions when well served with information is rather old. Thomas Jefferson said: “whenever the people are well-informed, they can be trusted with their own government”. But what exactly does it mean to be well-informed? Another great statesman, Churchill, said: “The best argument against democracy is a five-minute conversation with the average voter”.

In this blog entry, I will argue that it does not follow that humans will make better decisions if we just give them access to more data. In fact, allowing people to self-service their data can be outright harmful.

Read more…

What is the Best Sort Order for a Column Store?

July 27, 2012 6 comments

As hinted at in my post about how column stores work, the compression you achieve will depend on how many repetitions (or “runs”) exist in the data for the sort order that is used when the index is built. In this post, I will give you more background on the effect of sort order on compression, along with some heuristics you can use to save space and increase the speed of column stores. This is a theory I am still developing, but the early results are promising.
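To make the effect concrete, here is a minimal sketch of how you could measure it yourself on SQL Server 2012 (the table and index names are my own invention for the example): cluster the base data on a low-cardinality column first, so equal values line up into long runs, then build the column store and compare the footprint against a version clustered on a high-cardinality column.

CREATE TABLE SalesFact (
  CustomerKey INT
, ProductKey INT
, Quantity INT
);

-- Sorting on the low-cardinality column makes equal values adjacent,
-- which produces the long runs that run-length encoding thrives on.
CREATE CLUSTERED INDEX CIX_SalesFact ON SalesFact (CustomerKey);

-- Build the column store over the clustered data.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SalesFact
ON SalesFact (CustomerKey, ProductKey, Quantity);

-- Compare the footprint, then rebuild with the clustered index on a
-- high-cardinality column and measure again.
EXEC sp_spaceused 'SalesFact';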

Read more…

Modeling Dimensions with History Tracked, Generic Attributes

July 6, 2012 5 comments

Sometimes, you need to model a database in such a way that you can dynamically extend the model without altering any tables. Perhaps the attributes change faster than you can add new columns, or the data you store has a “ragged” structure that does not lend itself well to being described as a traditional table structure.

The typical approach taken in these cases is to “pivot” the required “flexible columns” of a table into rows instead and dynamically reconstruct the schema at runtime. When you model like this, adding new attributes to the model is simply a question of inserting rows into the database.

As I am sure you are aware, there are issues with this approach. Turning columns into rows can cause interesting issues for database engines, and the flexibility requirement has to be balanced carefully against the performance loss. In this blog, I will walk you through an example of how to handle this generic case in a star schema.
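To illustrate the pattern, here is a minimal sketch (the table and attribute names are invented for the example, and the history tracking columns are omitted for brevity): the flexible columns live as rows in a key/value table, and the wide shape is reconstructed at query time.

CREATE TABLE DimCustomerAttribute (
  CustomerKey INT NOT NULL
, AttributeName VARCHAR(50) NOT NULL
, AttributeValue VARCHAR(100) NULL
, PRIMARY KEY (CustomerKey, AttributeName)
);

-- Adding a new attribute to the model is just an INSERT, no ALTER TABLE:
INSERT INTO DimCustomerAttribute VALUES (42, 'HairColor', 'Brown');
INSERT INTO DimCustomerAttribute VALUES (42, 'ShoeSize', '44');

-- Reconstructing the schema at runtime by pivoting the rows back
-- into columns:
SELECT CustomerKey
  , MAX(CASE WHEN AttributeName = 'HairColor' THEN AttributeValue END) AS HairColor
  , MAX(CASE WHEN AttributeName = 'ShoeSize' THEN AttributeValue END) AS ShoeSize
FROM DimCustomerAttribute
GROUP BY CustomerKey;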

Read more…

The Information Staircase

July 1, 2012 1 comment

With the Big Data wave rolling over us these days, it seems everyone is trying to wrap their heads around how these new components fit into the overall information architecture of the enterprise.

Not only that, there are also organisational challenges on how to staff the systems drinking the big data stream. We are hearing about new job roles such as “Data Scientist” being coined (the banks have had them for a long time, they call them Quants) and old names being brought back like “Data Steward”.

While thinking about these issues, I have tried to put together a visual representation of the different architecture layers and the roles interacting with them:

[Image: The Information Staircase – architecture layers and the roles interacting with them]

Read more…

Why “Date BETWEEN FromDate AND ToDate” is a dangerous join criteria

June 29, 2012 10 comments

I have been meaning to write this blog post for some time and the discussion about Data Vault finally prompted me to do it.

Sometimes, you find yourself in situations where you have to join a table that has a structure like this:

CREATE TABLE TemporalTracking (
  SomeKey INT
, FromDate DATETIME
, ToDate DATETIME
, <more columns>
)

The join criterion is expressed as:

FROM <OtherTable> OT
INNER JOIN TemporalTracking T
  ON OT.SomeTimeColumn BETWEEN T.FromDate AND T.ToDate
  AND OT.SomeKey = T.SomeKey
Or more commonly, this variant with a semi open interval:

FROM <OtherTable> OT
INNER JOIN TemporalTracking T
  ON OT.SomeTimeColumn >= T.FromDate
  AND OT.SomeTimeColumn < T.ToDate
  AND OT.SomeKey = T.SomeKey

Data models that promote these types of joins are very dangerous to relational optimizers and you have to step carefully when executing queries with many of these joins. Let us have a look at why this is so.
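As a taster, here is a sketch of one common mitigation (my illustration, not necessarily the remedy from the full post): pre-expand the intervals to one row per day during ETL, so the temporal lookup becomes an equality join that the optimizer can estimate accurately. TemporalTrackingByDay and DateKey are hypothetical names mirroring the snippets above:

FROM <OtherTable> OT
INNER JOIN TemporalTrackingByDay T
  ON OT.SomeKey = T.SomeKey
  AND T.DateKey = CAST(OT.SomeTimeColumn AS DATE)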

Read more…

The Data Vault vs. Kimball – Round 2

June 26, 2012 Leave a comment

Here we go again, the discussion about the claimed benefits of the Data Vault. Thomas Christensen has written some great blog posts about his take on the Vault method. Dan Linstedt has been commenting.

The discussions are a good read and worth tracking.

Apologies in advance for my spelling errors in the comments, the posts are written while travelling.

As with most complex subjects, it is often hard to have people state with clarity what EXACTLY their claim is and what their supporting arguments are. It seems that the Vault is no exception – I hope the discussions lead somewhere this time.

For your reference, I have previously written several posts on how to approach the “integration” problem the Vault claims to solve.

Notice that in the comments on Thomas Christensen’s blog, Sanjay makes some very interesting claims about normalization creating more load and query parallelism. I personally look forward to hearing the argument for that.

And here is the big-picture architecture I am arguing for:

What Structure does your Data Have?

May 15, 2012 2 comments

I am currently thinking about measuring compression rates for different column store indexing strategies. In order for me to get some realistic data, I am looking for people who can share a very small anonymised data sample with me.

Read more…

Reading Material: Abstractions, Virtualisation and Cloud

May 1, 2012 9 comments

When speaking at conferences, I often get asked questions about virtualisation and how fast databases will run on it (and even whether they are “supported” on virtualised systems). This is a complex question to answer, because it requires a very deep understanding of CPU caches, memory and I/O systems to fully describe the tradeoffs.

Read more…

Why You Need to Stop Worrying about UPDATE Statements

April 27, 2012 4 comments

There seems to be a myth perpetuated out there in the database community that UPDATE statements are somehow “bad” and should be avoided in data warehouses.

Let us have a look at the facts for a moment and weigh up if this myth has any merit.

Read more…

A DW Venn Diagram

January 28, 2012 3 comments

[Image: A DW Venn Diagram]