Archive

Archive for the ‘Modeling’ Category

The Curse of Self-Service

August 18, 2013 17 comments

These days, we seem to be high on data and data-related trends. My opinion on Big Data should be well known to my readers: it has to be carefully managed, and it is largely a fad for all but a select few companies.

With data being the new black, similar trends grab the attention of modern managers. One of these is Self-Service. It seems like such a logical consequence of our advanced data visualisation: democratise the data.

It’s worth noting that the notion of humans making better decisions when well served with information is rather old. Thomas Jefferson said: “whenever the people are well-informed, they can be trusted with their own government”. But what exactly does it mean to be well-informed? Another great statesman, Churchill, said: “The best argument against democracy is a five-minute conversation with the average voter”.

In this blog entry, I will argue that it does not follow that humans will make better decisions if we just give them access to more data. In fact, allowing people to self-service their data can be outright harmful.

Read more…

Myth Busting Query Optimisation

July 1, 2013 8 comments

It’s been some time since I posted here, but recent customer events prompted me to write up some notes I have been taking over the past few years.

The problem I would like to talk to you about is one that is well examined and well understood, yet continues to create frustrations for customers. Unfortunately, some of these frustrations come from “best practices” that people apply consistently, and which in turn consistently get them into trouble.

When you are done reading this post, I hope you will have a better understanding of how to avoid these pitfalls. It is also my hope that some of your expectations about query optimisation will have been adjusted.

Read more…

What is the Best Sort Order for a Column Store?

July 27, 2012 6 comments

As hinted at in my post about how column stores work, the compression you achieve depends on how many repetitions (or “runs”) exist in the data for the sort order used when the index is built. In this post, I will provide more background on the effect of sort order on compression and give you some heuristics you can use to save space and increase the speed of column stores. This is a theory I am still developing, but the early results are promising.
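
To make the heuristic concrete, here is a minimal sketch of how you can measure what a candidate sort order does to a column: count the runs it produces, since fewer runs mean better RLE compression. The table and column names (FactSales, Colour, SortKey) are hypothetical, and LAG assumes an engine with window function support:

-- Count the runs that sorting by SortKey produces in the Colour column
SELECT COUNT(*) AS RunCount
FROM (
    SELECT CASE
             WHEN Colour = LAG(Colour) OVER (ORDER BY SortKey) THEN 0
             ELSE 1 -- a new run starts when the value changes (and on the first row)
           END AS IsRunStart
    FROM FactSales
) AS r
WHERE IsRunStart = 1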

Read more…

Big/Big Table Joins

July 19, 2012 7 comments

With the popularity of my last blog entry on Dangerous Joins, I felt inspired to write a bit more about join strategies. Thanks for participating and reading – there seems to be a large appetite for Data Modeling out there; this blog now has over 7K unique visits every month.

Today, let us look at strategies for joining two big tables together.

Read more…

How Vertical Partitioning and Deep Joins Kill Parallelism

July 16, 2012 17 comments

Query plans with deep join trees are often the result of high levels of normalisation in the data model. There are large advantages to normalising data, as it minimises the amount of data that must be written when a change happens. In traditional OLTP systems, this can be a boon.

However, normalisation is not without its costs – especially not in read-intensive workloads like data warehouses.
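
To see why reads suffer, consider a minimal, hypothetical snowflaked model (all names here are made up for illustration): every extra level of normalisation adds another join to the plan, and it is these deep join trees that hurt parallelism.

-- Three joins just to filter sales by category; a denormalised star schema
-- would need only one join to a flattened DimProduct
SELECT c.CategoryName
     , SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
INNER JOIN DimProduct p ON p.ProductKey = f.ProductKey
INNER JOIN DimSubCategory s ON s.SubCategoryKey = p.SubCategoryKey
INNER JOIN DimCategory c ON c.CategoryKey = s.CategoryKey
WHERE c.CategoryName = 'Bikes'
GROUP BY c.CategoryName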

Read more…

Modeling Dimensions with History Tracked, Generic Attributes

July 6, 2012 5 comments

Sometimes, you need to model a database in such a way that you can dynamically extend the model without altering any tables. Perhaps the attributes change faster than you can add new columns, or the data you store has a “ragged” structure that does not lend itself well to being described as a traditional table structure.

The typical approach taken in these cases is to “pivot” the required “flexible columns” of a table into rows instead and dynamically reconstruct the schema at runtime. When you model like this, adding new attributes to the model is simply a question of inserting rows into the database.

As I am sure you are aware, there are issues with this approach. Turning columns into rows can cause interesting problems for database engines, and the flexibility gained must be balanced carefully against the performance lost. In this blog, I will walk you through an example of how to handle this generic case in a star schema.
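
To make the pattern concrete, here is a minimal sketch – the names (DimCustomerAttribute, Segment, Region, @AsOfDate) are hypothetical, not the schema from the example later in the post:

-- A history-tracked, generic attribute table: each attribute is a row
CREATE TABLE DimCustomerAttribute (
  CustomerKey INT NOT NULL
, AttributeName VARCHAR(50) NOT NULL
, AttributeValue VARCHAR(100) NOT NULL
, FromDate DATETIME NOT NULL
, ToDate DATETIME NOT NULL
)

-- Reconstruct a traditional "wide" row at runtime for a point in time
-- by pivoting the attribute rows back into columns
DECLARE @AsOfDate DATETIME = '2012-07-06'

SELECT CustomerKey
     , MAX(CASE WHEN AttributeName = 'Segment' THEN AttributeValue END) AS Segment
     , MAX(CASE WHEN AttributeName = 'Region' THEN AttributeValue END) AS Region
FROM DimCustomerAttribute
WHERE @AsOfDate >= FromDate
  AND @AsOfDate < ToDate
GROUP BY CustomerKey

Adding a new attribute is now just an INSERT, but notice that every attribute you want back as a column costs an extra CASE branch at query time – this is the performance side of the balance.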

Read more…

How do Column Stores Work?

July 4, 2012 9 comments

In this blog, I will provide you with some basic information about column stores. Nothing I am writing here is vendor-specific IP; it is merely taken from papers published over the years. One of the best papers that serves as an introduction is by Stonebraker.

    The idea is older than that though, with the first papers published in the 1970s.

Shamefully standing on the shoulders of giants, I will walk you through a simple example that illustrates one of the key principles of column stores: Run Length Encoding (RLE).
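
As a small preview, and purely as a sketch with made-up names (ColourTable, RowId for the physical row order, Colour), you can derive the RLE representation of a column with the classic “gaps and islands” trick:

-- Compute (value, run length) pairs for a column, in row order
SELECT Colour
     , COUNT(*) AS RunLength
FROM (
    SELECT RowId
         , Colour
         , ROW_NUMBER() OVER (ORDER BY RowId)
         - ROW_NUMBER() OVER (PARTITION BY Colour ORDER BY RowId) AS RunGroup
    FROM ColourTable
) runs
GROUP BY Colour, RunGroup
ORDER BY MIN(RowId)

A column holding A, A, A, B, B, A comes back as (A, 3), (B, 2), (A, 1) – exactly the pairs an RLE-compressed column store would keep instead of the six original values.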

Read more…

The Information Staircase

July 1, 2012 1 comment

With the Big Data wave rolling over us these days, it seems everyone is trying to wrap their heads around how these new components fit into the overall information architecture of the enterprise.

Not only that, there are also organisational challenges around how to staff the systems drinking from the big data stream. We are hearing about new job roles such as “Data Scientist” being coined (the banks have had them for a long time; they call them Quants) and old names like “Data Steward” being brought back.

While thinking of these issues, I have tried to put together a visual representation of the different architecture layers and the roles interacting with them:

[Figure: the Information Staircase – architecture layers and the roles interacting with them]

Read more…

Why “Date BETWEEN FromDate AND ToDate” is a dangerous join criteria

June 29, 2012 10 comments

I have been meaning to write this blog post for some time and the discussion about Data Vault finally prompted me to do it.

Sometimes, you find yourself in situations where you have to join a table that has a structure like this:

CREATE TABLE TemporalTracking (
  SomeKey INT
, FromDate DATETIME
, ToDate DATETIME
, <more columns>
)

The join criterion is expressed as:

FROM <OtherTable> OT
INNER JOIN TemporalTracking T
  ON OT.SomeTimeColumn BETWEEN T.FromDate AND T.ToDate
  AND OT.SomeKey = T.SomeKey

Or more commonly, this variant with a semi-open interval:

FROM <OtherTable> OT
INNER JOIN TemporalTracking T
  ON OT.SomeTimeColumn >= T.FromDate
  AND OT.SomeTimeColumn < T.ToDate
  AND OT.SomeKey = T.SomeKey

Data models that promote these types of joins are very dangerous to relational optimisers, and you have to step carefully when executing queries with many of these joins. Let us have a look at why this is so.
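
Before we dig into the why, here is one common mitigation, sketched with hypothetical names (TemporalTrackingDaily, DateKey) and assuming a daily grain is acceptable: expand each interval to one row per covered day, so the temporal lookup becomes a plain equi-join that the optimiser can estimate accurately.

-- TemporalTrackingDaily holds one row per SomeKey per day in [FromDate, ToDate)
FROM <OtherTable> OT
INNER JOIN TemporalTrackingDaily T
  ON OT.SomeKey = T.SomeKey
  AND T.DateKey = CAST(OT.SomeTimeColumn AS DATE)

The trade-off is storage: the daily table is much larger than the interval table, but every join predicate is now an equality the optimiser knows how to reason about.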

Read more…

The Data Vault vs. Kimball – Round 2

June 26, 2012 Leave a comment

Here we go again, the discussion about the claimed benefits of the Data Vault. Thomas Christensen has written some great blog posts about his take on the Vault method. Dan Linstedt has been commenting.

The discussions are a good read to follow.

Apologies in advance for my spelling errors in the comments; they were written while travelling.

As with most complex subjects, it is often hard to get people to state with clarity what EXACTLY their claim is and what their supporting arguments are. It seems the Vault is no exception – I hope the discussions lead somewhere this time.

For your reference, I have previously written several posts on how to approach the “integration” problem the Vault claims to solve.

Notice that in the comments on Thomas Christensen’s blog, Sanjay makes some very interesting claims about normalisation creating more load and query parallelism. I personally look forward to hearing the argument for that.

And here is the big-picture architecture I am arguing for: