BigData | Thomas Kejser's Database Blog

The Curse of Self-Service

August 18, 2013 17 comments

These days, we seem to be high on data and data related trends. My opinion on Big Data should be well known to my readers: it is something that has to be carefully managed and largely a fad for all but a select few companies.

With data being the new black, similar trends grab the attention of modern managers. One of these is Self Service. It seems like such a logical consequence of our advanced data visualisation: democratise the data.

It’s worth noting that the notion of humans making better decisions when well served with information is rather old. Thomas Jefferson said: “whenever the people are well-informed, they can be trusted with their own government”. But what exactly does it mean to be well-informed? Another great statesman, Churchill, said: “The best argument against democracy is a five-minute conversation with the average voter”.

In this blog entry, I will argue that is does not follow that humans will make better decisions if we just give them access to more data. In fact, allowing people to self-service their data can be outright harmful.

What is the Best Sort Order for a Column Store?

July 27, 2012 6 comments

As hinted at in my post about how column stores work, the compression you achieve will depend on how many repetitions (or “runs”) exist in the data for the sort order that is used when the index is built. In this post, I will provide you more background on the effect of sort order on compression and give you some heuristics you can use to save space and increase speed of column stores. This is a theory that I am still only developing, but the early results are promising.

The Information Staircase

July 1, 2012 1 comment

With the Big Data wave rolling over us these days, it seems everyone is trying to wrap their heads around how these new components fit into the overall information architecture of the enterprise.

Not only that, there are also organisational challenges on how to staff the systems drinking the big data stream. We are hearing about new job roles such as "Data Scientist” being coined (the banks have had them for a long time, they call them Quants) and old names being brought back like “Data Steward”.

While thinking of these issues, I have tried to put together a visual representation of the different architecture layers and the roles interacting with them:

Why “Date BETWEEN FromDate AND ToDate” is a dangerous join criteria

June 29, 2012 10 comments

I have been meaning to write this blog post for some time and the discussion about Data Vault finally prompted me to do it.

Sometimes, you find yourself in situations where you have to join a table that has a structure like this:

CREATE TABLE TemporalTracking (
SomeKey INT
, FromDate DATETIME
, ToDate DATEIME
, <more columns>
)

The join criteria is expressed:

FROM <OtherTable> OT
INNER JOIN TemporalTracking T
ON OT.SomeTimeColumn BETWEEN T.FromDate AND T.ToDate
AND OT.SomeKey = T.SomeKey

Or more commonly, this variant with a semi open interval:

FROM <OtherTable> OT
INNER JOIN TemporalTracking T
ON OT.SomeTimeColumn >= T.FromDate
AND OT.SomeTimeColumn < T.ToDate
AND OT.SomeKey = T.SomeKey

Data models that promote these types of joins are very dangerous to relational optimizers and you have to step carefully when executing queries with many of these joins. Let us have a look at why this is so.

The Data Vault vs. Kimball – Round 2

June 26, 2012 Leave a comment

Here we go again, the discussion about the claimed benefits of the Data Vault. Thomas Christensen has written some great blog posts about his take on the Vault method. Dan Linstedt has been commenting.

The discussions are a good read to track:

Apologies in advance for my spelling errors in the comments, the posts are written while travelling.

As with most complex subjects, it is often hard to have people state with clarity what EXACTLY their claim is and what their supporting arguments are. It seems that the Vault is no exception – I hope the discussions lead somewhere this time.

For your reference, here are some of the posts I have previous done on how to approach the “integration” problem the Vault claims to solve:

Notice that there are some very interesting claims being made about normalization creating more load and query parallelism in the comments on Thomas Christensen’s Blog by Sanjay. I personally look forward to hearing the argument for that.

And here is the big picture architecture I am arguing for:

The Big Picture

Reading Material: Abstractions, Virtualisation and Cloud

May 1, 2012 9 comments

When speaking at conferences, I often get asked questions about virtualization and how fast databases will run on it (and even if they are “supported” on virtualised systems). This is complex question to answer. Because it requires a very deep understanding of CPU caches, memory and I/O systems to fully describe the tradeoffs.

Don’t Become a One-trick Architect

December 8, 2011 12 comments

We are near the dawn of a new workload: BigData. While some people say that “it is always darkest just before the dawn”. I beg to differ: I think it is darkest just before it goes pitch black. Have a cup of wakeup coffee, get your eyes adjusted to the new light, and to flying blind a bit, because the next couple of years are going to be really interesting.

In this post, I will be sharing my views on where we have been and a bit about where we are heading in the enterprise architecture space. I will say in advance that my opinions on BigData are just crystalizing, and it is most likely that I will be adjusting them and changing my mind.

The Big Picture – EDW/DW architecture

August 30, 2011 36 comments

Now that the cat is out of the bag on the Kimball forum, I figured it would be a good idea to present the full architecture that I will be arguing for. I was hoping to build up to it slowly, by establishing each premise on its own before moving on to the conclusion.

But perhaps it is better to start from the conclusion and then work my way down to the premises and show each one in turn.

Microsoft Announces Plans to Introduce Hadoop Interoperability

August 12, 2011 2 comments

For those of you who have not yet seen it, Microsoft recently announced that they will be looking at Hadoop connectivity to the database stack:

Parallel Data Warehouse News and Hadoop Interoperability Plans

Some of you may have wondered why I have not yet mentioned the BigData movement as part of my DW articles. In my defense I will say that this a big trend, something I had to give a lot of thought to position correctly. Before I can talk more about how it fits into the full DW/BI architecture – I have to argue a bit more for my warehousing approach.

I can reveal that I see traditional data warehousing (especially dimensional modeling) and BigData compliment each other in a way that solves some of the common complaints of warehouse builders across the world. I hope you will find my thoughts on BigData fit nicely into the picture I will be painting of the warehousing world going forward in this blog.

Thomas Kejser's Database Blog

Archive

The Curse of Self-Service

What is the Best Sort Order for a Column Store?

The Information Staircase

Why “Date BETWEEN FromDate AND ToDate” is a dangerous join criteria

The Data Vault vs. Kimball – Round 2

Don’t Become a One-trick Architect

The Big Picture – EDW/DW architecture

Microsoft Announces Plans to Introduce Hadoop Interoperability

Categories