Archive

Archive for the ‘Analysis Services’ Category

Modeling Dimensions with History Tracked, Generic Attributes

July 6, 2012 5 comments

Sometimes, you need to model a database in such a way that you can dynamically extend the model without altering any tables. Perhaps the attributes change faster than you can add new columns, or the data you store has a “ragged” structure that does not lend itself well to being described as a traditional table structure.

The typical approach taken in these cases is to “pivot” the required “flexible columns” of a table into rows instead and dynamically reconstruct the schema at runtime. When you model like this, adding new attributes to the model is simply a question of inserting rows into the database.

As I am sure you are aware, there are issues with this approach. Turning columns into rows can cause interesting problems for database engines. The flexibility requirement and the performance loss have to be balanced carefully. In this blog, I will walk you through an example of how to handle this generic case in a star schema.

Read more…

Maestro web page online

November 1, 2011 3 comments

The SSAS Maestro training now has a home at the Microsoft domain. You can find it here: http://www.microsoft.com/learning/en/us/certification/ssas-maestros.aspx

The Analysis Services 2008R2 Performance Guide is Online

October 10, 2011 2 comments

I am happy to announce that, for the first time since Analysis Services 2000, companion volumes for performance tuners and operations people are again a reality.

As a developer, you can learn about building, tuning and troubleshooting Analysis Services 2005, 2008 and 2008R2 cubes (yes, the guides cover all three versions) in the Performance Guide.

If you are the DBA or operations team, you can read about running such cubes in production in the Operations Guide.

And finally, if you are a consultant, expert developer or cube DBA, you can learn how to build the meanest and largest cubes from the 5 day Analysis Services Maestro Course.

These three artifacts, which I am proud to have contributed to (and for the guides, leading the effort on), are a big milestone in my Analysis Services career. They represent a large amount of knowledge transfer from Microsoft to the field. The publication of the companion volumes also marks my transition into some new and exciting projects, at least for the near future. I will be busy digging into more “grade of the steel” work, among it ROLAP UDM testing, which I hope to blog about over at SQLCAT.com very soon.

Thanks to everyone for the incredible feedback during the writing of these guides. The Analysis Services community is very vibrant, and these guides are the result of our collaboration in the field. I am happy to give something back.

New blog about Analysis Services

August 22, 2011 1 comment

This week, I welcome friend, coffee connoisseur, large cube runner and coder Pete Adshead to the blogosphere.

Pete is blogging at: http://peteadshead.wordpress.com/.  His first blog is about managing SSAS through AMO – and the “interesting” coding pattern you have to adopt.

Have fun reading

Analysis Services Performance Guide in Final Draft

August 14, 2011 Leave a comment

To the Analysis Services crowd out there: Last week I finished the last section of the Analysis Services 2008R2 Performance Guide. Furthermore, I reviewed my co-author Denny Lee’s sections. Denny is currently polishing off a few minor details, and we expect to send the draft to our editor early the coming week. I would like to take the opportunity to give you some background on what happens to a whitepaper before you can read it on MSDN.

Read more…

The Analysis Services 2008R2 Operations Guide is Online

June 1, 2011 10 comments

It is my pleasure to announce that the Operations Guide for SQL Server Analysis Services 2008R2 (and also 2005 and 2008) is now available on MSDN. It was written by Denny Lee, John Sirmon (our new SSAS CAT member) and yours truly.

The guide describes how to configure, test and operate Analysis Services installations in a production environment. It is more than 100 pages of good information, with contributions from a long list of MVPs, SSAS specialists and the product group.

Here it is: The Analysis Services 2008R2 Operations Guide

It was a pleasure working with you all to get this out there.

Analysis Services Operations Guide in Draft Review

May 4, 2011 Leave a comment

The SQL Server Analysis Services Operations Guide is currently in draft review in NDA form inside Microsoft and with some partners. This means that we will soon be able to publish this long awaited document. I will let you know on this blog when it is out.

Analysis Services Maestro Training–Round 2

May 3, 2011 1 comment

It is official, SQLCAT will be doing another round of the Maestro Training. This is Tech Level 500 content to prepare partners, Microsoft staff and key customers for building the largest and most scalable Analysis Services cubes.

The course is 5 days long (extended from 3 days in the old course). There is an exam that you have to pass, and seating is very limited – only a few will get in. I will let Vidas Matelis give you an opinion from someone who was there in round 1.

You can nominate yourself here:

The Madrid course will be taught by yours truly.

IO Complexity of the Processing Aggregation Phase

December 18, 2006 1 comment

In my last post I left out the IO complexity of the aggregation phase. My memory of external sorting has become a bit rusty, and I once again needed to pull Knuth off my bookshelf.

First of all – if you have not yet tried running perfmon while processing, this would be a good time. It is worth noticing that the aggregation phase does not consume much IO. Even on a less than well tuned IO subsystem, the aggregation phase is typically CPU bound. For those of you only interested in optimizing the speed of the aggregation phase – take another look at your attribute relationships and properties (you can use some of my previous posts as inspiration).

From my mail exchange with Eric Jacobsen I gather that the external sort used by Analysis Services is a variant of balanced merge. The basic idea of the algorithm is (see the sketch after the list):

  1. Read rows from the input (in this case the nd rows from the read phase) until memory is full
  2. Sort the records in memory (for example using quicksort)
  3. Write the sorted records to a temp file on disk
  4. Are there still rows left in the input? If so – go to 1
  5. Read all the temp files and merge them into a new, sorted file
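
To make the steps concrete, here is a minimal Python sketch of a balanced-merge style external sort. It is purely illustrative – the function names, the memory_limit parameter and the use of pickle and temp files are my own assumptions, not how Analysis Services stores its intermediate runs.

import heapq, pickle, tempfile

def external_sort(rows, memory_limit=100_000):
    runs = []
    buffer = []
    for row in rows:                                   # step 1: read until "memory" is full
        buffer.append(row)
        if len(buffer) >= memory_limit:
            runs.append(write_run(sorted(buffer)))     # steps 2-3: sort in memory, spill to disk
            buffer = []
    if buffer:                                         # step 4: repeat until the input is exhausted
        runs.append(write_run(sorted(buffer)))
    return heapq.merge(*(read_run(f) for f in runs))   # step 5: merge all sorted runs

def write_run(sorted_rows):
    f = tempfile.TemporaryFile()
    pickle.dump(sorted_rows, f)
    f.seek(0)
    return f

def read_run(f):
    # A real implementation would stream the run; loading it whole keeps the sketch short.
    yield from pickle.load(f)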

Steps 1-3 require nd reads and nd writes. Steps 4-5 require the same number of reads and writes – furthermore, there is a small lg(q) CPU overhead for merging, with q being the number of temp files.

From the algorithm above we can also deduce that we can save a significant amount of IO if we have plenty of memory for the sort in step 2. Not surprising – but it is worth noticing that the record sizes in Analysis Services tend to be very big. 64-bit memory spaces clearly have a big advantage in the aggregation phase.

Summarizing, we have:

IOREAD(Aggregation phase) = O(2 nd)

IOWRITE(Aggregation phase) = O(2 nd)

Time complexity of processing in Analysis Services

December 15, 2006 Leave a comment

When I build enterprise scale solutions it is crucial for me to know what sort of scalability can be expected from a server product. Is it linear? Is it exponential? Should I scale out or scale up? Are there hard limits to what the architecture can do?

To understand these limitations and provide valid statements about scalability – you have to know the time complexity of the operations performed by your implementation.

A major challenge for large scale Analysis Services installations is to reduce the processing time for a cube. If we are to understand what constitutes cube processing, we must first look at how processing is done. Processing proceeds in three phases:

  1. Read phase – Rows are read from the data source and into Analysis Services.
  2. Aggregation phase – Based on the rows read – aggregations are built.
  3. Index build phase – The aggregations are indexed to provide fast query access

I need to get a deeper understanding of the index phase before embarrassing myself completely (as opposed to partially) in my blog.

For now – let’s look at the first two phases in more detail:

Read Phase

The purpose of the read phase is – as the name implies – to read the rows from the relational data source. The rows read form what I call the "base aggregations" – i.e. the leaf level (or key level if you like) data of the cube. It is these base aggregations that are used to answer queries at leaf level. The base aggregations also form the "base" of the higher level aggregates in the next phase.

Each row from the relational source is read once. Assuming there are n rows in the source, we find that the number of IO reads must be proportional to:

IOREAD(Read Phase) = O(n)

Each row takes about the same amount of CPU time to read. Hence, our CPU usage is:

CPU(Read Phase) = O(n)

How about the writes? One could assume O(n) – but this would not be completely true:

As you may be aware – a relational fact table can contain "duplicate" data at the leaf level. Let me provide you with a few examples:

Example 1: I go into a supermarket today and buy a bottle of coke. Being a caffeine junkie – I later visit the same store, buying yet another bottle of coke. This will generate two rows in the Point Of Sales system – one for each bottle. Depending on your warehouse design – you may see both of these rows in your relational source – hence observing a "duplicate" row. The interesting data from the cube perspective (assuming no drillthrough) is one row with 2 bottles of coke – not two rows with 1 each.

Example 2: You have a relational fact table with 5 dimension columns. However, for your Analysis Services cube you only want to use 4 of these dimensions. When you read the fact table you will almost certainly read duplicate leaf level data. Why? Compare the two statements:

(1) SELECT DISTINCT D1, D2, D3, D4 , D5 FROM <FactTable>

(2) SELECT DISTINCT D1, D2, D3, D4 FROM <FactTable>

The row count of (2) is most likely smaller than (1) – why else would you even bother with D5 in the first place?

Because these two examples happen a lot in the OLAP world, Analysis Services implements a "merge buffer". While reading rows from the relational source, this buffer is used to merge such duplicate rows into single rows – thus reducing the storage space needed for the leaf level aggregates.

The merge buffer will not eliminate all duplicates. It has a limited size and will thus only operate on a subset of the fact table at a time. In the ideal world – where the fact rows enter Analysis Services in sorted order and the merge buffer is large enough – we will eliminate all duplicates.
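
A rough Python sketch of the idea (my own simplification – the key/measure layout, the buffer limit and the flush policy are assumptions, not the actual Analysis Services implementation):

# Bounded "merge buffer": rows with the same dimension key are collapsed into one
# row by summing their measures. When the buffer is full it is flushed, so
# duplicates that arrive far apart in the input may survive.
def merge_rows(rows, buffer_limit=10_000):
    buffer = {}
    for dim_key, measures in rows:        # dim_key: tuple of dimension members
        if dim_key in buffer:
            buffer[dim_key] = tuple(a + b for a, b in zip(buffer[dim_key], measures))
        elif len(buffer) < buffer_limit:
            buffer[dim_key] = measures
        else:
            yield from buffer.items()     # buffer full – flush and start over
            buffer = {dim_key: measures}
    yield from buffer.items()

# Example 1 above: two (("coke", "store X", "today"), (1,)) rows come out as one
# row with a measure of 2 – the "pre aggregation" described below.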

This brings us back to the number of IO operations needed to write the base aggregations. Let nd be the distinct count of the dimension columns we use in the cube. Assuming an optimal reading of the fact table, we have:

IOWRITE(Read Phase) = O(nd)

The reduction of the writes compared with the reads can be quite significant for the later phases of processing. Essentially we are seeing a kind of "pre aggregation" here. I like to think of nd as the size of the "non empty cube space".

Also, notice that there is an upper limit to nd that is unaffected by your fact table size. Assuming your cube dimensions D1, … , Dn have the leaf level sizes |D1|, … , |Dn|, the upper limit for nd is:

nd <= |D1| * |D2| * … * |Dn|

For cubes with a small number of dimensions this upper limit can work to your advantage – as we shall see later. For example, a cube with three dimensions of 1,000, 100 and 20 members can never produce more than 2,000,000 distinct leaf rows, no matter how many rows the fact table contains.

Aggregation Phase

Once Analysis Services has stored the data it read, it must build the higher level aggregates.

If you have ever tried managing your own aggregate tables in a relational database, you will know that there are two ways to build an aggregate:

  1. Directly from the leaf level (in a relational database this would be the fact table)
  2. Derive the aggregate from another aggregate.

 This is where things get a little complicated – but I will give it a shot. Let’s start with the first case – leaf level aggregation building:

Every attribute in your cube is a potential candidate for an aggregation. You control which ones you want to aggregate by a combination of:

  • Setting the attribute properties (see my previous post)
  • Running the storage design wizard. This will create some "statistically correct choices"
  • Using a custom aggregation tool to manually specify which aggregates you want

For the moment – let’s keep attribute relationships out of the equation. Let’s just assume you have a cube with aggregates designed, one way or the other, for a subset, AAgg, of your attributes.

During the aggregation phase Analysis Services has to read your leaf level rows (remember, there were nd of these). Reading these rows takes:

IOREAD(Read Leaf level data) = O(nd)

From my conversations with Eric Jacobsen I gather that the aggregation algorithm in AS 2005 is an external disk sort (as compared to the other common alternative – a hash/bucket sort). What does this tell us? Well, for those of you who don’t already own it – get a copy of Donald E. Knuth’s masterpiece: "The Art of Computer Programming". From this work we can look up the complexity of different sort algorithms and how to implement them. A fair assumption is that the external sort uses quicksort on its memory buffer, with time complexity O(n lg(n)). Each attribute that is aggregated must be sorted – so we have:

CPU(Aggregation Phase) = O( AAgg * nd * lg(nd))
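
As a back-of-the-envelope illustration of what this formula means in practice, here is a small Python calculation. The numbers are purely hypothetical and only the relative cost is meaningful:

import math

nd = 50_000_000              # hypothetical distinct leaf rows after the read phase
aggregated_attributes = 30   # hypothetical size of AAgg

# Relative cost according to O(AAgg * nd * lg(nd)) – useful only for comparing
# designs against each other, not for predicting wall clock time.
cost = aggregated_attributes * nd * math.log2(nd)
print(f"relative aggregation cost: {cost:.3e}")

# Dropping half of the aggregated attributes halves the cost; halving nd (for
# example through better merging in the read phase) reduces it by slightly more
# than half because of the lg(nd) factor.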

Aggregation phase – summary so far

Time complexity: Not surprisingly – CPU usage in the aggregation phase is proportional to the number of attributes you are aggregating. Notice, however, that it is not proportional to the number of rows in your relational database. It is proportional to the DISTINCT count of dimension values (assuming your merging in the read phase is optimal).

Possible optimization (untested): This is pure speculation – but there may be some benefit in sorting your fact table with a clustered index to optimize the "pre aggregation" in the read phase.

Cubes with a small number of dimensions: If your cube has a very small number of dimensions with relatively few members, you may be able to leverage the fact that nd has an upper limit. This is pure theory – but it seems that some cubes could always be fully aggregated in reasonable time – irrespective of the number of source rows in the underlying relational table.

 

It’s time to post this first part. Next part: Attribute relationships and how they are used to build higher level aggregates.