Archive

Posts Tagged ‘Performance’

Clustered Indexes vs. Heaps

January 12, 2014 21 comments

At Stack Overflow the other day, I once again found myself trying to debunk a lot of the “revealed wisdom” in the SQL Server community. You can read the discussion in the post: Indexing a PK GUID in SQL Server 2012. However, this post is not about GUIDs or sequential keys, which I have written about elsewhere; it is about clustered indexes and the love affair that SQL Server DBAs seem to have with them.

Read more…

Synchronisation in .NET – Part 4: Partitioned Data Structures

January 5, 2014 5 comments

In this final instalment of the synchronisation series, we will look at fully scalable solutions to the problem first stated in Part 1 – adding monitoring that is scalable and minimally intrusive.

Thus far, we have seen how there is an upper limit on how fast you can access cache lines shared between multiple cores. We have tried different synchronisation primitives to get the best possible scale.

Throughout this series, Henk van der Valk has generously lent me his 4-socket machine and been my trusted lab manager and reviewer. Without his help, this blog series would not have been possible.

And now, as is tradition, we are going to show you how to make this thing scale.
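To give a flavour of the partitioned approach – a simplified sketch of the general idea under my own assumptions, not the implementation from the post – each thread updates its own cache-line-padded slot, and a reader aggregates across all slots:

using System;
using System.Runtime.InteropServices;
using System.Threading;

public class PartitionedCounter
{
    [StructLayout(LayoutKind.Explicit, Size = 64)]
    private struct PaddedLong
    {
        [FieldOffset(0)]
        public long Value;
    }

    // One padded slot per logical core so concurrent writers rarely, if ever,
    // touch the same cache line.
    private readonly PaddedLong[] _partitions = new PaddedLong[Environment.ProcessorCount];

    public void Add(long value)
    {
        // Choosing the slot by managed thread id is only an approximation of
        // per-core partitioning; Interlocked keeps the occasional collision safe.
        var slot = Thread.CurrentThread.ManagedThreadId % _partitions.Length;
        Interlocked.Add(ref _partitions[slot].Value, value);
    }

    public long Read()
    {
        long sum = 0;
        for (var i = 0; i < _partitions.Length; i++)
            sum += Volatile.Read(ref _partitions[i].Value);
        return sum;
    }
}

The write path touches only the calling thread’s slot, so throughput keeps scaling as cores are added; the cost is a slightly more expensive, slightly stale read.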

Read more…

Synchronisation in .NET – Part 3: Spin Locks and Interlocks/Atomics

January 4, 2014 2 comments

In the previous instalments (Part 1 and Part 2) of this series, we have drawn some conclusions about both .NET itself and CPU architectures. Here is what we know so far:

  • When there is contention on a single cache line, the lock() method scales very poorly, and you get negative scale the moment you leave a single CPU core.
  • The scale takes a further dip once you leave a single CPU socket.
  • Even when we remove the lock() and do thread-unsafe operations, scalability is still poor.
  • Going from a class to a padded struct gives a scale boost, though not enough to get linear scale.
  • The maximum theoretical scale we can get with the current technique is around 90K operations/ms.

In this blog entry, I will explore other synchronisation primitives to make the implementation safe again, namely spin locks and Interlocked (atomic) operations. As a reminder, we are still running the test on a 4-socket machine with 8 cores on each socket and hyper-threading enabled (for a total of 16 logical cores on each socket).
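As a rough illustration of the two primitives – my own sketch with hypothetical names, not the code from the post – an Interlocked add and a SpinLock-protected add on the same padded slot look like this:

using System.Runtime.InteropServices;
using System.Threading;

[StructLayout(LayoutKind.Explicit, Size = 64)]
public struct PaddedSlot
{
    [FieldOffset(0)]
    public long Count;
}

public class AtomicCounters
{
    private readonly PaddedSlot[] _slots;
    private SpinLock _spinLock = new SpinLock(false); // no thread owner tracking

    public AtomicCounters(int slots)
    {
        _slots = new PaddedSlot[slots];
    }

    // Lock-free: a single atomic add on the slot's cache line.
    public void AddInterlocked(int slot, long value)
    {
        Interlocked.Add(ref _slots[slot].Count, value);
    }

    // Spin lock: busy-waits instead of blocking. Cheap for very short
    // critical sections, but writers still serialise on the lock word.
    public void AddSpinLocked(int slot, long value)
    {
        var taken = false;
        try
        {
            _spinLock.Enter(ref taken);
            _slots[slot].Count += value;
        }
        finally
        {
            if (taken) _spinLock.Exit();
        }
    }
}

Both variants are safe; the question this part answers is how much of the lock() overhead they actually remove under heavy contention.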

Read more…

Synchronisation in .NET – Part 2: Unsafe Data Structures and Padding

December 27, 2013 9 comments

In the previous blog post we saw how the lock() statement in .NET scales very poorly when there is contention on a data structure. It was clear that a performance logging framework that relies on an array with a lock on each member to store data will not scale.

Today, we will try to quantify just how much performance we should expect to get from the data structure if we somehow solve locking. We will also see how the underlying hardware primitives bubble up through the .NET Framework and break the pretty object-oriented abstraction you might be used to.

Because we have already proven that ConcurrentDictionary adds too much overhead, we will focus on arrays as the backing store for the data structure in all future implementations.
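To illustrate the padding idea from this part – a sketch under my own assumptions, not the article’s exact code – each counter can be placed in a struct padded out to a full 64-byte cache line, so neighbouring array elements never share a line:

using System.Runtime.InteropServices;

// Hypothetical sketch: pad each counter to a 64-byte cache line so that
// adjacent array elements cannot false-share.
[StructLayout(LayoutKind.Explicit, Size = 64)]
public struct PaddedCounter
{
    [FieldOffset(0)]
    public long Count;
    // the remaining 56 bytes are padding implied by Size = 64
}

public class UnsafeCounters
{
    private readonly PaddedCounter[] _counts;

    public UnsafeCounters(int slots)
    {
        _counts = new PaddedCounter[slots];
    }

    // Deliberately thread-unsafe: updates can be lost. The point is only to
    // measure the raw cost of touching the cache lines without any locking.
    public void Add(int slot, long value)
    {
        _counts[slot].Count += value;
    }
}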

Read more…

Synchronisation in .NET – Part 1: lock(), Dictionaries and Arrays

December 25, 2013 10 comments

As part of our tuning efforts at Livedrive, I ran into a deceptively simple problem that beautifully illustrates some of the scale principles I have been teaching to the SQL Server community for years.

In this series of blog entries, I will use what appears to be a simple .NET class to explore how modern CPU architectures handle high speed synchronisation. In the first part of the series, I set the stage and explore the .NET lock() method of coordinating data.
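To make the starting point concrete, here is a minimal sketch of the kind of structure under discussion – my own illustration with hypothetical names, not the exact class from the post: an array of counters where every update takes a lock() on the slot it touches.

using System;
using System.Threading.Tasks;

// Hypothetical sketch: an array of counters where every update
// takes a lock() on the slot it touches.
public class LockedCounters
{
    private readonly long[] _counts;
    private readonly object[] _locks;

    public LockedCounters(int slots)
    {
        _counts = new long[slots];
        _locks = new object[slots];
        for (var i = 0; i < slots; i++) _locks[i] = new object();
    }

    public void Add(int slot, long value)
    {
        // Under contention, every thread fights over the same lock object
        // and the same cache line holding the counter.
        lock (_locks[slot])
        {
            _counts[slot] += value;
        }
    }
}

// Worst case for the benchmark: many threads hammering a single slot.
public static class Driver
{
    public static void Main()
    {
        var counters = new LockedCounters(slots: 16);
        Parallel.For(0, 10000000, i => counters.Add(0, 1));
        Console.WriteLine("done");
    }
}

The series measures how far this pattern scales as more cores and more sockets join in.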

Read more…

Myth Busting Query Optimisation

July 1, 2013 8 comments

It’s been some time since I posted here, but recent customer events prompted me to write up some notes I have been taking over the past few years.

The problem I would like to talk to you about is one that is well examined and well understood, yet it continues to create frustrations for customers. Unfortunately, some of these frustrations come from “best practices” that people apply consistently, which in turn consistently get them into trouble.

When you are done reading this post, I hope you will have a better understanding of how to avoid these pitfalls. It is also my hope that some of your expectations about query optimisation will have been adjusted.

Read more…

Bottleneck Diagnosis on SQL Server – New Scripts

April 11, 2013 21 comments

Finally, I have found some time with my good colleagues at Fusion-io to work on some of my old SQL scripts.

Our first script queries the server for wait stats – a very common task for nearly all SQL Server DBAs. This is the first thing I do when I meet a new SQL Server installation, and I have generally found that I apply the same filters over and over again. Because of this, I worked with Sumeet Bansal to standardise our approach.

You can watch Sumeet introduce the script in this YouTube video: http://youtu.be/zb4FsXYvibY. A TV star is born!

We have already used this script at several customers in a before/after Fusion-io situation. As you may know, the tuning game changes a lot when you remove the I/O bottleneck from the server.

Based on our experiences so far, I wanted to share some more exotic waits and latches we have been seeing lately.

Read more…

Quantifying the Cost of Compression

March 11, 2013 17 comments

Last week, at SQL Saturday Exeter, I did a new Grade of Steel experiment: to quantify just how expensive it is to do PAGE compression.

The presentation is a 45-minute show, but for those of you who were not there, here is a summary of the highlights of the first part.

My tests were all run on the TPC-H LINEITEM table at scale factor 10. That is about 6.5GB of data.

Test: Table Scan of Compressed Data

My initial test is this statement:

SELECT MAX(L_SHIPDATE)
, MAX(L_DISCOUNT)
, MAX(L_EXTENDEDPRICE)
, MAX(L_SUPPKEY)
, MAX(L_QUANTITY)
, MAX(L_RETURNFLAG)
, MAX(L_PARTKEY)
, MAX(L_LINESTATUS)
, MAX(L_TAX)
, MAX(L_COMMITDATE)
, MAX(L_RECEIPTDATE)
FROM LINEITEM

Because the statement only returns one row, the result measured does not drown in client transfer time. Instead, the raw cost of extracting the columns from the compressed format can be quantified.

The result was quite shocking on my home 12-core box:

[Figure: table scan runtimes, compressed vs. uncompressed, reading from disk and from DRAM]

Even when doing I/O, it still takes quite a bit longer to scan the compressed format. And when scanning from DRAM, the cost is a whopping 2x.

A quick xperf run shows where the time goes when scanning from memory:

[Figure: xperf trace showing where CPU time goes during the in-memory scan]

Indeed, the additional CPU cost explains the effect. The code path is simply longer with compression.

Test: Singleton Row Fetch

By sampling some rows from LINEITEM, it is possible to measure the cost of fetching pages in an index seek. This is the test:

SELECT MAX(L_SHIPDATE)
, MAX(L_DISCOUNT)
, MAX(L_EXTENDEDPRICE)
, MAX(L_SUPPKEY)
, MAX(L_QUANTITY)
, MAX(L_RETURNFLAG)
, MAX(L_PARTKEY)
, MAX(L_LINESTATUS)
, MAX(L_TAX)
, MAX(L_COMMITDATE)
, MAX(L_RECEIPTDATE)
FROM LI_SAMPLES S
INNER LOOP JOIN LINEITEM L ON S.L_ORDERKEY = L.L_ORDERKEY
OPTION (MAXDOP 1) 

This gives us the plan:

[Figure: query plan with a loop join seeking into LINEITEM]

This plan has the desired characteristic: the runtime is dominated by the seek into the LINEITEM table.

The numbers again speak for themselves:

[Figure: singleton fetch runtimes, compressed vs. uncompressed]

And again, the xperf trace shows that this runtime difference can be fully explained by the longer code path.

Test: Singleton UPDATE

Using the now familiar pattern, we can run a test that updates the rows instead of selecting them. By updating a column that is NOT NULL and an INT, we can make sure the update happens in place. This means we pay the price to decompress, change and recompress that row – which should be more expensive than reading. And indeed it is:

[Figure: singleton update runtimes, compressed vs. uncompressed]

Summary

Quoting a few of the tests from my presentation, I have shown that PAGE compression carries a very heavy CPU cost for each page access. Of course, not every workload is dominated by accessing data in pages – some are more compute-heavy on the returned data. However, in these days when I/O bottlenecks can easily be removed, it is worth considering whether the extra CPU cycles spent to save space are worth it.

It turns out that it is also possible to show another expected result: locks are held longer when updating compressed pages (thereby limiting scalability if the workload contains heavily contended pages). But that is the subject of a future blog entry.

Configuring Kernel Debugging with WinDbg and a NULL modem

December 5, 2012 6 comments

Lately, I have been digging deep into Windows to get really low level with the I/O path SQL Server takes (yep, there is an even deeper layer to understand fully).

Once you start playing around with the Windows Kernel, you will at some point need kernel level debugging set up. Traditionally, this is something I have used a Windows machine for, and even there, it can be painful to get working. As you may know, I have switched to Mac as my client machine and my Windows utilities (including WinDbg) now run in VMWare Fusion – it looked like I was heading into an interop nightmare…

Windows 8 allows kernel debugging directly over the network card, but how does one configure kernel debugging against a Windows Server 2008 R2 target from a Mac running VMWare?

I found a very cheap solution today. You need:

  • A USB to Serial (RS-232) converter
  • A NULL modem (I would recommend getting a long one, so you don’t have to sit next to the server)
  • A target server that you want to debug
  • A serial port on the target server
  • WinDbg from the Windows SDK in a virtual machine on the Mac
  • Symbols set up as per my previous post

A USB converter and a NULL modem are a dirt-cheap way to get the required hardware for kernel debugging. I sourced my cables from Maplins (thanks, @SQLServerMonkey) in the UK for a total of around 20 GBP.

Step 1: Prepare the Client/Debugger

A MacBook Air, like most other lightweight laptops, does not have a serial port, so we have to use a USB/serial converter. It is possible to debug directly over USB, but good luck with that from a Mac – I didn’t have the courage.

Make sure VMWare routes the USB device to the Windows Guest OS and not the Mac. It will look something like this in VMWare Fusion 5:

[Figure: VMWare Fusion dialog routing the USB device to the Windows guest]

Now, install WinDbg and set up your symbol paths if they are not set up already.

Check your Device Manager in the client to see which COM port the USB device created. As you can see below, my laptop mounted the USB/serial converter as COM3.

[Figure: Device Manager showing the USB/serial converter mounted as COM3]

After installing the device, I had to restart my virtual machine before VMWare would let me mount it – but what can you expect from a 10 GBP component? Your mileage may vary depending on the serial/USB driver you have.

Step 2: Connect Client and Server

Using the NULL modem, connect the USB/serial converter to the server’s serial port.

Make sure the server has the serial port enabled in the BIOS (my Dell box had it disabled, so I had to re-enable it).

Step 3: Configure the Server/Debuggee for Kernel Debugging

Log into the server and start a command prompt as administrator.

First, copy the default startup options into a new boot option:

  • BCDEDIT /copy {current} /d DebugMode

This will create a new entry in the boot list when the server starts. Make a note of the GUID returned or copy it to the clipboard (using the ever-so-annoying copy/paste function in the Windows command prompt).

Next, configure the parameters for serial cable debugging:

  • BCDEDIT /set <GUID> debugport <port #>
  • BCDEDIT /set <GUID> debugtype serial
  • BCDEDIT /set <GUID> baudrate 115200

Replace <GUID> with the GUID returned previously. Set <port #> to the COM port the NULL modem is connected to on your server (NOT the COM port of the client). For example, if the server has the NULL modem in COM2, set <port #> to 2.

Finally, enable debugging on the newly created boot option:

  • BCDEDIT /debug <GUID> ON

Validate that your configuration works by running:

  • BCDEDIT /v

You should get an output somewhat like this (this is also how you find the GUID if you didn’t note it down before):

[Figure: BCDEDIT /v output listing the DebugMode boot entry]

If you want to make sure debugging is always turned on (great for a sandbox machine where you explore stuff in the kernel), you can use BCDEDIT /default <GUID> to make the debug configuration the default startup option.

Step 4: Start Debugging

You are now ready to start debugging the Windows kernel on the server. Here is how:

On the client, start WinDbg, choose File > Kernel Debug (or CTRL+K), and set up the COM port you noted in Step 1:

[Figure: WinDbg kernel debugging connection settings for the COM port]

Press OK, and reboot the server. If you didn’t select the debug configuration as the default boot option, make sure you pick it when the server starts.

If you have done things right, you will get something like this in WinDbg (below, I broke execution with CTRL+C):

[Figure: WinDbg attached to the target kernel after breaking execution]

One thing to note when you are debugging the kernel: not all your typical WinDbg commands work as they normally do (for some good reasons), but that is outside the scope of this blog entry.

Time to dig even deeper… from my MacBook Air into a Windows box :-) Happy hacking, everyone.

Big/Big Table Joins

July 19, 2012 7 comments

With the popularity of my last blog entry on Dangerous Joins, I felt inspired to write a bit more about join strategies. Thanks for participating and reading – there seems to be a large appetite for data modelling out there; this blog now gets over 7K unique visits every month.

Today, let us look at strategies for joining two big tables together.

Read more…