
Exploring Hash Functions in SQL Server

November 6, 2011

Hash distributing rows is a wonderful trick that I often apply. It forms one of the foundations for most scale-out architectures. It is therefore natural to ask which hash functions are most efficient, so we may choose intelligently between them.

In this blog post, I will benchmark the hash functions built into SQL Server. I will focus on answering two questions:

  • How fast is the hash function?
  • How well does the hash function spread data over a 32-bit integer space?

    I know there is also the question about how cryptographically safe the function is, but this is not a necessary property for scale-out purposes – hence, that aspect is out of scope for this blog.

    For the picky people (my dear soul mates): I will be using 1K = 1024 in the following, so 64K = 65536.

        Test data

      Because we sometimes find ourselves hashing dimension keys, I will use a very simple table with a nice key range as my test input. Such a key range could for example have been generated by an IDENTITY(1,1) column or a SEQUENCE. Here is the test script that generates the input:

    CREATE TABLE CustKey (SK INT NOT NULL)

    INSERT INTO CustKey WITH (TABLOCK) (SK)
    SELECT n FROM dbo.fn_nums(10000000)


    (see my previous post for the fn_nums source)
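    For reference, here is a minimal sketch of what an fn_nums-style numbers function can look like. The real source is in that previous post; this version (stacked CROSS JOINs plus ROW_NUMBER) is just one common way of generating the sequence 1..@n:

    CREATE FUNCTION dbo.fn_nums(@n BIGINT)
    RETURNS TABLE
    AS
    RETURN
        WITH
          L0 AS (SELECT 1 AS c UNION ALL SELECT 1)                  -- 2 rows
        , L1 AS (SELECT 1 AS c FROM L0 AS a CROSS JOIN L0 AS b)     -- 4 rows
        , L2 AS (SELECT 1 AS c FROM L1 AS a CROSS JOIN L1 AS b)     -- 16 rows
        , L3 AS (SELECT 1 AS c FROM L2 AS a CROSS JOIN L2 AS b)     -- 256 rows
        , L4 AS (SELECT 1 AS c FROM L3 AS a CROSS JOIN L3 AS b)     -- 64K rows
        , L5 AS (SELECT 1 AS c FROM L4 AS a CROSS JOIN L4 AS b)     -- 4G rows, plenty
        , Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L5)
        SELECT n FROM Nums WHERE n <= @n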

    Speed of the Hash Function

    SQL Server exposes a series of hash functions that can be used to generate a hash based on one or more columns.

    The most basic functions are CHECKSUM and BINARY_CHECKSUM. These two functions each take a column as input and output a 32-bit integer.

    Inside SQL Server, you will also find the HASHBYTES function. This little gem can generate hashes using the MD2, MD4, MD5, SHA and SHA1 algorithms. The problem for the purpose of our test is that these functions spit out BINARY types, either 128 bits (for the MD family) or 160 bits (for SHA and SHA1). However, we can quickly map that value into the integer space by doing a modulo (in SQL: %) with MaxInt (2**31 – 1). Interestingly, modulo works directly on values of BINARY in SQL Server.
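    To make the mapping concrete, here is a sketch of the expression I am describing (the NVARCHAR cast is there because HASHBYTES will not accept an INT directly):

    SELECT SK
         , HASHBYTES('SHA1', CAST(SK AS NVARCHAR(11))) % 2147483647 AS h  -- fold the BINARY hash into the 32-bit integer space
    FROM CustKey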

    Before I move on to the results, I would like to share a little trick. When you need to get the raw CPU speed of the hash, you may be tempted to do something like this:

    SELECT BINARY_CHECKSUM(SK) AS h
    FROM CustKey


    But this is wrong. Why? Because you are measuring both the time it takes to hash and the time it takes to transfer the rows to the client. This is not very useful; on my laptop, it takes over two minutes to run the above query.

    Instead, you should try to measure the raw time the server needs to calculate the data. One way to achieve this (and this trick can be used on most queries you benchmark) is to wrap the SELECT statement in a MAX function. Like this:

    SELECT MAX(h)
    FROM (
        SELECT BINARY_CHECKSUM(SK) AS h
        FROM CustKey
        ) noRows
    OPTION (MAXDOP 1)

    Notice something else interesting: I am forcing parallelism down to 1. I want to get as close to the raw cost of the CPU as possible – so I don’t want to clutter the query with the overhead of parallelism. The query now runs in a little less than 3 seconds on my laptop.

    In addition to the above trick, we also need to quantify the time SQL Server spends on just accessing the table. We can get a good approximation of this by measuring the SELECT statement without any hash function. On my laptop, I measured this to be around 3200 ms. In the results below, I have subtracted this runtime so we are only measuring the cost of the hashing.
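    The baseline measurement is simply the same MAX wrapper without the hash call – something along these lines:

    SELECT MAX(SK)
    FROM CustKey
    OPTION (MAXDOP 1)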

    With all this said, let us have a look at the results:

    [Chart: measured runtime per hash function]

    Some comments on the data:

    • Notice that MD2 is very inefficient. I have not studied this algorithm in great detail and would appreciate a specialist commenting here.
    • Apart from the MD2 anomaly, the HASHBYTES functions all perform pretty much the same, using around 1 microsecond per hash value calculation.
    • BINARY_CHECKSUM and CHECKSUM are MUCH faster, by about an order of magnitude.
    • Doing a simple modulo that spreads out the data over the integer space is about the same speed as CHECKSUM and BINARY_CHECKSUM.
    • It is not the cast to NVARCHAR that uses the CPU time (HASHBYTES cannot take an INT directly, so the input is cast to NVARCHAR).
    • However, the checksum functions do take longer to run when you first cast to NVARCHAR, costing around 100 ns per INT value.

    Of course, the HASHBYTES functions also have characteristics other than speed which may be desired (cryptographic utility, for example). But if all you want is speed, it looks like CHECKSUM and BINARY_CHECKSUM are the faster choices.

    But, let us see how good the functions are at spreading the data.

    Spread of the Hash Function

    Another desirable characteristic, apart from speed, of a hash function is how well it spreads values over the target bit space. This is useful not only for cryptographic purposes, but also to “bucket” data values into equal sized portions, for example in scale-out MPP architectures.

    Often, you will not know in advance how many buckets you will eventually want to subdivide the integer space into. You may start out with 32K buckets, dividing the integer interval into 32K evenly sized buckets (the 32-bit space divided by 128K values per bucket = 32K buckets), each with 128K values in them:

    • Bucket0: [MinInt…MinInt + 128K[
    • Bucket1: [MinInt + 128K…MinInt + 128K*2[
    • Bucket2: [MinInt + 128K*2…MinInt + 128K*3[
    • … etc..
    • Bucket 32K-1: [MinInt + 128K*(32K-1)…MaxInt]

    Perhaps you will later want a further subdivision into 64K buckets, each with 64K hash values.
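    If you want to compute the bucket number for a given hash value directly, the arithmetic is just an offset and an integer division. A small sketch, with hypothetical variable names, assuming the 64K buckets of 64K values each that I use in the test below:

    DECLARE @h INT = -123456789            -- some 32-bit hash value
    DECLARE @minInt BIGINT = -2147483648   -- lower bound of the INT space
    DECLARE @bucketWidth BIGINT = 65536    -- 64K values per bucket => 64K buckets

    SELECT (CAST(@h AS BIGINT) - @minInt) / @bucketWidth AS BucketID   -- yields 0 .. 65535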

    Made wiser by the runtimes I measured before, I used only 1M rows with values [1…1M] for this test. However, this is plenty to show some good results, so don’t worry.

    To avoid running chi-square testing (see later) on integer values spread all over the 32-bit space, I calculated the hash of each key and then bucketed the hashes into 64K bucket ranges (each covering 64K values, neat isn’t it?).

    This table comes in handy for that bucketing:

    CREATE TABLE ExpectedBuckets
    (
        BucketID INT
      , RangeStart INT
      , RangeEnd INT
    )


    INSERT INTO ExpectedBuckets
    SELECT
        n
      , (n - 1 - POWER(2,15)) * POWER(2,16)
      , (CAST(n AS BIGINT) - POWER(2,15)) * POWER(2,16) - 1  -- BIGINT avoids intermediate INT overflow for the top bucket
    FROM dbo.fn_nums( POWER(2,16) )


    A sample output from this table:

    [Table: sample rows from ExpectedBuckets]

    I can calculate the hash values and put them into a table like this:

    CREATE TABLE HashResult
    (
      ObservationID INT IDENTITY(1,1)
    , HashValue_i INT
    )

    CREATE CLUSTERED INDEX CIX ON HashResult (HashValue_i)
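    The table is then populated once per hash function under test – a sketch for one of them (BINARY_CHECKSUM here; the WHERE clause is my assumption of the 1M-key subset mentioned above, and the other functions are tested by swapping in, for example, HASHBYTES with the modulo trick):

    INSERT INTO HashResult (HashValue_i)
    SELECT BINARY_CHECKSUM(SK)
    FROM CustKey
    WHERE SK <= 1000000   -- only the 1M keys used for the spread test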


    And finally, I can join them all up and count the values in each bucket:

    SELECT
          BucketID
        , CAST(ISNULL(COUNT(HashValue_i), 0) AS FLOAT) AS Observed
    FROM ExpectedBuckets EB
    LEFT JOIN HashResult
    ON HashValue_i BETWEEN EB.RangeStart AND EB.RangeEnd
    GROUP BY BucketID


    We now have a nice count of the content in each hash bucket; it looks like this:

    [Table: observed row count per hash bucket]

    How do we test if this is a good result or not? It seems that min/max, averages and standard deviations don’t really suffice here (sorry, modern MBA types, I realise I am going to lose you now).

    We need to do some hypothesis testing. The hypothesis I want to test (aka the null hypothesis – all resemblance to relational algebra is purely coincidental) is:

    “My chosen hash function evenly distributes the integers into the 64K hash buckets”

    Interestingly, this hypothesis makes a prediction (it has to – if not, we cannot falsify it): namely that in each bucket filled by hashing the 1M rows, we should expect to measure approximately 1M / 64K ≈ 15 members.

    There is also an important twist here: we have to measure the content of each bucket, even when it is zero. This is why I use a left join in the query above – we have to measure all the buckets we know about, not just count the hash buckets that actually get filled (you can see how not bucketing the data would have made the test result very large).

    We can now test the hypothesis against the measured result. If the observed counts closely match the prediction, it gives us confidence in the correctness of the hypothesis and lets us conclude that our hash function is a good one. It also gives us the ability to compare different hash functions with each other. We can quantify how well the predicted result fits the actual result with the chi-square test. Interestingly, the tools we need for this test are even exposed in Excel as the CHISQ series of functions.
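    If you would rather stay inside SQL Server than go to Excel, the chi-square statistic itself is straightforward to compute from the bucket counts – a sketch, assuming 1,000,000 observations spread over the 64K buckets (so the expected count per bucket is 1,000,000 / 65,536):

    WITH BucketCounts AS
    (
        SELECT
              EB.BucketID
            , CAST(COUNT(HR.HashValue_i) AS FLOAT) AS Observed
        FROM ExpectedBuckets EB
        LEFT JOIN HashResult HR
          ON HR.HashValue_i BETWEEN EB.RangeStart AND EB.RangeEnd
        GROUP BY EB.BucketID
    )
    SELECT SUM(SQUARE(Observed - 1000000.0 / 65536) / (1000000.0 / 65536)) AS ChiSquare
    FROM BucketCounts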

    Without further ado (and insults), here are the results:

    [Table: chi-square test results per hash function]

    (source data available on request)

    Some observations about the data:

    • Most importantly: the CHECKSUM and BINARY_CHECKSUM functions are very poor hash functions. Even with 1M input and 64K large buckets, they don’t even fill the hash space. The skew is massive on non-skewed input data.
    • We also see that casting the INT to NVARCHAR improves the behavior of both BINARY_CHECKSUM and CHECKSUM, but not enough to make them good hash functions.
    • SHA and SHA1 seem to give us the best spread of the data, but at 1 microsecond per hash, it seems rather expensive.
    • Notice that the MD series does not give a particularly nice spread. I suspect this is related to the way I fold the 128-bit value into a 4-byte integer by taking the modulo with MaxInt.
    • We see that only the modulo gives a distribution with nearly 100% confidence (not fully, though, because 64K does not evenly divide 1M, which is reflected in the chi-square value of 824). This result is expected, and my function here is engineered to “game the test”. The modulo cannot generally be used like this to spread values all over the 32-bit integer space.
    • For fun, I put in a test where I use random values all over the integer space. Even with a confidence interval of 10%, we cannot say it distributes evenly. This means that the SHA functions distribute better than random on this input dataset. Your mileage will of course vary for random values.

    On the note of the CHECKSUM series, Books Online does (under)state:

    If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

    Summary

    In this blog I have explored properties of the hash functions built into SQL Server. The results indicate some general guidance that could be followed:

    • For the best spread over the hash space, use the SHA or SHA1 algorithms with the HASHBYTES function
    • Be VERY careful about using CHECKSUM and BINARY_CHECKSUM, as these functions do not spread a hashed integer value nicely over the full space. Depending on what you use these functions for, they may not be suitable
    • It takes about 1 CPU microsecond to calculate a single call to HASHBYTES, except when hashing with MD2
    • Using HASHBYTES to spread out values over the 32-bit integer space seems rather expensive, since even the best case uses tens of thousands of CPU instructions.
    • This indicates that it may be possible to implement a better hash function if you do not care about cryptographic properties of the hash


    1. November 6, 2011 at 23:45

      Awesome stuff, as usual!

      I’d be very interested in seeing a similar analysis (or perhaps just an extension to your table) for the various hash functions in regards to hash clashes for distinct values. When it comes to hash distribution, you’ve nailed it in your conclusion. However, if using hashes to detect changes – as mentioned in your BOL quote – besides speed, the risk of clashes is equally important, whereas the distribution isn’t important.

      I’d assume the page checksum is performed using a high-performance low-clash-risk hash function.

      • Thomas Kejser
        November 7, 2011 at 00:30

        I have not actually looked at the source code for the page checksums, but I would assume that the goal is different from mine (detecting changes) and that the hash is tailored for that. Clearly CHECKSUM and BINARY_CHECKSUM do not have avalanche properties, which would be desirable for a page checksum. With regards to hash clashes – I need to think about how to set up a good test for that property. Any ideas?

        I am thinking about implementing a CLR version of MurmurHash3 to see how many CPU cycles I can get it down to.

        (Thanks for correcting the interesting typo, by the way)

    2. December 11, 2012 at 22:11

      Hello Thomas, my name is Diego. I was reading your article about the HASHBYTES function but I got the following error: ‘HashBytes’ is not a recognized function name. I have copied the same example from the MSDN page http://msdn.microsoft.com/en-us/library/ms174415.aspx. Do you have any clue why it is not working? Thanks in advance for your advice.

      • Thomas Kejser
        December 12, 2012 at 17:56

        Hi Diego

        As far as I remember, the HASHBYTES function is only available in 2008 and above. Are you running an older version?

        • December 12, 2012 at 18:16

          That was the problem. I have 2 instances, one with SQL 8 and the other one with 10. I changed and now it is working perfectly. You are awesome, thanks!

        • Thomas Kejser
          December 12, 2012 at 19:06

          Welcome 🙂

    3. May 13, 2013 at 14:29

      Hi

      I personally am not a big fan of the SQL hashing functions (collisions, performance), so I tend to use FNV1a as my go-to function. Admittedly, you will need CLR to use it in SQL Server, but I often use it from SSIS (see here: http://markgstacey.net/2013/05/11/incremental-loads-in-ssis-using-the-lookup-component-and-fnv1a-hash-in-a-synchronous-script-component/ ) or use the hash for sharding, in which case we’re doing it from .NET anyway.

      Currently my research is into combining multiple hash functions to reduce collisions.

      Here is some more reading on hash functions :

      http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

      http://tools.ietf.org/html/draft-eastlake-fnv-03

      • Thomas Kejser
        May 21, 2013 at 15:49

        Great additional context!

        I am not sure combining hashes is a viable path to pursue. I seem to recall from my old CS days that hashing a hash with another actually makes collisions worse.

        • May 21, 2013 at 17:06

          Hi Thomas

          I’m going to have to do some research on this and feed back to you; my original thoughts have always defaulted to purely never combining hashes, but some of the reading I’ve been doing has indicated otherwise, and my own tests with just FNV-1 and FNV-1a have shown positive results.

          I believe that combining cryptographic hashes is non-productive, but this is a non-cryptographic function. I will report from my tests

    4. July 10, 2013 at 16:44

      I have one comment on CHECKSUM, and that is that CHECKSUM is different from the other hashing functions, as the others preset the domain to the full extent of the number of bytes used for the result. MD5 and the others pre-initialize with a set of data and then apply the user input on top of that. CHECKSUM doesn’t, as the function only relies on the input. See my blog post here: http://weblogs.sqlteam.com/peterl/archive/2010/08/19/checksum-weakness-explained.aspx

      You would most probably get different bucket results for CHECKSUM if you use 1 million values occupying at least 31 bits (32 bits if using negative values).

      • Thomas Kejser
        July 10, 2013 at 17:44

        Even if it only relied on the input – it could still do a good job at the spread (for example: Murmurhash does fine without a salt).

        So I simply think checksum is too simplistic an algorithm

    5. July 10, 2013 at 16:47

      What would happen if you initialized your sample table with this instead?

      CREATE TABLE CustKey (SK VARCHAR(40) NOT NULL);

      INSERT CustKey WITH (TABLOCK) (SK)
      SELECT CAST(NEWID() AS VARCHAR(40)) FROM dbo.fn_nums(10000000);

