Home > Data Warehouse, Grade of the Steel > What Structure does your Data Have?

What Structure does your Data Have?

I am currently thinking about measuring compression rates for different column store indexing strategies. In order for me to get some realistic data, I am looking for people who can share a very small anonymised data sample with me.

Specifically, I am looking for samples from Kimball style warehouses from different industries (If you are a 3NF warehouse, I am not interested). Roughly speaking, I would like something like this select statement:

 

SELECT HashFunction(DimensionKey1) AS d1
, HashFunction(DimensionKey2) AS d2
, ….
, HashFunction(DimensionKeyN) AS dN
, HashFunction(Measure1) AS m1
, …
, HashFunction(MeasureM) AS mM

FROM FactTable

 

Where HashFunction is something that yields either a 4 byte or 8 byte integer. If you are using SQL Server, BINARY_CHECKSUM will do (or you can get fancy and use my C# implementation of CRC32).

10MB of data from a fact table would be optimal, preferably sampled from a single time interval in the data. For example: if a full day is around 10MB in the warehouse, a single day would be the  best sample. I don’t need to know any column names, only which columns are dimensions and which ones are measures. If possible, some indication of which industry the data is from would be helpful, but it is not strictly needed either.

I am fully aware that there may be legal reasons you cannot share data, even when anonymised like above. Please only give it to me if you would feel confident to give this data freely away on the Internet. I am specifically trying to create guidance on compression techniques depending on the content of data, and I plan to share this on  my blog, along with distribution statistics on the anonymised data.

Data can be delivered to me via DropBox or Google Drive in CSV format. If needed, I can also HTTP or FTP GET it.

Thanks for anything you can share, I will return the sharing favour here on my blog

Advertisements
  1. May 15, 2012 at 20:29

    Hi, is the fact table with the dimension surrogate keys good enough? Or do you need the full star schema?

    • Thomas Kejser
      May 15, 2012 at 23:10

      I only need the fact table Marcel. And if you already use integer surrogate keys and you dont mind oncloaking the data, you dont even need to apply a hash function.

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s