Monday, 25 August 2014

Update to Link Table rules

Back in 2011, I wrote a blog post entitled "Rules for creating a Key/Link Table in QlikView".  There were 3 basic rules:

1. Create a primary key in all the tables,
2. Rename the foreign keys to break the links,
3. Use a mixture of Concatenate and Join to generate the Key table using the Resident data.

These rules will still create a Key/Link table that works, but it might not always be the most efficient.  As I have had to deal with larger and larger data sets in QlikView, I have learned the hard way that things that work quite well on a table with a few hundred thousand records do not work quite so well when your tables have a few hundred million.

It is really the first "rule" here that can be the problem.  In a dimension table, the key that associates to the Key/Link table does need to be the primary key; however, the key that associates a fact table to the Key/Link table does not have to be a primary key!  This key only needs to be unique for the set of keys being linked to in the Key/Link table, not for the whole record.

For example, if my fact table contains CustomerID and DateID, most of the time that combination will not make a primary key in the fact table, but it is all the key that we need to associate to a Key/Link table containing CustomerID and DateID.  By not using a primary key in the fact table, we are vastly reducing the cardinality of the key being used and hence the amount of memory required to store it.
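
To illustrate that point, here is a minimal sketch (the MyFact table, the OrderLineID field and the Quantity measure are made up purely for illustration and are not part of the example later in this post):

   MyFact:
   Load
      OrderLineID,                            // row-level primary key: one distinct value per record
      CustomerID & '-' & DateID As LinkID,    // link key: one distinct value per CustomerID/DateID pair
      CustomerID,
      DateID,
      Quantity
   Inline [
   OrderLineID, CustomerID, DateID, Quantity
   1, 1, 20140801, 10
   2, 1, 20140801, 5
   3, 2, 20140802, 7
   ];

Here the first two rows share a customer and date, so LinkID has only two distinct values against three for OrderLineID.  On a real fact table with millions of rows, the gap between the row count and the number of customer/date combinations is usually enormous, and so is the saving in symbol table memory.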

Another change that I would make is that I no longer use AutoNumberHashxxx to generate a key value.  Now, I always use AutoNumber with an AutoID.  This is because AutoNumber with an AutoID will generate a set of sequential integers which do not, in general, get stored in symbol tables, further reducing the memory requirement.  Using AutoNumberHashxxx to generate multiple different keys will result in non-sequential integer key values which will have to be stored in Symbol tables.
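
As a minimal sketch of the difference (the table, field names and inline data here are made up; the second parameter to AutoNumber is the AutoID, which gives each key type its own counter):

   KeyTest:
   Load
      // AutoNumber with an AutoID: each named counter hands out its own 1, 2, 3... sequence
      AutoNumber(CustomerID & '-' & DateID, 'CustomerDate') As CustomerDateLinkID,
      AutoNumber(ProductID & '-' & DateID, 'ProductDate') As ProductDateLinkID,
      // AutoNumberHash128: as described above, generating several different keys this way
      // leaves each field with non-sequential integers that still need symbol table storage
      AutoNumberHash128(CustomerID, DateID) As HashedLinkID,
      CustomerID,
      ProductID,
      DateID
   Inline [
   CustomerID, ProductID, DateID
   1, 10, 20140801
   1, 20, 20140802
   2, 10, 20140801
   ];

With the three distinct CustomerID/DateID pairs in this sample, CustomerDateLinkID comes out as 1, 2 and 3.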

Yet another change is that I now require the Key/Link table to be generated using the Distinct keyword.  Nothing kills performance quite like having duplicate key values in a Key/Link table.

Finally, I no longer rename the foreign keys and then rename them back when creating the Key/Link table - I just leave them as they are and then use the Drop Field xxx From yyy statement to remove the old foreign keys.

So, the new rules become:

1.  Create a key in each table at the correct granularity to associate to the Key/Link table.  Use AutoNumber with an AutoID.

2.  Use a combination of Concatenate, Join and Distinct to load a Key/Link table of distinct values.

3.  Use Drop Field xxx From yyy to remove the old foreign keys from the fact tables.

For example:

I have a simple structure with Customer, Calendar and Order tables.  I have a separate fact table with Customer, Calendar and Delivery information.  I get a synthetic key that I want to resolve by using a Key/Link table:




Step 1 - Customer (CustomerID) and Calendar (DateID) already have primary keys.  In Order I will create a LinkID from CustomerID and DateID, and I will do the same in Delivery:

   ...
   AutoNumber(CustomerID & '-' & DateID, 'CD') As LinkID,
   ...
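
To make that concrete, the full Step 1 loads might look something like the following.  This is only a sketch: the measure fields and the QVD sources are assumptions for illustration, not part of the original example.

   Order:
   Load
      AutoNumber(CustomerID & '-' & DateID, 'CD') As LinkID,   // same AutoID ('CD') in both loads
      CustomerID,
      DateID,
      Quantity,
      SalesValue
   From Order.qvd (qvd);

   Delivery:
   Load
      AutoNumber(CustomerID & '-' & DateID, 'CD') As LinkID,   // so matching pairs get matching integers
      CustomerID,
      DateID,
      DeliveryCharge
   From Delivery.qvd (qvd);

The important thing is that both loads use the same AutoID ('CD'), so the same CustomerID/DateID pair resolves to the same integer key in both tables.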


Step 2 - Now I load my key table.  I will begin with data from the Order table:

   Key:
   Load Distinct
      LinkID,
      CustomerID,
      DateID
   Resident Order;

Now Concatenate on the keys from the Delivery table:

   Concatenate(Key)
   Load Distinct
      LinkID,
      CustomerID,
      DateID
   Resident Delivery;


Step 3 - Drop the old foreign keys:

   Drop Fields CustomerID, DateID From Order;
   Drop Fields CustomerID, DateID From Delivery;

Now all should be good:


Of course, we don't just create link tables to avoid synthetic keys, we usually create them to overcome association difficulties such as loops.

There is a special prize of a free copy of my forthcoming book, Mastering QlikView, for anyone who can give me a good, technical reason (i.e. not "because Qlik say so") as to why the original data structure, containing the synthetic key, is less desirable than the link table structure.


Stephen Redmond is the author of QlikView Server and Publisher and the QlikView for Developers Cookbook
He is CTO of CapricornVentis, a QlikView Elite Partner. We are always looking for the right people to join our team.
Follow me on Twitter: @stephencredmond

10 comments:

  1. Hi Stephen,

    A couple of reasons why I prefer link tables to synthetic keys.

    Firstly, when previewing data in the table viewer I find it easier to look through the data if I don't have a synthetic key, as browsing the synthetic key table removes the context around the data.

    Secondly, if there are issues in the data load (for example, an issue with date interpretation), looking at the data through a link table can make it easier to spot which table is the source of the issue.

    Marcus Malinow

  2. A Table Box is a far better way to "browse" data than the preview option. It is also excellent at revealing issues with data - although I would prefer to have those resolved a long time before I start building a key table!

  3. I can't think of a single "technical" reason the synthetic key solution is less desirable.
    If the data model is correct, and QV generates a synthetic key, then it has created the correct synthetic key; it's no different to making your own concatenated key. Behind the scenes QV will store the synthetic key in the same manner as it'd store the user-created concatenated key. So technically the two models are tautologous.
    Now, there may be some other benefits to using the LinkID user-created concatenated key. By using AutoNumber you may save a minuscule amount of RAM. You could reference the link table within the app if you'd ever want/need to for some more obscure reason.
    But, in my mind, at the technical base level, there is no reason to have to replace the synthetic key in this case. Synthetic keys are not "bad" per se, though they're often an indication that there are underlying problems with the data model. However, in this case it all looks good, so why re-invent the wheel when QV has done all the legwork for you already?

  4. Hi Stephen. Thanks for a great post.
    I have one reservation, though.
    Given that we try to avoid duplicate key values in the Key table, it looks as if the code at step 2 of your example would not achieve that goal.
    The Key table would have duplicate rows for all CustomerID/DateID pairs which have both Orders and Deliveries in the underlying fact tables.

    So another DISTINCT load from that Key table is necessary to get rid of those duplicates.

  5. Hi Vadim,

    If you load a table with Distinct, it stays Distinct.

    Have a search for Barry Harmsen's blog on the subject.

    Regards,

    Stephen

  6. Oops. I've read that article of Barry's previously, but that behaviour is so counterintuitive for me that it immediately escaped my mind.
    Thank you for the clarification.

  7. A few reasons that come to my mind:
    1. A key generated using AutoNumber will save some RAM.
    2. Searching on an AutoNumber key will be faster compared to a link table that has a string-based key.
    3. Last but not least, from a technical maintenance point of view - the synthetic key table is auto-generated by QlikView, and I don't wish to rely on something that is auto-generated like this. I remember reading that the behaviour of Loosely Coupled Tables has changed multiple times over different versions of QlikView. I prefer to avoid landing in that situation… so I prefer a link table over a synthetic key - that just makes technical maintenance easier, even through QlikView upgrades.

    Replies
    1. One related link worth highlighting:
      http://community.qlik.com/message/432613#432613

  8. Memory-wise, the synthetic key in your case would probably consume less memory.
    So creating a link table provides a clear, owned, expandable and easy-to-maintain data model, which could be a technical advantage over the synthetic key version.

  9. I would say that a more flexible and expandable data model would be the following:
    Assumption: deliveries are associated with orders. The key table in the center becomes a factKey table, which is a concatenation of orders and deliveries, but without any measures or textual attributes (thus a very lean concatenated fact). Measures and textual attributes are left in the surrounding order and delivery facts, which allows for ultimate reportability. The factKey center table would contain the same joins to customer and calendar as described. Most importantly, it would also contain primary keys to both orders and deliveries. Delivery records in this central table would thus also contain an OrderID, so with this design you could show freight by shipper, which you cannot do in the proposed design.

    All related data mart fact tables could be concatenated in this design, and none need to have identical dimensionality. Once again, unlimited flexibility and reportability is achieved. Will this use more memory? Yes, but I have found that fast but inflexible data models are much worse than flexible, expandable but slower data models.

    I have used this approach at several clients, always with success. When the client has another related fact to add to the model, it's simple to just concatenate it into the center factKey table, with old and new dimensions to join. When you look at this design it will always look like a star schema. I have had many tens of millions of rows, but not hundreds of millions. I would submit that a single QlikView app with hundreds of millions of rows is not a good design from the start, and can be refactored to what is really needed.

