Friday, 16 April 2010

How QlikView stores data

To become really good at QlikView, you have to really understand how QlikView stores data. This is an introduction that should help you along that road.

If I load a very simple dataset:

Fruit:
LOAD *
INLINE [
Fruit, Color
Banana, Yellow
Blueberry, Blue
Tomato, Red
Tomato, Green
];


And then add both fields as list boxes, I see that there are only 3 fruit listed - even though I have loaded 4 entries. "Tomato" is only stored once by QlikView. Internally, there are binary keys keeping track of the association between "Tomato" along with "Red" and "Green".

In a large data set with a lot of repeating values (especially text values) the fact that each field value is only stored once means that there is a huge saving in the amount of memory needed to store the information. This is why QlikView can store so much data in memory.

Let's add another table of data:

Colors:
LOAD * INLINE [
Color, Red, Green, Blue
Yellow, 255, 255, 0
Blue, 0, 0, 255
Red, 255, 0, 0
Green, 0, 255, 0
Amber, 255, 121, 0
];


I now might think that I have 2 tables of data - both of which have a field called "Color". But I don't! There is only one field called "Color" in our dataset and this field will have 5 values. Internally, the association between each of them and their associated values are maintained with binary keys. It just so happens that one of the values - "Amber" - is not associated with any of the values in the "Fruit" field.

Note that there is probably no direct association between "Fruit" and "Red". The association is indirect. However, because of the smaller data set in memory and the binary keys, the association is very fast.

Hope this is useful!

2 comments:

  1. QV use math science then use minimal space in memory & hi performance with the nucleo machine

    other key is the japanesse phy, do focus (in excel) & to improve on

    by last QV is oriented to TIC (TI + communications)

    all this to make to QV as entrĂ³pic & agil MS, IBM, Oracle, Google etc this know but to lose
    because are fat & no share

    hurray¡ QV

    llauses@llauses.com

    ReplyDelete
  2. There are two types of associations involved when storing data, and I think clarifying them a bit further is helpful.

    The association you are describing is keeping track of the different field values associated with a specific record. You are correct in that they need to be associated in order to perform aggregate data operations.

    The 2nd association is how QlikView presumably stores data within memory. I posit that the most-efficient storage is to store string values with identical values at a single location, and then refer to that location by a memory address. This makes string comparisons trivial since you just need to compare the addresses, and reduces the memory requirements since only unique string values are stored. Note that this solution applies to storing identical strings between different fields (for example look at "Red" in your sample data), which was not explicitly mentioned above.

    One implication of this internal architecture is that it is costly to revise data, which is why QlikView works best as an analytical tool referencing the data in read-only mode.

    My disclaimer is that I do not know the internal QlikView code architecture or code base, this is just supposition based on my experience and knowledge of best practices.

    QV is a great tool, and an understanding of how it works "under the hood" has really helped me leverage its features and create great analytical solutions.

    Matthew Clark
    GORGES
    mclark@gorges.us

    ReplyDelete

Note: only a member of this blog may post a comment.