Thursday, January 24, 2008

OLAP Workshop 6 : Advanced Cube Design

In the previous workshop we looked at creating a cube making use of AWM’s ability to manage the other features. In most cases these default settings will provide good load and query performance. Certainly when looking at the data model that supports the 10g common schema the default settings do a great job and make life much easier. Consequently, you can design and build the analytic workspace from using the data sourced from the SH schema in about 15 minutes.

In some cases you may need to move beyond the default settings and in the next few sections we will look at the other tabs that are part of the Cube wizard. These tabs control sparsity, compression and partitioning features, aggregation rules, and summarization strategies. The tabs and the features they control will be explained in the following order:
  • Implementation Details
  • Rules
  • Summarize To
  • Cache

But before proceeding there is one important thing you should always do before starting to load data into a cube (of a dimension) – review the data in as much detail as possible. Data quality is a subject that most companies often don’t even consider when building cubes, and most consultants just take the data given to them a load it without question.

In any project I would allocate 10-30% of the time looking at the data. The information gained at this stage will provide huge benefits later when you need to determine sparsity patterns (explained later). On a recent customer project I was asked to tune a cube to improve load and aggregation times. When we started to review the data we noticed some very very large numbers in one of the measures, which were simply amazing. After a lot of analysis we determine the ETL that was computing the figure in to the fact table had a mistake. Unfortunately both the developers and business users failed to identify this error. To compound the problem, the data formed a key business metric.

Therefore, NEVER EVER start loading data until you have checked the quality. Ideally you should use the data quality features of Warehouse Builder, which can significantly speed up this process. There are a number of presentations relating to data quality on the Warehouse Builder OTN home page.

Implementation Details Tab
Most of the advanced options for tuning your multidimensional model are found on the Implementation Details tabbed page of the Create Cube wizard. As shown below the Implementation Details tabbed page contains four important tuning features of Oracle OLAP. The correct use of these features ensures that your analytic workspace is very efficient and is implemented in an optimal way.

1.Sparsity: AWM 10g, by default, applies the common best practice in deciding which of your dimensions should be marked as “sparse” when you create a cube. Sparsity refers to the natural phenomenon evident in all multidimensional data to some degree: Not all the cells in the logical cube (the total possible combinations of all the dimension members for each dimension of the cube) will ever contain data. It is very common for a relatively small percentage of the possible combinations to actually store data. By understanding the sparsity of the data you expect to load into your AW, you can tune how it handles that sparsity and improve the performance of data loading and aggregation and reduce the disk-storage requirements for the populated AW.

After you understand which dimensions are sparse and which are not, their order can be important. When there are a large number of empty cells in a cube, the cube is said to be “sparse.” For example, if you are a manufacturer of consumer-packaged goods, you do not sell one or more of every single product you make to every customer, every day, through every sales channel. Different customers buy different products, at different time intervals, and each customer probably has a preferred channel. Different products may display different sparsity patterns: Ice creams and cold drinks tend to sell faster in the summer, whereas warm arctic coats are more popular in the winter (particularly in cold locations).

When using multidimensional technology, pay attention to sparsity so that you can design cubes efficiently. The effect of sparsity in data (and a badly designed cube) can result in tremendous growth in disk usage and a corresponding increase in the time taken to update and recalculate data in the cube. Inefficient sparsity control in any multidimensional data store can result in many empty cells actually being physically stored on the disk. This is something that is less of a concern with relational technology, because it is rare to store a completely null row in a table.
Oracle OLAP automatically deals with sparsity up to a point. But you, as a cube builder, can provide Oracle OLAP with the information that you know about your data (and information that Oracle OLAP needs to know) to deal with that data extremely efficiently.

Cube designers express sparsity in percentage terms. Data is said to be 5% dense (or 95% sparse) if only 5% of the possible combinations of the cells in a multidimensional measure or cube actually contain data. In many cases, data is very sparse, especially sales and marketing data. Only very aggregated data with a fairly small number of dimensions is typically dense enough for you to not consider sparsity.

Sparsity tends to increase with the number of dimensions and with the number of levels and hierarchies in each dimension. As you add dimensions to the definition of a cube, the number of possible cell combinations can increase exponentially. Also, the granularity of data affects sparsity. Low-level, detailed data is much more sparse than aggregate data. Very aggregate data is typically dense. Particular combinations of dimensions typically have different sparsity from others. For example, Time dimensions and Line dimensions are often more dense than dimensions such as Product, Customer, and Channel. This is because combinations of customers and products are sparser than combinations of customers and time or sparser than combinations of products and time. For this reason, AWM 10g asks you to confirm which of the dimensions for your data are sparse dimensions and which ones are dense.

In most cases I would recommend making all dimensions sparse. However, there are some additional considerations. The most important is the use of partitioning and we will look at this in one of the following sections. Sometimes, you may need to build a cube with different sparsity settings to determine the most efficient settings. In some cases making Time dense will generate a highly efficient cube and in other situations it will cause the massively extend the time take to load and aggregate data. The best method is to use an iterative development approach, but as with tuning be careful not to change too many settings at once as it becomes difficult to interpret the results.

A very common mistake I see with many customers is they insist on loading a zero balance into a measure. This is quite pointless, since a zero balance does not impact the overall total. Now it can be important to differentiate between an NA row and zero-row but for 99.9% of analysis it is possible to infer one from the other. Therefore, when loading data into a cube add an additional filter to remove zero and NA rows since this will provide huge savings in load and aggregation times. I was working on a project recently where a fact table contained 75 million rows of data and 65% of those rows contained 0 or NA.

2.Dimension order: It is possible to improve the build and aggregation performance of your AW by tuning the order in which the dimensions are specified in your cube.

When using the compression feature (discussed below), it is usually best to have a relatively small, dense dimension (such as Time) first in the list, followed by a group of all the sparse dimensions. Furthermore, it is generally the best practice to list the sparse dimensions in order of their size: from the one with the least members to the one with the most..

Note 1: Sparsity and dimension order are generally considered at the same time, which is why these choices are grouped together in the AWM 10g user interface:

My recommendation is to try building your cube with Time marked sparse and then try with Time marked dense. The effect on load times varies according to nature of the source data. I recently worked on a project where we marked all the dimensions as sparse and loaded a trial data set in 4 hours. By making Time dense, the same dataset loaded in 1 hour. Therefore, it pays to understand your data. But, most importantly, don’t assume you will get the data model right first time.

3.Compressed cubes and Global Composites: Version 10g of Oracle OLAP provides a new, internationally patented technology for the AW, which is exposed via a simple check box in AWM.

This is an extremely powerful data storage and aggregation algorithm optimized for sparse data. It is a new technology that is often dramatically faster than any previous OLAP server technology when aggregating sparse multidimensional data. The use of this feature can improve aggregation performance by a factor of 5 to 50. At the same time, query performance can improve, and disk storage is often also dramatically reduced. This feature is ideal for large volumes of sparse data but not suitable for all cubes (especially dense cubes).

If the “Use Compression” option is selected, then additional efficiency can often (but not always) be achieved by marking all dimensions (including Time) as sparse, especially for sparse data where there is known seasonality in the data, and especially if your AW is also partitioned on Time. But see my previous notes regarding this subject.

As we use this feature on more and more projects it is becoming clear that just about every cube will benefit from compression. Now there are some exceptions, such as cubes where you plan to use and application to write-back data directly into the cube, but such situations are easily managed by posting the updated data to a relational table and using the normal data load procedures to import and aggregate the data.

Note: Dimension order is unimportant when using compression. The multidimensional engine automatically determines how best to physically order the data after it is loaded.

A composite is an analytic workspace object used to manage sparsity. It maintains a list of all the sparse dimension-value combinations for, which there is data. By ignoring the sparse “empty” combinations in the underlying physical storage, the composite reduces the disk space required for sparse data. When data is added to a measure dimensioned by a composite, the AW automatically maintains the composite with any new values.

A “global” composite is simply a single composite for all data in a cube. Depending on the Compression and Partitioning choices you make, the behaviour of AWM will vary.

When would you opt to create Global Composites? The answer is very rarely. It can be beneficial to select this option in the case of a non-compressed cube that is partitioned. But as stated above, it is probably best to use compression on just about every cube you create, so you should probably leave the option unselected.

4.Partitioned cubes: You can partition your cube along any level in any hierarchy for a dimension. This is another way of improving the build and aggregation performance of your AW, especially if your computer has multiple CPUs. Oracle Database 10g (and thus the OLAP option) can run on single-CPU computers, large multi-CPU computers, and (with Real Application Clusters and Grid technology) clusters of computers that can be harnessed together and used as if they are one large computer. Oracle OLAP is, therefore, perhaps the most scalable OLAP server available.

Using partitioning does have certain knock-on consequences in 10g, but these are resolved in11g. In 10g, when you look at the “Summarize To” tab (this will be explained later) the levels above the partition key cannot be pre-aggregated and have to be solved at query time. Therefore, it is critical to select an appropriate level as the partition key so that query performance is maintained. Let us consider the example of time dimension:

If we use Day as the partition key, each individual partition will be small which should improve load times and aggregation times. But when a user creates a query based on yearly data 365 values have to be aggregated at run time for each cell being referenced within the query. Depending on the hardware this might or might not provide acceptable query performance.

If we use month as the partition key, each individual partition will still be relatively small and load times and aggregation times should still be acceptable. Each partition will hold between 28-31 days worth of data and in this case it would be prudent to make Time sparse within the model. However, when a user creates a query based on yearly data only 12 values have to be aggregated at run time for each cell being referenced within the query.

Partitioning has a big impact on two key areas:
  • Partial Aggregation
  • Parallel Processing
Partial Aggregation – the Oracle OLAP option supports incremental updates to a cube (as we will see in a later workshop). This allows the engine to only aggregate date for just those members where data has been loaded. Which means the aggregation process can work with a substantially reduced set of data. For example, if we are loading data for Dec 2008, then for the time dimension only the members Q4 2008 and 2008 are impacted by any data loaded.

Parallel Processing – By partitioning a cube, it is possible to solve it in parallel assuming data is being loaded into more than one partition. Which brings us to an important point. Most customers will typically partition their cubes by time. Of course if you only load data for one month at a time and use month as the partition key then parallel processing is not going to occur. Which may or may not be a good thing.

Rules Tab
On the Rules tabbed page, you identify aggregation rules for the cube (this is also available within each individual measure). You have many different kinds of aggregations available. This is one of the most powerful features of Oracle OLAP, enabling different dimensions to be independently calculated using different aggregation methods (or not using aggregation at all). In effect, a different aggregation method can be applied each dimension within a cube. The engine itself is also capable of supporting dimension member level aggregation plans through the use of MODELS. However, at this point in time Analytic Workspace Manager 10g does not support this feature. But AWM11g will support the ability to create dimension member aggregation plans in the form of custom aggregates.

In this image below, the aggregation method of SUM is used across all dimensions.

However, as we will see later different aggregation methods are available. For example, if you have costs and price data, you may want to see this data averaged over time, answering such business questions as “What is the average cost over 12 months?” or “What is the average price over 2 years?”

Aggregation Methods
It is common to set the aggregation rules only once for all measures contained in a cube. When you define a cube, you identify an aggregation method and any measures that you create that belong to the cube automatically receive the aggregation methods for that cube. This is the default behaviour, and it is one of the benefits of using a cube: By setting up aggregation rules and sparsity handling for all the measures once at the “cube” level, you save time and reduce the scope for errors or inconsistencies.

The default for aggregation used by AWM is the SUM method (simple additive aggregation) for each dimension. However, you do not have to aggregate data. Some measures have no meaning at aggregate levels of certain dimensions. In such cases, you can specify that the data is non-additive and should not be aggregated over those dimensions at all. Choosing the non-additive aggregation method means that when you view the data in the analytic workspace, you find data only at the leaf levels of the dimensions for which you selected that method.
Understanding Aggregation

AWM allows you to set aggregation rules for each dimension independently for your cubes and measures. That is, each dimension, if required, can use a different mathematical method of generating data for the parent and ancestors.

Here are some examples of different aggregation methods:
  • SUM simply adds up the values of the measure for each child value to compute the value for the parent. This is the default (and most common) behaviour.
  • AVERAGE calculates the average of the values of the measure for each child value to provide the value for the parent.
  • LAST takes the last non-NA (Null) value of the child members and uses that as the value for the parent.

Sales quantities and revenues are usually aggregated over all dimensions using the SUM method, whereas inventory or headcount measures commonly require a different method (such as LAST) on the Time dimension and SUM for the other dimensions. More advanced aggregation methods, such as weighted average, are useful when aggregating measures such as Prices (weighted by Sales revenue).

Different Aggregation for Individual Measures
However, you are not limited to specifying that all measures of a cube have the same aggregation method. When adding measures to the cube, you can specify a different aggregation method, and accept the defaults of all the other measure settings.

For example, it is not uncommon for a single cube to contain measures such as Sales Revenue, Sales Quantity, Order Quantity, and Stock/Inventory Quantity. All these measures will aggregate using the SUM method over all dimensions, except for the Stock/Inventory measure. This requires a LAST method on the Time dimension (and SUM on all the others). Using the Rules tab for the Stock measure you can override the default aggregation method for Time and set the method to LAST, while retaining all the all other default settings from the cube.

Note: The ability to override cube settings for individual measures is not supported in compressed-cubes. If you use compression, and one of your measures requires a different aggregation method, you need to create it in a separate cube.

Aggregation Operators
There are a number of different aggregation operators available to you for summarizing data in your AW. The following is a brief description of each of the operators.

  • Average: Adds data values, and then divides the sum by the number of data values that are added together
  • Hierarchical Average: Adds data values, and then divides the sum by the number of children in the dimension hierarchy. Unlike Average, which counts only non-NA children, Hierarchical Average counts all the logical children of a parent, regardless of whether each child does or does not have a value.
  • Hierarchical Weighted Average: Multiplies non-NA child data values by their corresponding weight values, and then divides the result by the sum of the weight values. Unlike Weighted Average, Hierarchical Weighted Average includes weight values in the denominator sum even when the corresponding child values are NA. You identify the weight object in the Based On field.
  • Weighted Average: Multiplies each data value by a weight factor, adds the data values, and then divides that result by the sum of the weight factors. You identify the weight object in the Based On field.
  • First Non-NA Data Value: The first real data value
  • Hierarchical First Member: The first data value in the hierarchy, even when that value is NA
  • Hierarchical Weighted First: The first data value in the hierarchy multiplied by its corresponding weight value, even when that value is NA. You identify the weight object in the Based On field.
  • Weighted First: The first non-NA data value multiplied by its corresponding weight value. You identify the weight object in the Based On field.
  • Last Non-NA Data Value: The last real data value
  • Hierarchical Last Member: The last data value in the hierarchy, even when that value is NA
  • Hierarchical Weighted Last: The last data value in the hierarchy multiplied by its corresponding weight value, even when that value is NA. You identify the weight object in the Based On field.
  • Weighted Last: The last non-NA data value multiplied by its corresponding weight value. You identify the weight object in the Based On field.
  • Maximum: The largest data value among the children of each parent
  • Minimum: The smallest data value among the children of each parent
  • Non-additive (Do Not Summarize): Do not aggregate any data for this dimension. Use this keyword only in an operator variable. It has no effect otherwise.
  • Sum: Adds data values (default)
  • Scaled Sum: Adds the value of a weight object to each data value, and then adds the data values. You identify the weight object in the Based On field.
  • Weighted Sum: Multiplies each data value by a weight factor, and then adds the data values. You identify the weight object in the Based On field.
Aggregating Across Multiple Hierarchies
Most dimensions within real world models will have multiple hierarchies. In the image below, there are two separate hierarchies on the Time dimension.

On the Aggregation Rules tabbed page, when creating a cube, you can specify which hierarchy or hierarchies should be used for aggregation for that cube’s measures. You should select one or more hierarchies for each dimension being aggregated. If you omit a hierarchy, then no aggregate values are stored in it; they are always calculated in response to a query.

Because this may reduce query performance, generally you should omit a hierarchy only if it is seldom used. The default behaviour of AWM 10g is to select all hierarchies.

Aggregating Measures with Data Coming in at Different Levels
There are other occasions where careful selection of the hierarchies to use in aggregation is important, especially for measure data that arrives into the AW at different levels of aggregation.

Suppose you have an AW that contains Budget and Actuals cubes for the purposes of variance analysis. The leaf level for Actuals is the Day level, but Budgets are set at the Monthly level. Initially, you created a single Time hierarchy in which Year is the highest level and Day is the lowest level:

This is perfect for the aggregation hierarchy for the Actuals measures. However, there is an issue with the Budgets measure. If data is loaded at the Month level, but this hierarchy is used for the aggregation of Budgets, then aggregation may begin at the Day level. All the empty cells for Budget at the Day level would be interpreted as zeros for the purposes of aggregating the data, resulting in new monthly totals being calculated as zero.

To handle this situation, a recommended approach is to create a second hierarchy that stops at the Month level specifically for the purposes of aggregating Budgets:

You must deselect the hierarchy containing the Day level on the Aggregation Rules tab for the cube or measure in question. Use the Day-level hierarchy for the Actuals measures only. The Day-level hierarchy is the primary or default hierarchy for end users because it enables drilling down to the Day level, and Budgets are available at Month, Quarter, and Year, exactly as required.

Summarize To Tab
Within all OLAP models you will need to balance the desire to aggregate absolutely everything and the time taken to load into a cube and then aggregate that data. In general, the less you choose to pre-summarize when loading data into the AW, the higher the load placed on run-time queries. In this scenario, queries are likely to be a bit slower and the load on the server at query time is likely to be greater (for example, each user query is likely to be asking the server to do more calculations at a given point in time). Pre-calculated summaries are instantly available for retrieval and are generally faster to query.

However, it does not necessarily follow that full aggregation across all levels of all dimensions yields the best query performance. In many cases, partial summarization strategies can provide optimal build and aggregation performance with little noticeable impact on query performance.

Many experienced OLAP cube builders make the following recommendations regarding summarization strategies:

  • Large dimensions, and those with many deep levels and/or hierarchies, are typically the most “expensive” to aggregate over. They are also likely to be one of the sparse dimensions in the cube definition. For such dimensions, a common guideline is to decide to summarize using a “skip-level” approach—that is, to precalculate every other level in the hierarchy. This generally gives reasonably good results and a solid basis for further tuning (if required)
  • If there is a small, dense dimension (such as Time) as the first dimension on the list of dimensions for a cube, then it is often a good strategy to leave that dimension to completely summarize on the fly at run time, especially if a large number of sparse data-level combinations have been computed
AWM generally defaults to settings that reflect this advice, but you can tune the settings if you need to. But at least the defaults provide a good starting point for tuning a build if required. But be warned, adding more levels to be pre-summarized will require additional storage space.

When you build and test your AWs, it is a good idea to include time in your project plan to experiment with different summarization strategies. Estimating in advance the exact storage requirements and aggregation times of a multidimensional cube (especially a highly dimensional, sparse one) is extremely difficult. So, it is often the case that some tuning after data is properly understood improves the performance of builds and aggregations.

You can use a database package to help you plan your summarisation strategy. There are two procedures, part of the DBMS_AW package that can provide help and guidance:
  • The SPARSITY_ADVICE_TABLE procedure creates a table for storing the advice generated by the ADVISE_SPARSITY procedure
  • The ADVISE_SPARSITY procedure runs a series of queries against your data and make recommendations about what data to pre-summarize and what to leave for dynamic aggregation. The 11g release of Analytic Workspace Manager leverages this database feature and make recommendations directly inside the tool

Cache Tab
Caching improves run-time performance in sessions that repeatedly access the same data, which is typical in data analysis. Caching temporarily saves calculated values in a session so that you can access them repeatedly without recalculating them each time. You have two options:
  • Cache run-time aggregations using session cache: This is the default behaviour. This option ensures that any run-time aggregations that are completed during a session are cached for the remainder of the session, improving query performance as the session progresses. This setting is ideal for a larger number of OLAP applications, namely those that allow read-only analysis where the underlying data is not changing during a session.
  • Do not cache run-time aggregations: Select this option if the cube would be subject to what-if analysis and, therefore, it would be important that previously calculated summarizations are not reused.

In the next workshop we will review how to quickly and easily load data into a cube and then review some best practices for loading data within a production environment.

No comments: