Thursday, October 7, 2010

Incremental Refresh of Oracle Cubes

One of the more common questions I get is about how cubes are processed, in particular how and when Oracle cubes are processed incrementally. Here is a short summary of how the cube refresh process works, a few scenarios and some suggestions on being smart about solutions.

- The cube will load all data from the source table.

- You can limit the data loaded into the cube by a) presenting new/changed data only via a staging table or filtered view or b) making the cube refreshable using the materialized view refresh system and using a materialized view log table.

- The cube will understand if data has been loaded into a partition. If data has not been loaded into a partition, it will not be processed (aggregated).

- If a partition has had data loaded into it, it will processed (aggregated).

- The cube will understand if a loaded value is new, changed or existing and unchanged. Only new or changed values are processed (the cube attempts to aggregate only new and changed cells and their ancestors).

- Changing parentage of a member in a non-partitioned dimension will trigger a full solve of the cube (across all partitions).

- If a member is added to a non-partitioned dimension, the cube will attempt an incremental aggregation of that dimension (that is, the new member and ancestors only).

Here are two scenarios that illustrate how this works.

1) How to trigger a lot of processing in the cube during a refresh:

- Load from the full fact table rather than a staging table, filtered view or MV log table. The full table will be loaded into the cube.

- Load data into many partitions. Each partition will need to be processed. For example, load data for the last 36 months when the cube is partitioned by time.

- Load data into large partitions. For example, partition by year or quarter rather than quarter or month. Smaller partitions will process more quickly.

- Make frequent hierarchy (parentage) changes in dimensions. For example, realign customers with new regions or reorganize the product dimension during the daily cube refresh. This will trigger a full cube refresh.

2) How to efficiently manage a daily cube refresh:

- Load from staging tables, filtered views or MV log tables where the tables/views contain only new or updated fact data. This will reduce the load step of the build. This becomes more important the larger the fact table is.

- Use as fine grained partitioning strategy as possible. This will result in smaller partitions, which process more efficiently (full or incremental refresh) and offer more opportunity for parallel processing. Also, it is likely that fewer partitions will be processed.

There can be a trade off with query performance. Typically, query performance is better when partitions are at a higher level (e.g., quarter rather than week) because there may be fewer partitions to access and less dynamic aggregation might be required. That said, the gain in refresh performance is typically much greater than the loss in query performance. Building a cube twice as fast is often more important than a small slowdown in query performance.

- Only add new data into the partitioned dimension. For example if the cube is partitioned by time, add data only for new time periods. Only the partitions with those time periods will be refreshed.

Clearly, there are many cases where data must be added to the non-partitioned dimensions. For example, new customers might be added daily. This is ok because a) new members are processed incrementally and b) new customers will likely affect only more recent partitions.

Schedule hierarchy realignments (e.g., changing parentage in product, geography and account type dimensions) weekly instead of daily. This will limit the number of times a full refresh is required. It might also allow you to scheduled the full refresh for a time where the availability of system resources is high and/or the query load is low.

The above scenarios help you understand how to most efficiently refresh a single cube. Also consider business requirements and how model the overall solution Two scenarios follow.

1) Data is available at the day, item and store levels in the fact table. The business requirements are such that all data must be available for query, but in practice most queries (e.g., 95% or more) are at the week, item and store levels.

Consider a solution where data is loaded in the cube at the week, item and city levels and more detailed data (day, item and store levels) are made available by drilling through to the table. This is very easy to do in a product such as Oracle Business Intelligence (OBIEE) or any other tool that has federated query capabilities and will be transparent to the business user.

In this case, the cube is simply smaller and will process more quickly. The compromise is that calculations defined in the cube are only available at the week, item and city levels. This is often a reasonable trade off for faster processing (and perhaps more frequent updates).

2) Data is available at the day, item and store levels in the fact table. The business requirements are such that all data must be available for query, but in practice:

- Longer term trending (e.g., year over year) is done at the week level or higher.

- Analysis of daily data (e.g., same day, week or month ago) is only done for the most recent 90 days.

In this scenario, consider a two cube solution:

- Cube A contains data at the week, item and store levels for all history (e.g., the past 36 months). This might be partitioned at the week level and aggregated to the month, quarter and year levels. Depending on reporting requirements, it might only be refreshed at the end of the week when the full week of data is available.

- Cube B contains data only at the day level for the most recent 90 days. It is not aggregated to the week, month, quarter and year levels (aggregates are serviced by Cube A). This cube is used for the daily sales analysis. This might be partitioned at the day level so that a) any full build of the cube can be highly parallelized and b) the daily update processes only a single and relatively small partition.

Using a tool such as Oracle Business Intelligence, which has federated query capabilities, a single business model can be created that accesses data from both cubes and the table transparently to the business user. When ever the user is querying at the week level or above, data OBIEE queries Cube A. If data is queried at the day level within the most recent 90 days, OBIEE queries Cube B. If data at the day level that is older than 90 days is access, OBIEE queries the table. Again, this can all be transparent to the user in a tool such as Oracle Business Intelligence.

1 comment:

sushant kale said...

I have a question related with reaggrigation.

My scenario:
1) I have Product Dimension (dim) and Sales Cube.
2) In Sales Cube, product dim is used.
3) In first time, i have complete refresh of dim and cube.
4) But if i modify data in Product dim for one of hierarchy and complete refresh of dim. So new data will available in product dim and after that i have to complete refresh of Sales cube to get reflect new dim hierarchy data.
5) My question is, if i have changed data in dim for one of hierarchy and complete refresh. So is there any way that i can get this data in cube without complete refresh? Is there any re-aggregation functionality in oracle OLAP cube to get data reflect in cube without complete refresh?

Please answer soon?
Thanks in adv.