Friday, April 13, 2012

Fine Tuning Incremental Updates using LOAD PRUNE

If you are like most people, you probably use the LOAD_AND_AGGREGATE cube script that is automatically created by Analytic Workspace Manager. Fine tuning the update process simply involves filtering the fact table for new or changed rows.

With a little bit of effort, you can improve update times by writing your own cube processing script. You can also use MV log tables to automatically captured changes made to the fact table and use them as the data sources to cube updates.

AWM defines and makes the LOAD_AND_AGGREGATE script the default script of the cube. If you don’t specify a different script, LOAD_AND_AGGREGATE is automatically used as shown in the following example (note that the script references the OLAPTRAIN.SALES_CUBE but does not including the USING clause).


This script will run the LOAD PARALLEL and SOLVE PARALLEL commands. What this means is that for each partition, the database will LOAD data from the fact table/view and then SOLVE (aggregate) data. If you have specified a value for parallel that is greater than 1, partitions will be processed in parallel (in the example above, 4 processes). AWM also provides the ability to set the refresh method (C, or complete, in the above example).

LOAD_AND_AGGREGATE is a good choice for a full build, but it might not be the best choice for an incremental update. If you are simply updating the cube with changes within a few recent partitions (e.g., yesterday or this month), the LOAD PRUNE command is probably better than LOAD PARALLEL.

LOAD PRUNE will first query the fact table or view to first determine which partition will have new data using a SELECT DISTINCT. It will then only generate LOAD commands for those partitions that will have records loaded into them.
Let’s run through an update scenario. Make the following assumptions:

* The time dimension has months for 2008 through 2012 and the cube is partitioned by month. The cube will have 60 partitions.

* You have loaded data into the cube for January 2008 through March 2012.

* It’s now time to load data for April 2012. This data has been inserted into the fact table.

* You have mapped the cube to a view. For the April 2012 update, you have added a filter to the view so that it returns data only for April.

If you use the LOAD_AND_AGGREGATE script and choose the FAST SOLVE refresh method, the database will really to the following:


With LOAD PARALLEL, the database will process the LOAD command for each partition (all 60). Since it’s selecting from a view that’s filtered out all but April 2012, 59 partitions will have no new or changed data. Although it doesn’t take a long time to load 0 rows and figure out that a SOLVE is not required, it still adds up if there are a lot of partitions.

With LOAD PRUNE, the database will determine that a LOAD is only required for April 2012. The LOAD step is skipped for all other partitions. While you will still see the SOLVE for all partitions, it doesn’t really do any work because no rows were loaded into the partition. An example using LOAD PRUNE follows.


If you would like a script that walked through a complete example using the OLAPTRAIN schema, including the use of an MV log table to automatically capture changes to the fact table, send me an email with a link to this posting.

Tuesday, April 3, 2012

Excel and OLAP: ODBC vs. MDX

A question that I often get is "what is the difference between using ODBC and the MDX Provider for Oracle OLAP (from Simba Technologies) to query Oracle cubes"? Given that the Oracle cube is easily queried with SQL, it's a reasonable question.
The answer really boils down to leveraging meta data and automatic query generation.
With ODBC, it's up to the Excel user to write a SQL query to fetch data from the cube. Data can be returned in tabular format or a pivot table. When the data is viewed in a pivot table Excel will aggregate data, sometimes with unexpected results. For example Excel might choose to aggregate a measure such as Sales with COUNT or might try to SUM a measure such as Sales YTD Percent Change. Neither make any sense. It's up to the user to get it right.
With the MDX Provider, Excel understands what all the columns mean. It understands dimensions, hierarchies and levels. It's understand the difference between a key and a label. It knows what a measure is. It allows the server to calculate the data. Query generation is automatic. Business users just choose hierarchies and measures and the MDX Provider does the rest.
Here's a list of some of the advantages of using the MDX Provider for Oracle OLAP as compared to using ODBC and writing your own SQL.

Monday, April 2, 2012

Oracle OLAP Exadata Performance Demonstration

For a great paper on Oracle OLAP running on Exadata, see:

The Executive Overview section of this paper provides an introduction:

This paper describes a performance demonstration of the OLAP Option to the Oracle Database running on an X2-2 Exadata Database Machine half rack. It shows how Oracle OLAP cubes can be used to enhance the performance and analytic content of the data warehouse and business intelligence solutions, supporting a demanding user community with ultrafast query and rich analytic content.

The demonstration represents users of a business intelligence application using SQL to query an Oracle OLAP cube that has been enhanced with a variety of analytic measures. The cube contains data loaded from a fact table with more than 1 billion rows.

Utilizing Exadata features such as Smart Flash Cache, Oracle Database supported a community of 50 concurrent users querying the cube with queries that are typical of those executed from a business intelligence tool such as Oracle Business Intelligence Enterprise Edition.

With each user querying the database non-stop (without waits between queries) with median query times ranged from .03 to .58 seconds, average query times ranged from .26 to 2.32 seconds, and 95 percent of queries returned in 1.5 to 5.5 seconds, depending on the type of query.

Query performance can be attributed to highly optimized data types and Exadata Smart Flash Cache. Cubes are designed for fast access to random data points, using features such as array-based storage, cost-based aggregation, and joined cube scans. Exadata Smart Flash Cache contributes significantly to cube query performance, virtually eliminating IO wait for the high volume, random IO typically seen with cube queries.