About the Series ...
This is the eleventh tutorial article of the series, MDX
in Analysis Services. The series is designed to provide hands-on
application of the fundamentals of MDX from the perspective of MS SQL Server
2000 Analysis Services ("Analysis Services,");
our primary focus is the manipulation of multidimensional data sources, using
MDX expressions, in a variety of scenarios designed to meet real-world business
intelligence needs.
For more
information on the series, as well as the hardware / software requirements to
prepare for the tutorials we will undertake, please see the first lesson of
this series: MDX
Concepts and Navigation.
Note: At the time of writing, Service
Pack 3 updates
are assumed for MSSQL Server 2000, MSSQL Server 2000 Analysis
Services, and the related Books Online and Samples.
The screen shots that appear in this article were taken from a Windows 2003
Server, and may appear somewhat different from coinciding views in other
operating systems.
Introduction
In our last tutorial, Control
Location of Processing, we began a three-part mini-series, Optimizing MDX. We stated
that the objective of this "triptych" is to explore methods for
optimizing the performance we obtain from our MDX queries. We introduced types
of intervention, or classifications of approaches, that we can
take, to enhance the performance of MDX, with the intent being to cover several
of the types as we progress through the mini-series, while introducing practice
examples to reinforce an awareness that transfers to our daily work with MDX.
In the first lesson of the Optimizing MDX set, we introduced the
first intervention type, control of location of query execution.
We focused on two of
the main ways to effect control of location, the Large Level Threshold
property and the Execution Location parameter, and performed hands-on
practice with the settings involved with each. In this lesson, we will pick up
where we left off, and:
-
Continue in
our considerations of the types of intervention we can use to optimize MDX
queries;
-
Explore
optimization of set operations and syntax arrangement considerations.
More on Location
In our last lesson, we focused on two
of the main ways to control processing location, the Large Level Threshold
property and the Execution Location parameter, as a part of our
discussion of the first intervention type, control of location of query
execution. While these two options provide perhaps the most straightforward
ways to control where a query is evaluated and executed, there are additional,
less direct ways to force processing in a desired location. We will consider
some of these approaches to conclude our discussion of the location control
intervention type.
Other Means of Influencing Execution Location
At the individual query
level, no means is readily available for a client application to direct
execution location. We can, however, mandate that large-level operations
execute at the client through the use of indirect means. The specification of
a named set for use within the query will force processing at the client
level. We can, therefore, create a named set (using either of the CREATE
SET or WITH SET clauses), containing members of a large level, at
the client, and then use the same named set within a query to force
client-based execution.
Other options for
indirect control of the processing location include calculated members
and calculated cells. The manner of creation of a calculated
member is important in determining its location-fixing effects. Using the CREATE
MEMBER or WITH MEMBER clauses, within a query to define a calculated
member at the server, will produce a calculated member that can be processed at
the server or client equally successfully. By contrast, using CREATE MEMBER
to produce a calculated member within a session will result in forced
client-based execution of the query that houses it.
Calculated cells may also force client-based
processing. Again, the manner in which the calculated cells are defined is
important in determining their location-fixing effects. A calculated cell that
is created with the CREATE CELL CALCULATION clause, at either the client
or the server, can be processed at the server. By contrast, the use of the WITH
CELL CALCULATION clause at the client will result in a query whose
processing will be client-based.
The
existence of two conditions can force a query to process on the server: a
reference to a filter operation within the query, and (consistent with our
discussion regarding large levels in our last
lesson) a large dimension level. Let's
take a look at the mechanics behind this in a little more detail.
We
will first fire up the MDX Sample Application, having seen in past
lessons how it provides an excellent platform from which to learn about MDX
and, as in this case, about the data and the metadata in our cube. Many of the
MDX operations that might be performed from a client application can be
simulated here or elsewhere, as we demonstrate in an article in our DatabaseJournal
Analysis Services series, Drilling
Through to Details: From Two Perspectives. The
Sample Application affords us another point of view of the interplay of the
OLAP data source and MDX.
NOTE: It is important to remember that
client applications will differ in many ways. Individual settings, design
characteristics, capabilities, and other considerations will likely mean
differences in operation and performance using the techniques we describe in
our lessons, as in other references.
1.
Start the MDX
Sample Application.
We are
initially greeted by the Connect dialog, shown in Illustration 1.
Illustration 1:
The Connect Dialog for the MDX Sample Application
The
illustration above depicts the name of my server, MOTHER1, and properly
indicates that we will be connecting via the MSOLAP provider (the
default).
The MDX
Sample Application window appears.
3.
Clear the top
area (the Query pane) of any remnants of queries that might appear.
4.
Ensure that FoodMart
2000 is selected as the database name in the DB box of the toolbar.
5.
Select the Warehouse
cube in the Cube drop-down list box.
The MDX
Sample Application window should resemble that depicted in Illustration 2,
complete with the information from the Warehouse cube displaying in the Metadata
tree (left section of the Metadata pane).
Illustration 2: The MDX Sample Application Window
(Compressed View)
We
will create an MDX query that helps us to "qualify" a second query as
containing elements that cause it to fall within the two conditions we have
exposed above that force a query to process on the server.
6.
Type the
following query into the query pane of the Sample Application:
-- MXAS11-1: Qualification through Count
WITH
MEMBER[Measures].[Count] AS
'COUNT({ [Product].[Product Name].Members})'
SELECT
{[Measures].[Count]}ON COLUMNs
FROM Warehouse
Again,
our intent here is to ascertain that an upcoming example expression ([Product].[Product
Name].Members) will "qualify" as meeting conditions that would
force it to process on the server. Therefore, in our first query, we are
simply obtaining a count of the members of the given level.
7.
Execute the
query using the Run Query button
The
results dataset appears as shown in Illustration 3.
Illustration 3: Results Datset, Count Query
8.
Save the query
in a convenient location as MXAS11-1.
We see
that [Product].[Product Name].Members refers to a genuine Large Level,
because the number of members in the Product Name level (1,560) of the Product
dimension exceeds the Large-Level Threshold we set in our last
lesson
(750). For that matter, it exceeds even the default threshold that existed
before our modifications (1000). (The number of members is also verifiable at
the RDBMS level in the FoodMart2000.mdb sample that is installed with
MSSQL Server 2000 Analysis Services).
Let's
use the level whose population we have just quantified in the COUNT
query above to illustrate. The inclusion of a filter within our query will also
be a driver for server-based processing, as we shall see.
9.
Create the
following new query:
-- MXAS11-2: Qualification through Count & Filter
SELECT
{[Measures].[Units Shipped]} ON COLUMNS,
TopCount ([Product].[Product Name].Members,
7, [Measures].[Units Shipped]) ON ROWS
FROM Warehouse
The
use of TopCount() above provides an instance where server-based
execution is likely to be appropriate, for the reasons we have already exposed.
Most of the large level will be pruned away by the filter action before
returning the result. As we noted earlier, existence of a filter operation
within the query is another driver for server processing.
10.
Execute the
query using the Run Query button.
The
results dataset appears as shown in Illustration 4.
Illustration 4: Results Set, TopCount() Query
11.
Save the query
as MXAS11-2.
While
level-size and filter requirements are good criteria to use in most cases for
determining the likelihood of forced server-based processing, there are
scenarios where even meeting or exceeding the parameters of these two criteria
will not force a query to execute on the server. Examples of these situations
include the presence of a function or functions within the query (say a user-defined
function that is registered solely on the client) that cannot execute on
the server. Also, as is somewhat obvious, a query that is executed against a local
cube will not process on a remote server.
We
have seen that options exist in the form of the Execution Location and Large
Level Threshold properties to influence the physical location in which a query
is processed. These two options within the query processing location intervention
type could allow us to take advantage of performance gains based upon our being
able to assign resources appropriately. While the Analysis Server provides for
tuning in its provision for the establishment of a Large Level Threshold,
as we saw in our previous session together, optimal syntax arrangement
in our queries is perhaps a more important consideration. Before we attempt to
optimize performance in this way, we need to gain a solid understanding of how
the PivotTable Service works.
The Importance of Optimal Arrangement
To paraphrase
some of what we have learned so far in the Optimizing MDX
mini-series, queries that contain levels whose populations of members exceed
the Large Level parameter will be processed by the server; queries with
levels whose member populations do not exceed the Large Level Threshold
in number are within the processing capabilities of the client-based PivotTable
Service operation.
The
manner in which the PivotTable Service processes queries can become a
consideration within the determination of the location of processing. The
service acts in such a way that each set (and every combination existing
therein) defined in the query is fully incarnated in memory before proceeding
with operations. Unsurprisingly, the demands on resources can be crippling for
even the most robust machines.
Let's
consider an example:
12.
Create the
following new query:
-- MXAS11-3: Pre-optimized Set Operation
SELECT
{[Measures].[Units Shipped]} ON COLUMNS,
TopCount (
CrossJoin(
[Warehouse].[City].Members,
[Product].[Product Name].Members
),
7,
[Measures].[Units Shipped]
) ON ROWS
FROM Warehouse
We
have expanded our existing query well beyond its last incarnation, adding
another dimension and a CrossJoin() function, as if to enhance it to
meet (as an illustration) a specific need of an information consumer. We know
that the query will be processed on the client if the population of the [Product].[Product
Name] level is less than the Large Level Threshold (1,000 default /
750 the setting from our last
lesson).
An
important fact to remember is that the threshold refers to the level's
members; it does not refer to the tuples that exist within a given set within a
query. Say our Large Level Threshold has been set at 2200. We already
know that we have 1560 named products (from the count we did above); we also
have 23 warehouse cities. While the "memberships" of both levels
fall well below the threshold of 2200, and thus qualify for client processing,
and while our results dataset is the top seven city-product tuples from the
perspective of units shipped, we have far more combinations, in the way of
resource requirements, with which to contend from the scope of our query.
Our
query will assemble over 35,000 combinations (1560 x 23) - all to arrive at the
small results dataset that we obtain in the next step.
13.
Execute the
query using the Run Query button.
The
results dataset appears as shown in Illustration 5.
Illustration 5: Results Dataset, CrossJoin() added
to Query
Our query
executes in a matter of moments, even though it is completing myriad more steps
than we may have considered. However, things might have turned out quite
differently, within the scope of the realities of similar - but much larger -
scenarios that we might have encountered in the real world.
14.
Save the query
as MXAS11-3.
Not
taking the real cost of the actual combinations, and relying upon the large
level and other level-member-based safeguards to protect us, may actually place
a load upon the client that would challenge even a robust server. At the heart
of this resource intensive situation lies our old friend the CrossJoin()
function, although there are many other similar potential participants in
inefficient query construction and operation. We will examine ways to manage
these scenarios in the next section.
The Role of Expression Arrangement in Optimization
As many
of us have learned, both in working with SQL and with MDX, the greatest strides
in performance tuning can often be gained by tuning the expressions
themselves. The simple rearrangement of a query can produce dramatic
performance gains, as we will see in this section. We have discussed the
control of resource use, to some extent, through the direction of a query's
location of execution. We also exposed a scenario in the previous section
where, even though the query might contain level populations that fall below
the threshold for server processing, client processing can be more expensive
than is apparent.
Let's
return to our example above, and see what we can do in the way of arranging
expressions to make our query less resource intensive. To do so, we will
resume where we left off within the Sample Application:
1.
Create a new
query to rephrase the last (saved as MXAS11-3), arranged as follows:
-- MXAS11-4: Optimized Set Operation
SELECT
{[Measures]. [Units Shipped]} ON COLUMNS,
TopCount (
Generate ( [Warehouse].[City].Members,
TopCount ( CrossJoin ({ [Warehouse].CurrentMember},
[Product].[Product Name].Members
), 7, [Measures].[Units Shipped]
)
), 7, [Measures].[Units Shipped]
) ON ROWS
FROM Warehouse
In our
rearrangement of query MXAS11-3, we are exploiting the GENERATE function
to rephrase our query to "determine the top seven tuples that we obtain
from cross-joining the warehouse city and named products, for the current
warehouse city. Then marry the results with the rest of the members of the
warehouse city level."
The
number of combinations that result within this query is 1,721 (1 x 1,560 named
products, plus 7 x 23 cities). This represents a significant reduction over
the number of combinations in the pre-optimized query, and results in a
corresponding reduction of over 95 percent.
As we
will see from the next step, the results will be identical to those in the
pre-optimized version.
2.
Execute the
query using the Run Query button.
The
results dataset appears as shown in Illustration 6.
Illustration 6: Results Dataset, Optimized
CrossJoin() Query
3.
Save the query
as MXAS11-4.
Through
this small exercise, we have illustrated the dramatic performance gains that
can be obtained by simply rephrasing a query. To summarize the approach, we
might keep in mind that we can replace complex CROSSJOIN() statements
with a GENERATE() and CROSSJOIN() combination, as we have
demonstrated above. The following general rule summarizes the approach:
CrossJoin (X,Y) =
GENERATE(X, CrossJoin(X.CurrentMember, Y)
Or
CrossJoin (X,Y) =
GENERATE (CrossJoin(X,Y.CurrentMember), Y)
Let's
take a look at another scenario, where the objective of our query is the
summation of two measures, Units Shipped and Warehouse Sales, for
a given set of products, stores and warehouses (all products, in all non-U.S.
stores and warehouses, in our example). We can approach the query in a couple
of ways, using calculated measures to effect the summations.
Let's
create a straightforward query first, where we create calculated members that
we combine with the measures we seek to obtain the correct calculations.
Now,
let's run the query and note the results.
2.
Execute the query
using the Run Query button.
The
results dataset appears as depicted in Illustration 7.
Illustration 7: Results Dataset, Pre-Optimized
Summation Query
3.
Save the query
as MXAS11-5.
Now
let's consider another approach to obtaining the same objective. This time, we
will place each set of calculation logic into a respective calculated measure.
4.
Create the
following new query:
-- MXAS11-6: Optimized Set Operation # 2
WITH
MEMBER [Measures].[Total Units Shipped] AS
'SUM (
{[Warehouse].[All Warehouses].[Canada],
[Warehouse].[All Warehouses].[Mexico]},
SUM (
{[Store].[All Stores].[Canada],
[Store].[All Stores].[Mexico]},
SUM (
{[Product].[All Products].[Drink],
[Product].[All Products].[Food]},
[Measures].[Units Shipped]
)))'
MEMBER [Measures].[Total Warehouse Sales] AS
'SUM (
{[Warehouse].[All Warehouses].[Canada],
[Warehouse].[All Warehouses].[Mexico]},
SUM (
{[Store].[All Stores].[Canada] ,[Store].[All Stores].[Mexico]},
SUM (
{[Product].[All Products].[Drink],[Product].[All Products].[Food]},
[Measures].[Warehouse Sales]
)))'
SELECT
{ [Measures].[Total Units Shipped],
[Measures].[Total Warehouse Sales] } ON COLUMNS,
{[Time].[1998]} ON ROWS
FROM Warehouse
Again,
we will run the query and observe the results.
5.
Execute the
query using the Run Query button.
The
results dataset appears as shown in Illustration 8.
Illustration 8: Results Dataset, Optimized
Summation Query
6.
Save the query
as MXAS11-6.
We
note that the results are identical for the two approaches to the query (with
the exception of the rounding of the Warehouse Sales amounts and the
measure names, either of which could be adjusted, were they significant
concerns). The second approach, however, performs appreciably better, as would
be evident were the query extracting a much larger amount of data from a
real-world data source. The following general rule summarizes the use of the
commutative nature of the SUM function (and applies equally to the MAX,
MIN, and TopCount() functions):
SUM (CrossJoin (X,Y), b) =
SUM(X, SUM(Y, b))
In
both the examples of performance enhancements we have explored above, and in
any other potential performance enhancement scenario we might contemplate, we
would need to consider the cost / benefit tradeoffs, actual performance metrics
against different sources, and whether the differences in performance are too
negligible to devote much time to attempting to increase performance further.
The point is that tuning of MDX expressions can be accomplished, both in the
manner suggested above or through other approaches, to maximize performance,
and that beginning with the MDX statements themselves is usually the best approach
to any optimization effort. Once the base query is tuned to an optimal state,
we can always consider client / server balancing, hardware upgrades, and
perhaps further measures to increase performance.
Summary and Conclusion ...
In
this lesson, we continued our focus on the use of control of location as
a primary intervention type within the realm of options available to us
for MDX query optimization. We performed a practice exercise to reinforce the
concepts exposed. We then extended our considerations of the available types
of intervention to include the optimization of set operations and syntax
arrangement considerations. Within our exploration of the importance of
the optimization of set operations, we undertook practice examples that
illustrated some ways we can rearrange queries to enhance performance, often
significantly.
In our
next lesson, the final article of the Optimizing MDX mini-series, we
will, expose methods of caching to load a commonly used slice of a cube
into memory, making for faster retrieval in prospective operations. Our
discussion will include various aspects of cache creation, and uses of caching
within MDX. In addition, we will touch upon other performance enhancement
options, including external functions and cube design modifications
and augmentation.
»
See All Articles by Columnist William E. Pearson, III