LU Factorization Case Study Using FAST: Dataflow Parallelism with the Forte Application Scalability Tool
By Brad Lewis and Kris Richards, Sun Microsystems
Traditional parallelization techniques are often limited in their
usefulness by inherent serialism and data dependencies. The data flow approach
provided by the Forte Application Scalability Tool (FAST) enables you to extract
the theoretical maximum parallelism from a problem. This article demonstrates
that while traditional parallelization methods on an LU factorization problem
produce a performance boost of 2.7x on a 4-CPU system for a problem size of N=1000,
scaling is not sustained: using 8 CPUs gives a speedup of only 3.7x. By using
FAST data flow parallelism, scalability increased to 3.37x on 4-CPU systems, and
5.64x on 8-CPU systems. With FAST and some hand tuning, the Sun Performance Library
team was able to achieve 23.8x with the LINPACK benchmark on a 24-CPU system.
Introduction
Starting with an overview of
the LU Factorization algorithm, the problems with using
traditional methods for parallelizing this algorithm are discussed.
The FAST simulator will be introduced which allows you to see these
problems in the design stage. Next we will show how FAST overcomes
these problems and can be used to generate highly parallel code that
delivers more performance. Finally we will show how FAST allows the
user to easily modify an algorithm as we present two simple
modifications that increase the performance above the initial FAST
implementation.
This article highlights the advantages provided by FAST and therefore will
not cover all the implementation details. The FAST data files and source code
will be provided for those interested in exploring the implementation in depth.
Algorithm Detail
We will use the LAPACK routine DGETRF as a baseline implementation.
It computes a factorization of the form:
A = P * L * U, where P is a permutation matrix, L
is lower triangular with unit diagonal elements (lower trapezoidal if m >
n), and U is upper triangular (upper trapezoidal if m < n). This is
often used in the process of solving linear systems (for example, the LINPACK
benchmark). A block algorithm is employed that uses level 3 BLAS.
The main processing is performed by a loop which performs the three
steps described below and shown in Figure 1.
- F2 and SWAP - An vector
factorization(F2) is performed on the upper left block in the
matrix. The column below the block is updated with row swaps(SWAP)
and other elementary row operations.
- Triangular Solve - A triangular solve is performed of the form
A X = B where A is a triangular matrix, B
is a matrix of right hand side vectors and X is the solution matrix.
The upper left block which was reduced to triangular form during the factorization
is the A matrix. The remainder of the top block row of the matrix
holds the B matrix which is overwritten with the solution matrix,
X. Each column is an independent right-hand side and can be processed
independently, providing an easy means to parallelize this step.
- Matrix Multiply- A matrix multiply of the form C = AB
is performed where the A matrix is the part of the first block column
updated with the swap. The B matrix is the top block row updated
with the triangular solve. The remainder of the matrix to the right and
below F2 (shown in blue in Figure 1) holds the C matrix. Each element
in the C matrix can be computed independently providing an easy means
to parallelize this step with a variety of domain decompositions.
- Repeat steps 1 through
3 moving down and to the right until all the matrix has been
processed.
Figure
1: Data Written by LU Factorization Steps.
The algorithm moves down and to the right one block and repeats the procedure.
Figure 2 shows the regions that are processed in the second iteration. The processing
continues down into the lower-right corner.
Figure
2: Data Written for the Second Iteration of the LU Factorization.
LU with Parallel Kernels
The majority of the computation in the LU Factorization is performed in the
matrix multiply and the triangular solve. Over 90% of the computation from the
LAPACK subroutine DGETRF, which performs the LU factorization on a double precision
real matrix, can be attributed to the BLAS3 routines DGEMM(matrix multiply) and
DTRSM(triangular solve). Both of these can be parallelized by writing simple drivers.
Drivers for both these routines are contained in the file PARALLEL_LU1/dgetrf.90.
Parallel BLAS is also available as part of Sun Performance Library. These parallel
drivers are sufficient for a system with a small number of processors. For a 1000
X 1000 matrix the following percent scalability was observed. Perfect scalability,
100 percent, would indicate that a computation run in parallel on N processors
runs N times faster than it does on one processor. Scalability is computed as
(T_1 * N) /T_N where T_1 is the time for a one processor run, N is the number
of processors, and T_N is the time for the run on N processors.
Figure
3: LU Scalability with Parallel Kernels.
This decrease in scalability shows that as more processors are
added they are used less efficiently. This is predicted by Amdahl's Law which
can be stated as
Total_time = Parallel_time / NCPUS + Serial_time
where
the total time for a computation to complete on a given number of
CPUs is a function of the amount of the computation that can be
parallelized, Parallel_time, and the amount that must be done
serially, Serial_time. As the number of processors increase the
serial time comprises more of the total runtime, even if it is much
smaller than the parallel time to begin with. This is exactly the
case in the LU Factorization.
Figure 4 shows the timeline display from the FAST profiler. The timeline shows
what task is being performed at each time during a run. Each horizontal bars represents
the processes happening in one thread of execution. The horizontal axis represents
time. This timeline shows the first second of the run. The color of the bar represents
what task is being performed at each point. For two threads, most of the work
is spent in computation (represented by blue, green, and yellow segments of the
bands). Very little time is spent idly, waiting for work (represented by red).
Figure
4: Two processor Timeline (1 second).
For the eight processor case, shown in Figure 5, red is a dominate color indicating
that a large number of cycles are lost waiting for work. The waits that occur
while a single thread performs the factorization now appear to be much larger.
The duration of the waits appear larger because parallelization has decreased
the duration of the parallel events. Also seven threads are sitting idle instead
of one.
Figure
5: Eight processor Timeline (1 second).
The timelines in Figure 4 and Figure 5 visually affirm the truth
of Amdahl's Law, small serial regions will kill parallel performance on large
number of processors.
FAST Simulations
The FAST Simulator enables the user to experiment with an algorithm
to see if a change would benefit the overall run without going through the work
of implementing it. It can also be used to simulate performance on large multi-processor
systems that may not be available during development.
Before the simulator can be used, a FAST project must be developed with the
FAST Editor. The file FAST_SIM/dgetrf.flo is a project containing
a FAST version of the DGETRF subroutine. From this file FAST can generate source
code with Simulation Instrumentation enabled. Applications that use this routine
will produce a simulation file. FAST_SIM/dgetrf_fast.dag is a simulation
file for a 1000 X 1000 factorization. The file contains data dependency information
so that it can produce dependency graph for the tasks performed during the run.
For this simulation file the graph shows a set of tasks that must execute sequentially
as shown in top of the simulators graph view window in Figure 6. The graph view
also displays the data regions and properties for nodes in the graph.
Figure
6: Serial LU Graph.
Each task that is performed in the algorithm has a node in the Sun ONE Studio
explorer tree with an associated set of properties. These properties are used
as parameters for simulating a parallel run. This allows a user to see the effect
of code modifications in the design stage without going through the effort of
implementing them. Figure 7 shows the properties for the task that performs matrix
multiply.
Figure
7: Task Properties.
The scale property allows the user to multiply the execution time for
a task by a factor to increase or decrease its serial run time. The shatter
property allows the work being done in one node to be broken into a number of
smaller pieces to be performed in parallel. For the LU, shattering the matrix
multiply and the triangular solve allows the user to examine what the performance
would be if execution was based on the dependency graph in Figure 8.
Figure
8: Shattered LU Dependency Graph.
The simulation execution times produced by FAST for the LU accurately predict
the results achieved with parallel BLAS3 drivers as shown in the following graph.
Figure
9: Simulated Results vs. Measured Results.
The simulator allows developers to see the results of parallelizing a section
of code or speeding it up without actually doing the work. This leads to more
efficient development as effort can be concentrated on worthwhile areas.
Parallel FAST LU
The FAST project in FAST_IMP1/dgetrf.flo contains the definition
of a parallel version of the LU. The steps performing the matrix multiply and
and triangular solve were modified to be parallel. Both steps were decomposed
to act on each block of columns independently. This is accomplished by redrawing
the data regions for that step and defining movements for those regions. The code
to perform these tasks did not change from the serial version.
Additionally a loop was added to do the trailing swaps. These
were handled as part of the factorization step in the serial implementation.
Adding a second loop causes these swaps to execute after the main loop has completed.
Removing the computation from the main loop simplifies the data dependencies
for the main computation.
To change the step to a parallel step the user must decide how to decompose
the work that is being done. The matrix multiply step from the serial project
has the three data regions shown in Figure 10 on the left of the arrow.
Figure
10: Matrix Multiply Decomposition.
To decompose this computation so that each column can be computed independently
the regions need to be deleted and redrawn as shown to the right of the arrow
in Figure 10. Then movements are define for the new small regions to move them
across the entire area that needs to be processed. In FAST, movements indicate
parallelism. A similar modification is made to the triangular solve step so that
each column is computed independently. The data dependencies for the LU can then
be expressed with the graph below.
Figure
11: FAST LU Dependency Graph.
This graph shows that the data dependencies allow for more potential parallelism
than used in the traditional approach. The yellow nodes represent the column factorizations
that are the only serial task. The data dependencies indicate that this task can
run concurrently with other tasks. From the second factorization node in the graph,
there is only a data dependency on one of the preceding matrix multiply nodes
shown as blue. Only one matrix multiply and one triangular solve, shown as green,
are required to be completed before the second factorization node can start executing.
The rest of the matrix multiply and triangular solve nodes represent work that
can potentially overlap the serial factorization. By taking advantage of this
the scalability is improved as shown below.
Figure
12: FAST Scalability.
Using the FAST profiler timeline view, it can be seen that this is exactly
the behavior for a run created from this project. The timeline shown in Figure 13
demonstrates FAST parallelization. The previously serial factorizations are running
in parallel with matrix multiply and triangular solve computations. This is seen
at several places in the display where a vertical line through the display would
intersect one band colored yellow and other bands that are blue and green.
Figure
13: FAST Priorities.
FAST LU Priorities
Task
Priority
Factorization
1
Triangular Solve
3
Matrix Multiply
2
Figure 13 shows that though the initial FAST implementation allowed some of
the factorization stage to run in parallel with other computation there is still
a lot of time wasted by threads waiting for work. As the run proceeds, the amount
of parallel work decreases so that there is none left to overlap with the factorizations
and stalls result. If the factorizations could be started sooner so that they
would run while there was more work to overlap with, then the scalability would
increase. FAST allows for a priority to be set for each step that is performed
in an algorithm. The file FAST_IMP2/dgetrf.flo contains a FAST project
with the priorities set as shown in the table opposite.
Using these priorities FAST produced another increase in scalability as shown
in Figure 14.
Figure
14: FAST Scalability with priorities.
The timeline in Figure 15 shows how the improvement was achieved. This run
manages to get more of the serial factorizations done while there is parallel
work to overlap with. From the previous version the factorizations (yellow areas)
have been executed sooner. In the timeline display they have slid to the left
and are closer together.
Figure
15: Timeline for LU with Priorities ( 1 second).
Critical Path Analysis
If we could further compress the factorization steps we could increase scalability
more. As we've seen, for each iteration, before the factorization can start, only
one matrix multiply from the previous iteration must complete. The multiply on
this column is the only multiply that lies on the critical path. We can get this
multiply done sooner by parallelizing it further into square blocks so that each
block row in the column is computed independently. The file FAST_IMP3/dgetrf.flo
contains an implementation using this algorithm. As with the other changes we've
described, no code changes were necessary for this change in the algorithm. FAST
will make the necessary changes when it generates new FORTRAN source. This new
algorithm results in the graph shown in Figure 16.
Figure
16: LU Graph with Super-parallel Matrix Multiply.
The four nodes highlighted in the graph perform the matrix multiply for one
column. Each acts on a block of rows within that column. These are super-parallel
in the sense that they are more parallel than the other nodes that perform the
matrix multiply on columns. Using this approach, a further increase in scalability
was achieved as reflected in the graph below (see Figure 17).
Figure
17: Scalability of LU with Critical Path Modifications.
As shown in the timeline in Figure 18, the yellow factorization bands have
been pushed even further left. This indicates that the factorizations are starting
earlier, allowing more of them to run concurrently with parallel work. This results
in fewer stalls at the end of the run.
Figure
18: Timeline for LU with Critical Path Modifications (1 second).
Conclusion
Using FAST the scalability of this algorithm was greatly increased from under
50% to over 70% on 8 processors. FAST allows for parallelization based on data
dependencies which offers a way to overcome Amdahl's Law by allowing serial tasks
to run concurrently with other tasks. Amdahl's Law limits the number of processors
most real algorithms can run on efficiently to a small number. FAST allows users
to run algorithms more efficiently on parallel system with minimal coding effort.
With additional effort FAST can allow algorithms to be tuned to run efficiently
on larger hardware. Figure 19 shows the run-times for a 2000 X 2000 LU on 16 CPUs
using algorithms discussed in this paper. FAST allows the user to overcome Amdahl's
Law and approach perfect even in scaling algorithms that are not embarrassingly
parallel, such as the LU Factorization.
Figure
19: Times for LU Factorization.
Related Information
About the Authors
Brad Lewis is a Software Engineer in the Sun ONE Tools division of Sun
Microsystems. He has worked in the Sun Performance Library Group for the last
7 years, optimizing and parallelizing linear algebra routines.
Kris Richards is member of the Performance Libraries Group at Sun Microsystems.
He earned his PhD in mathematics from Colorado State University. His research
focussed around numerical partial differential equations and heterogeneous cluster
computing.