LU Factorization Case Study Using FAST: Dataflow Parallelism with the Forte Application Scalability Tool
LU Factorization Case Study Using FAST: Dataflow Parallelism with the Forte Application Scalability Tool   By Brad Lewis and Kris Richards, Sun Microsystems  

Traditional parallelization techniques are often limited in their usefulness by inherent serialism and data dependencies. The data flow approach provided by the Forte Application Scalability Tool (FAST) enables you to extract the theoretical maximum parallelism from a problem. This article demonstrates that while traditional parallelization methods on an LU factorization problem produce a performance boost of 2.7x on a 4-CPU system for a problem size of N=1000, scaling is not sustained: using 8 CPUs gives a speedup of only 3.7x. By using FAST data flow parallelism, scalability increased to 3.37x on 4-CPU systems, and 5.64x on 8-CPU systems. With FAST and some hand tuning, the Sun Performance Library team was able to achieve 23.8x with the LINPACK benchmark on a 24-CPU system.

Introduction

Starting with an overview of the LU Factorization algorithm, the problems with using traditional methods for parallelizing this algorithm are discussed. The FAST simulator will be introduced which allows you to see these problems in the design stage. Next we will show how FAST overcomes these problems and can be used to generate highly parallel code that delivers more performance. Finally we will show how FAST allows the user to easily modify an algorithm as we present two simple modifications that increase the performance above the initial FAST implementation.

This article highlights the advantages provided by FAST and therefore will not cover all the implementation details. The FAST data files and source code will be provided for those interested in exploring the implementation in depth.

Algorithm Detail

We will use the LAPACK routine DGETRF as a baseline implementation. It computes a factorization of the form:
A = P * L * U,  where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n). This is often used in the process of solving linear systems (for example, the LINPACK benchmark). A block algorithm is employed that uses level 3 BLAS.

The main processing is performed by a loop which performs the three steps described below and shown in Figure 1.

  1. F2 and SWAP - An vector factorization(F2) is performed on the upper left block in the matrix. The column below the block is updated with row swaps(SWAP) and other elementary row operations.
  2. Triangular Solve - A triangular solve is performed of the form A X = B where A is a triangular matrix, B is a matrix of right hand side vectors and X is the solution matrix. The upper left block which was reduced to triangular form during the factorization is the A matrix. The remainder of the top block row of the matrix holds the B matrix which is overwritten with the solution matrix, X. Each column is an independent right-hand side and can be processed independently, providing an easy means to parallelize this step.
  3. Matrix Multiply- A matrix multiply of the form C = AB is performed where the A matrix is the part of the first block column updated with the swap. The B matrix is the top block row updated with the triangular solve. The remainder of the matrix to the right and below F2 (shown in blue in Figure 1) holds the C matrix. Each element in the C matrix can be computed independently providing an easy means to parallelize this step with a variety of domain decompositions.
  4. Repeat steps 1 through 3 moving down and to the right until all the matrix has been processed.
Figure 1: Data Written by LU Factorization Steps. Figure 1: Data Written by LU Factorization Steps.

The algorithm moves down and to the right one block and repeats the procedure. Figure 2 shows the regions that are processed in the second iteration. The processing continues down into the lower-right corner.

Figure 2: Data Written for the Second Iteration of the LU Factorization. Figure 2: Data Written for the Second Iteration of the LU Factorization.

LU with Parallel Kernels

The majority of the computation in the LU Factorization is performed in the matrix multiply and the triangular solve. Over 90% of the computation from the LAPACK subroutine DGETRF, which performs the LU factorization on a double precision real matrix, can be attributed to the BLAS3 routines DGEMM(matrix multiply) and DTRSM(triangular solve). Both of these can be parallelized by writing simple drivers. Drivers for both these routines are contained in the file PARALLEL_LU1/dgetrf.90. Parallel BLAS is also available as part of Sun Performance Library. These parallel drivers are sufficient for a system with a small number of processors. For a 1000 X 1000 matrix the following percent scalability was observed. Perfect scalability, 100 percent, would indicate that a computation run in parallel on N processors runs N times faster than it does on one processor. Scalability is computed as (T_1 * N) /T_N where T_1 is the time for a one processor run, N is the number of processors, and T_N is the time for the run on N processors.

Figure 3: LU Scalability with Parallel Kernels. Figure 3: LU Scalability with Parallel Kernels.

This decrease in scalability shows that as more processors are added they are used less efficiently. This is predicted by Amdahl's Law which can be stated as

Total_time = Parallel_time / NCPUS + Serial_time

where the total time for a computation to complete on a given number of CPUs is a function of the amount of the computation that can be parallelized, Parallel_time, and the amount that must be done serially, Serial_time. As the number of processors increase the serial time comprises more of the total runtime, even if it is much smaller than the parallel time to begin with. This is exactly the case in the LU Factorization.

Figure 4 shows the timeline display from the FAST profiler. The timeline shows what task is being performed at each time during a run. Each horizontal bars represents the processes happening in one thread of execution. The horizontal axis represents time. This timeline shows the first second of the run. The color of the bar represents what task is being performed at each point. For two threads, most of the work is spent in computation (represented by blue, green, and yellow segments of the bands). Very little time is spent idly, waiting for work (represented by red).

Figure 4: Two processor Timeline (1 second).

For the eight processor case, shown in Figure 5, red is a dominate color indicating that a large number of cycles are lost waiting for work. The waits that occur while a single thread performs the factorization now appear to be much larger. The duration of the waits appear larger because parallelization has decreased the duration of the parallel events. Also seven threads are sitting idle instead of one.

Figure 5: Eight processor Timeline (1 second). Figure 5: Eight processor Timeline (1 second).

The timelines in Figure 4 and Figure 5 visually affirm the truth of Amdahl's Law, small serial regions will kill parallel performance on large number of processors.

FAST Simulations

The FAST Simulator enables the user to experiment with an algorithm to see if a change would benefit the overall run without going through the work of implementing it. It can also be used to simulate performance on large multi-processor systems that may not be available during development.

Before the simulator can be used, a FAST project must be developed with the FAST Editor. The file FAST_SIM/dgetrf.flo is a project containing a FAST version of the DGETRF subroutine. From this file FAST can generate source code with Simulation Instrumentation enabled. Applications that use this routine will produce a simulation file. FAST_SIM/dgetrf_fast.dag is a simulation file for a 1000 X 1000 factorization. The file contains data dependency information so that it can produce dependency graph for the tasks performed during the run. For this simulation file the graph shows a set of tasks that must execute sequentially as shown in top of the simulators graph view window in Figure 6. The graph view also displays the data regions and properties for nodes in the graph.

Figure 6: Serial LU Graph. Figure 6: Serial LU Graph.

Each task that is performed in the algorithm has a node in the Sun ONE Studio explorer tree with an associated set of properties. These properties are used as parameters for simulating a parallel run. This allows a user to see the effect of code modifications in the design stage without going through the effort of implementing them. Figure 7 shows the properties for the task that performs matrix multiply.

Figure 7: Task Properties. Figure 7: Task Properties.

The scale property allows the user to multiply the execution time for a task by a factor to increase or decrease its serial run time. The shatter property allows the work being done in one node to be broken into a number of smaller pieces to be performed in parallel. For the LU, shattering the matrix multiply and the triangular solve allows the user to examine what the performance would be if execution was based on the dependency graph in Figure 8.

Figure 8: Shattered LU Dependency Graph. Figure 8: Shattered LU Dependency Graph.

The simulation execution times produced by FAST for the LU accurately predict the results achieved with parallel BLAS3 drivers as shown in the following graph.

Figure 9: Simulated Results vs. Measured Results. Figure 9: Simulated Results vs. Measured Results.

The simulator allows developers to see the results of parallelizing a section of code or speeding it up without actually doing the work. This leads to more efficient development as effort can be concentrated on worthwhile areas.

Parallel FAST LU

The FAST project in FAST_IMP1/dgetrf.flo contains the definition of a parallel version of the LU. The steps performing the matrix multiply and and triangular solve were modified to be parallel. Both steps were decomposed to act on each block of columns independently. This is accomplished by redrawing the data regions for that step and defining movements for those regions. The code to perform these tasks did not change from the serial version.

Additionally a loop was added to do the trailing swaps. These were handled as part of the factorization step in the serial implementation. Adding a second loop causes these swaps to execute after the main loop has completed. Removing the computation from the main loop simplifies the data dependencies for the main computation.

To change the step to a parallel step the user must decide how to decompose the work that is being done. The matrix multiply step from the serial project has the three data regions shown in Figure 10 on the left of the arrow.

Figure 10: Matrix Multiply Decomposition. Figure 10: Matrix Multiply Decomposition.

To decompose this computation so that each column can be computed independently the regions need to be deleted and redrawn as shown to the right of the arrow in Figure 10. Then movements are define for the new small regions to move them across the entire area that needs to be processed. In FAST, movements indicate parallelism. A similar modification is made to the triangular solve step so that each column is computed independently. The data dependencies for the LU can then be expressed with the graph below.

Figure 11: FAST LU Dependency Graph. Figure 11: FAST LU Dependency Graph.

This graph shows that the data dependencies allow for more potential parallelism than used in the traditional approach. The yellow nodes represent the column factorizations that are the only serial task. The data dependencies indicate that this task can run concurrently with other tasks. From the second factorization node in the graph, there is only a data dependency on one of the preceding matrix multiply nodes shown as blue. Only one matrix multiply and one triangular solve, shown as green, are required to be completed before the second factorization node can start executing. The rest of the matrix multiply and triangular solve nodes represent work that can potentially overlap the serial factorization. By taking advantage of this the scalability is improved as shown below.

Figure 12: FAST Scalability. Figure 12: FAST Scalability.

Using the FAST profiler timeline view, it can be seen that this is exactly the behavior for a run created from this project. The timeline shown in Figure 13 demonstrates FAST parallelization. The previously serial factorizations are running in parallel with matrix multiply and triangular solve computations. This is seen at several places in the display where a vertical line through the display would intersect one band colored yellow and other bands that are blue and green.

Figure 13: FAST Priorities. Figure 13: FAST Priorities.
FAST LU Priorities Task Priority Factorization 1 Triangular Solve 3 Matrix Multiply 2

Figure 13 shows that though the initial FAST implementation allowed some of the factorization stage to run in parallel with other computation there is still a lot of time wasted by threads waiting for work. As the run proceeds, the amount of parallel work decreases so that there is none left to overlap with the factorizations and stalls result. If the factorizations could be started sooner so that they would run while there was more work to overlap with, then the scalability would increase. FAST allows for a priority to be set for each step that is performed in an algorithm. The file FAST_IMP2/dgetrf.flo contains a FAST project with the priorities set as shown in the table opposite.

Using these priorities FAST produced another increase in scalability as shown in Figure 14.

Figure 14: FAST Scalability with priorities. Figure 14: FAST Scalability with priorities.

The timeline in Figure 15 shows how the improvement was achieved. This run manages to get more of the serial factorizations done while there is parallel work to overlap with. From the previous version the factorizations (yellow areas) have been executed sooner. In the timeline display they have slid to the left and are closer together.

Figure 15: Timeline for LU with Priorities ( 1 second). Figure 15: Timeline for LU with Priorities ( 1 second).

Critical Path Analysis

If we could further compress the factorization steps we could increase scalability more. As we've seen, for each iteration, before the factorization can start, only one matrix multiply from the previous iteration must complete. The multiply on this column is the only multiply that lies on the critical path. We can get this multiply done sooner by parallelizing it further into square blocks so that each block row in the column is computed independently. The file FAST_IMP3/dgetrf.flo contains an implementation using this algorithm. As with the other changes we've described, no code changes were necessary for this change in the algorithm. FAST will make the necessary changes when it generates new FORTRAN source. This new algorithm results in the graph shown in Figure 16.

Figure 16: LU Graph with Super-parallel Matrix Multiply. Figure 16: LU Graph with Super-parallel Matrix Multiply.

The four nodes highlighted in the graph perform the matrix multiply for one column. Each acts on a block of rows within that column. These are super-parallel in the sense that they are more parallel than the other nodes that perform the matrix multiply on columns. Using this approach, a further increase in scalability was achieved as reflected in the graph below (see Figure 17).

Figure 17: Scalability of LU with Critical Path Modifications. Figure 17: Scalability of LU with Critical Path Modifications.

As shown in the timeline in Figure 18, the yellow factorization bands have been pushed even further left. This indicates that the factorizations are starting earlier, allowing more of them to run concurrently with parallel work. This results in fewer stalls at the end of the run.

Figure 18: Timeline for LU with Critical Path Modifications (1 second). Figure 18: Timeline for LU with Critical Path Modifications (1 second).

Conclusion

Using FAST the scalability of this algorithm was greatly increased from under 50% to over 70% on 8 processors. FAST allows for parallelization based on data dependencies which offers a way to overcome Amdahl's Law by allowing serial tasks to run concurrently with other tasks. Amdahl's Law limits the number of processors most real algorithms can run on efficiently to a small number. FAST allows users to run algorithms more efficiently on parallel system with minimal coding effort. With additional effort FAST can allow algorithms to be tuned to run efficiently on larger hardware. Figure 19 shows the run-times for a 2000 X 2000 LU on 16 CPUs using algorithms discussed in this paper. FAST allows the user to overcome Amdahl's Law and approach perfect even in scaling algorithms that are not embarrassingly parallel, such as the LU Factorization.

Figure 19: Times for LU Factorization. Figure 19: Times for LU Factorization.

Related Information
About the Authors

Brad Lewis is a Software Engineer in the Sun ONE Tools division of Sun Microsystems. He has worked in the Sun Performance Library Group for the last 7 years, optimizing and parallelizing linear algebra routines.
Kris Richards is member of the Performance Libraries Group at Sun Microsystems. He earned his PhD in mathematics from Colorado State University. His research focussed around numerical partial differential equations and heterogeneous cluster computing.

Close    To Top
  • Prev Article-OS:
  • Next Article-OS:
  • Now: Tutorial for Web and Software Design > OS > Solaris > OS Content
    Photoshop Tutorial
     

    Special Effect

      3D Effect
      Photoshop Articles
    Programming Tutorial
     

    C/C++ Tutorial

      Visual Basic
      C# Tutorial
    Database Tutorial
     

    MySQL Tutorial

      MS SQL Tutorial
      Oracle Tutorial
    Geek Tutorial
     

    Blogging Tutorial

      RSS Tutorial
      Podcasting Tutorial
    Graphic Design Tutorial
      Coreldraw Tutorial
      Illustrator Tutorial
      3D Tutorials
    Webmaster Articles
     

    Domain Service

      Web Hosting
      Site Promotion
    Java Tutorial/ Articles
     

    Java Servlets

      JavaEE Tutorial
     

    JavaBeans Tutorial

    XML Tutorial/ Articles
     

    XML Style

      AJAX Tutorial
      XML Mobile
    Flash Tutorial/ Articles
     

    Flash Video

      Action Script
      Flash Articles
    OS Tutorial/ Articles
      Linux Tutorial
      Symbian Tutorial
      MacOS Tutorial
    Personal Tech
      Hardware Tutorial
      Software Tutorial
      Online Auction