Introducing OpenMP: A Portable, Parallel Programming API for Shared Memory Multiprocessors
By Neelakanth Nadgir and Richard Friedman

Sun ONE Studio compilers (C/C++/Fortran 95) support OpenMP parallelization natively. OpenMP is an emerging standard model for parallel programming in a shared memory environment. It provides a set of pragmas that let programmers parallelize their code easily. This article gives a brief introduction to OpenMP and is of particular interest to programmers who are new to OpenMP and to parallel programming in Fortran, C, or C++.

What Is OpenMP?

OpenMP is a set of specifications and interfaces for parallelizing programs in a shared memory environment. OpenMP provides a set of pragmas which, when used in a program, direct an OpenMP-aware compiler to generate an executable that runs on multiple processors in parallel. No other source code modifications are necessary (other than fine tuning to get the maximum performance). OpenMP pragmas give you an elegant and uniform interface for parallelizing programs on various architectures and systems. OpenMP is a widely accepted specification, and vendors like Sun, KAI, and SGI support it. OpenMP specifications are currently available for the Fortran, C, and C++ programming languages. (See the OpenMP website for the latest OpenMP specification documents.)

OpenMP takes parallel programming to the next level by creating and synchronizing threads for you. All you need to do is insert appropriate pragmas in the source program, and then build the program with a compiler supporting OpenMP. The compiler interprets these pragmas and parallelizes the code following the pragma. When using compilers that are not OpenMP-aware, the OpenMP pragmas are silently ignored.

(This article gives examples of using OpenMP with C programs. Equivalent pragmas exist for Fortran 95 as well. See the OpenMP User's Guide for details.)

OpenMP Pragmas

The OpenMP specification defines a set of pragmas. These pragmas are compiler directives that say how to process the block of code that follows the pragma. The most basic pragma is #pragma omp parallel, which denotes a parallel region. The main thread of execution is called the master thread. Once the master thread encounters the parallel pragma, it creates a team of worker threads that then distribute the work among themselves and the master thread. The environment variable OMP_NUM_THREADS controls the number of worker threads that are created. At the end of the parallel region, all threads wait for each other (the same effect as a barrier pragma) and the program continues executing sequentially with the master thread.
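
As a minimal sketch of a parallel region, the function below lets every thread in the team announce itself and counts how many threads took part. The name count_team_members is a hypothetical helper, not part of OpenMP, and the _OPENMP guards are only there so that the same source still builds with a compiler that ignores the pragmas.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns how many threads executed the parallel region. */
int count_team_members(void)
{
    int members = 0;

    #pragma omp parallel
    {
        int id = 0;                  /* declared inside the region: private to each thread */
#ifdef _OPENMP
        id = omp_get_thread_num();   /* the master thread has id 0 */
#endif
        printf("hello from thread %d\n", id);

        /* members is shared, so the update must be protected */
        #pragma omp atomic
        members++;
    }   /* implicit barrier: all threads wait here, then the master continues alone */

    return members;
}
```

Built without OpenMP support, the region runs once on a single thread and the function returns 1.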

OpenMP supports two basic kinds of parallelism: loops and sections. The #pragma omp for is used for loops. The #pragma omp sections is used for sections, with each block inside it marked by #pragma omp section. Sections are blocks of code that can be executed in parallel. These pragmas can be used in a nested fashion. A combination of parallel for and sections pragmas can also be used.
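
The two kinds of work sharing can be sketched together as follows. The function fill_and_add and its parameters are hypothetical names chosen for illustration: two independent sections fill the input arrays, and a work-shared loop then adds them element by element.

```c
/* Hypothetical helper: fills x and y in two parallel sections,
   then computes out[i] = x[i] + y[i] in a work-shared loop. */
void fill_and_add(double *x, double *y, double *out, int n)
{
    #pragma omp parallel
    {
        /* Each section is executed once, by one thread of the team. */
        #pragma omp sections
        {
            #pragma omp section
            for (int i = 0; i < n; i++)
                x[i] = 1.0;

            #pragma omp section
            for (int i = 0; i < n; i++)
                y[i] = 2.0;
        }   /* implicit barrier: x and y are complete before the loop below starts */

        /* The iterations of this loop are divided among the threads. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            out[i] = x[i] + y[i];
    }
}
```

The loop indices are declared inside each for statement, so every thread gets its own copy without any private clause.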

The #pragma omp master instructs the compiler that the following block of code is to be executed by the master thread only. The #pragma omp barrier instructs all threads to wait for each other. There is an implicit barrier pragma at the end of a parallel region. The #pragma omp single indicates that only one thread should execute the following block of code. This thread may not necessarily be the master thread. You can protect blocks of code that are not threadsafe by using the #pragma omp critical pragma. Of course all of these make sense only in the context of a parallel pragma (parallel region).
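
The four pragmas described above can be seen side by side in the following sketch. The function sync_demo and its counters are hypothetical and exist only to make the behavior observable.

```c
/* Hypothetical helper: returns the number of threads that passed through
   the critical section, and reports how often the master and single
   blocks were executed. */
int sync_demo(int *master_runs, int *single_runs)
{
    int visits = 0;
    *master_runs = 0;
    *single_runs = 0;

    #pragma omp parallel
    {
        #pragma omp master
        (*master_runs)++;       /* executed by the master thread only */

        #pragma omp barrier     /* all threads wait here for each other */

        #pragma omp single
        (*single_runs)++;       /* executed by exactly one thread, not necessarily the master */

        #pragma omp critical
        visits++;               /* every thread executes this, one at a time */
    }

    return visits;
}
```

After the call, both counters are 1 regardless of the team size, while the return value equals the number of threads in the team.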

A simple matrix multiplication program shows how to use OpenMP to parallelize a program. Consider the following small code fragment that multiplies two matrices. This is a very simple example; if you really want a good matrix multiply routine, you will have to consider cache effects or use a better algorithm (Strassen's, or Coppersmith and Winograd's, etc.).

/* c = a * b, where a is nrows x nrows and b is nrows x ncols */
for (ii = 0; ii < nrows; ii++){
  for (jj = 0; jj < ncols; jj++){
    c[ii][jj] = 0;
    for (kk = 0; kk < nrows; kk++){
       c[ii][jj] += a[ii][kk] * b[kk][jj];
    }
  }
}

Parallelizing the above code segment is straightforward: Insert the #pragma omp parallel for pragma before the first loop. It is beneficial to place the pragma on the outermost loop, since that usually gives the largest performance gain. Each iteration of the outer loop writes a different row of c, so there are no dependencies between iterations. The index ii of the parallelized loop is made private automatically, but the inner loop indices jj and kk are shared by default when they are declared outside the parallel region, so they must be declared private. The preceding code now becomes:

#pragma omp parallel for private(jj, kk)
for (ii = 0; ii < nrows; ii++){
  for (jj = 0; jj < ncols; jj++){
    c[ii][jj] = 0;
    for (kk = 0; kk < nrows; kk++){
       c[ii][jj] += a[ii][kk] * b[kk][jj];
    }
  }
}
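
For readers who want something they can compile, here is one possible self-contained form of the parallel multiply. It is a sketch under assumptions that are not in the fragment above: the function name matmul is hypothetical, the matrices are stored row-major in flat arrays, and the left operand is square.

```c
/* Hypothetical helper: c (n x m) = a (n x n) * b (n x m),
   with all matrices stored row-major in flat arrays. */
void matmul(const double *a, const double *b, double *c, int n, int m)
{
    /* Each iteration of the outer loop writes a different row of c,
       so the iterations are independent of each other. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            double acc = 0.0;   /* declared inside the loop: private to each thread */
            for (int k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * m + j];
            c[i * m + j] = acc;
        }
    }
}
```

Declaring the inner indices and the accumulator inside the loops makes them private to each thread, so no data-sharing clauses are needed.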

As another example, consider the following code fragment that finds the sum of f(x) for 0 <= x < n.

for(ii = 0; ii < n; ii++){
   sum = sum + some_complex_long_fuction(a[ii]);        
}

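To parallelize the above fragment, the first step could be the following. The temporary value must be private so that each thread works on its own copy, and the update of the shared variable sum is protected by a critical section:

#pragma omp parallel for shared(sum) private(value)
for(ii = 0; ii < n; ii++){
   value = some_complex_long_fuction(a[ii]);
   #pragma omp critical
   sum = sum + value;
}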

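or better, you can use the reduction clause. It gives each thread its own private copy of sum and combines the copies when the loop ends, so sum must not also be listed in a private clause:

#pragma omp parallel for reduction(+: sum)
for(ii = 0; ii < n; ii++){
   sum = sum + some_complex_long_fuction(a[ii]);
}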
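
A complete version of the reduction might look like the sketch below. The functions square and sum_of_squares are hypothetical stand-ins: square takes the place of the expensive function in the fragment above.

```c
/* Hypothetical stand-in for an expensive function of one value. */
static double square(double x)
{
    return x * x;
}

/* Hypothetical helper: sums square(a[i]) for 0 <= i < n. */
double sum_of_squares(const double *a, int n)
{
    double sum = 0.0;

    /* Each thread accumulates into its own private copy of sum;
       the copies are added together when the loop finishes. */
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum = sum + square(a[i]);

    return sum;
}
```

Because the order in which the partial sums are combined is unspecified, floating-point results can differ from the serial result in the last digits when the values are not exactly representable.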

OpenMP provides a few runtime environment variables that can be used to control the behavior of an OpenMP program. The most important and widely used variable is OMP_NUM_THREADS. OMP_NUM_THREADS determines the number of worker threads that will be created when the master thread encounters a parallel region. The general rule is to make the number of threads equal to the number of processors in the system.
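
A program can check the thread count it has been given through the OpenMP runtime library. The sketch below wraps omp_get_max_threads() in a hypothetical helper named available_threads, with a fallback of 1 when the compiler has no OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: upper bound on the number of threads the next
   parallel region may use. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();   /* reflects OMP_NUM_THREADS when it is set */
#else
    return 1;                       /* pragmas are ignored, so the code runs serially */
#endif
}
```

Setting the variable in the shell before starting the program, for example OMP_NUM_THREADS=4, changes the value this function reports without recompiling.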

How to Begin

There are several ways to parallelize programs. First, determine if you need parallelization. Sometimes, parallelization requires big machines, and some algorithms are not suitable for parallelizing. If you are starting a new project, you could choose an algorithm that can be parallelized. It is very important to be sure that the code is correct (serially) before trying to parallelize it. Be sure to maintain timings of your serial run, so that you can decide if parallelization is useful.
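
One simple way to keep timings of the serial run is a small wrapper around the routine being measured. This is a sketch only: time_call, now_seconds, and busy_work are hypothetical names, and busy_work stands in for the real computation.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: current time in seconds. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();                    /* wall-clock time */
#else
    return (double)clock() / CLOCKS_PER_SEC;   /* CPU time of the process */
#endif
}

/* Hypothetical helper: returns the seconds spent inside fn. */
double time_call(void (*fn)(void))
{
    double start = now_seconds();
    fn();
    return now_seconds() - start;
}

/* Hypothetical stand-in for the routine being measured. */
void busy_work(void)
{
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;
}
```

Calling time_call(busy_work) before and after adding pragmas gives a first impression of whether parallelization pays off. Compare wall-clock times for this purpose, because the CPU time reported by clock() adds up the time used by all threads.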

Compile the serial version with several optimization options. The compiler can generally perform more low-level optimizations than you can. Try using the automatic parallelization options of the compiler. Delegating parallelization to the compiler makes it easier for you to maintain a common source code base. The autoparallelizer can also help you identify pieces of code that can be parallelized, or point out things in the code that could prevent parallelization (for example, a function call inside a for loop). To see this commentary, compile your program with the -g flag and use the er_src(1) utility in the Sun ONE Studio Compiler Collection (formerly Forte Developer).

er_src program_binary_file   function_name

Identify bottlenecks in the program using a profiling tool, such as Forte Performance Analyzer or Rational Quantify. This should help you identify the routines (hot routines) where most of the time is spent. It is important that this is user CPU time, and not system time, since system time may be sequential time (for example, two threads trying to read the same disk segment).

Once you have identified the hot routines, study them to find loops that do much of the computation. Try using the -xautopar option of the Forte C compiler to identify loops that the compiler thinks can be parallelized. Identify shared and private variables by studying the interloop dependencies. Parallelize them using OpenMP pragmas. If you are lucky they should work fine. If not, try setting OMP_NUM_THREADS to 1 and see if the correct results are generated. You can also use dbx's runtime checking or tools like AssureView to find bugs in the program.

Mixing OpenMP With MPI

MPI (Message Passing Interface) is another model for parallel programming. Unlike OpenMP, MPI spawns multiple processes that then communicate by passing messages, for example over TCP/IP. Since these processes do not share the same address space, they can run on remote machines (or a cluster of machines). It is difficult to say whether OpenMP or MPI is better. They both have their advantages and disadvantages. What is more interesting is that OpenMP can be used with MPI. Typically, you would use MPI to coarsely distribute work among several SMP machines, and then use OpenMP to parallelize at a finer level. For more information on mixed mode programming, see Mixed Mode MPI/OpenMP Programming.

Tools for Using OpenMP

For more information on C, C++, and Fortran support for Sun compilers, please see: Sun Studio 10 software.

You can profile your OpenMP programs using Sun Studio 10 software.
