Tip: How Many Threads Does It Take?
Sometimes we can observe an OpenMP program using a different number of threads each time it is run. Why does that happen?
For example, the following program appears in the OpenMP User's Guide as a
demonstration of nested parallelism, in which a team of more than one thread
executes a nested parallel region:
#include <omp.h>
#include <stdio.h>

/* Print the size of the calling thread's team, once per team. */
void report_num_threads(int level)
{
    #pragma omp single
    {
        printf("Level %d: number of threads in the team - %d\n",
               level, omp_get_num_threads());
    }
}

int main()
{
    /* Disable dynamic adjustment of the number of threads. */
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(1);
        #pragma omp parallel num_threads(2)
        {
            report_num_threads(2);
            #pragma omp parallel num_threads(2)
            {
                report_num_threads(3);
            }
        }
    }
    return(0);
}
Compiling and running this program with nested parallelism enabled produces
the following output:
% setenv OMP_NESTED TRUE
% a.out
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
At level 1 a team of two threads is created, and each of those threads in turn
creates its own team of two at the next level, and so on.
Compare this with the result of running the same program with nested
parallelism disabled:
% setenv OMP_NESTED FALSE
% a.out
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 1
Level 3: number of threads in the team - 1
Level 2: number of threads in the team - 1
Level 3: number of threads in the team - 1
The User Guide goes on to demonstrate how setting the SUNW_MP_MAX_POOL_THREADS
environment variable can control the number of threads in the pool:
The thread pool consists of only non-user threads that the runtime library
creates. It does not include the master thread or any thread created explicitly
by the user's program. If this environment variable is set to zero, the thread
pool will be empty and all parallel regions will be executed by one thread.
The following example shows that a parallel region can get fewer threads
if there are not sufficient threads in the pool. The code is the same as
above. The number of threads needed for all the parallel regions to be active
at the same time is eight. The pool needs to contain at least seven idle
threads. If we set
SUNW_MP_MAX_POOL_THREADS to 5, two of the four inner-most parallel
regions may not be able to get all the slave threads they ask for. One possible
result is shown below.
% setenv OMP_NESTED TRUE
% setenv SUNW_MP_MAX_POOL_THREADS 5
% a.out
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 1
Level 3: number of threads in the team - 1
But you may run the same program and get the following output:
% a.out
Level 1: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 2: number of threads in the team - 2
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 1
Level 3: number of threads in the team - 2
Level 3: number of threads in the team - 2
Note here that there are seven level 3 threads, not the six shown in the first
run. Is this a bug, or expected? And how can it be explained?
Well, note that the program can have at most eight (2x2x2) level 3 threads.
Depending on how the operating system schedules the threads, a user may see
six, seven, or eight level 3 threads.
At level 2 there are four threads: T1, T2, T3, and T4. Each wants to create a
parallel region with a team of two threads. The maximum number of threads in
this process is six (SUNW_MP_MAX_POOL_THREADS+1) and four are already in use,
so only two threads are left to serve as slave threads at level 3.
Suppose T1, T2, T3, and T4 try to acquire slave threads at the same time, and
T1 gets one and T2 gets one, but T3 and T4 do not. Then there are 2+2+1+1=6
level 3 threads. Now suppose T1 gets one and T2 gets one, and T1 finishes its
parallel region and returns its slave thread to the pool just as T3 tries to
get one. T3 may then get the thread returned by T1. If T4 does not get one,
there are 2+2+2+1=7 level 3 threads. If T4 is also able to get the one
returned by T2, there are 2+2+2+2=8 level 3 threads. Any of these scenarios is
possible, depending on the timing of the events and the scheduling by the
operating system.
And that is why the User Guide uses the phrase "one possible
result".
(Page last updated May 3, 2005)