Selecting the Best Compiler Options
By Darryl Gove, Senior Performance Engineer, Sun Microsystems, June 24, 2005
This article suggests how to get the best performance from an UltraSPARC or x86/AMD64 (x64) processor running on the latest Solaris systems by compiling with the best set of compiler options and the latest compilers. These are suggestions of things you should try, but before you release the final version of your program, you should understand exactly what you have asked the compiler to do.
The fundamental questions
There are two questions that you need to ask when compiling your program:
- What do I know about the platforms that this program will run on?
- What do I know about the assumptions that are made in the code?
The answers to these two questions determine what compiler options you should use.
The target platform
What platforms do you expect your code to run on? The choice of platform determines:
- 32-bit or 64-bit instruction set
- Instruction set extensions the compiler can use
- Instruction scheduling depending on instruction execution times
- Cache configuration
The first three are often the most important ones.
32-bit versus 64-bit code
The UltraSPARC and x64 families of
processors can run both 32-bit and 64-bit code. In general it is not
possible to determine, without testing the application, whether
better performance will be obtained with 32-bit or 64-bit code; there
are several factors which influence performance:
- When moving from 32-bit to 64-bit code, the memory footprint of the application typically gets bigger, because long, unsigned long, and pointers all change from 32 bits to 64 bits in size. Because of this, some applications will run more slowly.
- The programming model for the UltraSPARC processor allows
32-bit applications to use the same set of features as 64-bit
applications. As such there is often little to be gained by targeting
64-bit.
- The x64 platform has a number of fundamental architecture
improvements which can enhance performance when using 64-bit code. In
particular the number of registers available for the application to use
has significantly increased -- this fact gives the compiler a number of
opportunities to extract performance out of the application.
- The primary and critical reason to use 64-bit code is that the application handles more data in memory than a 32-bit address space (4 GB) can address.
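The footprint effect described in the first point can be seen in a small C sketch. The struct `record` and the helpers `footprint_bytes` and `pointer_bits` are hypothetical names chosen for illustration, and the sizes quoted in the comments assume the usual ILP32 and LP64 data models.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical record type, used only for illustration. */
struct record {
    int            id;     /* 4 bytes in both ILP32 and LP64        */
    long           count;  /* 4 bytes in ILP32, 8 bytes in LP64     */
    struct record *next;   /* 4 bytes in ILP32, 8 bytes in LP64     */
};

/* Bytes needed to hold n records in the current compilation model. */
size_t footprint_bytes(size_t n)
{
    return n * sizeof(struct record);
}

/* Width of a pointer in bits: 32 in a 32-bit build, 64 in a 64-bit build. */
unsigned pointer_bits(void)
{
    return (unsigned)(sizeof(void *) * CHAR_BIT);
}
```

With typical alignment rules, `sizeof(struct record)` is 12 bytes in a 32-bit build and 24 bytes in a 64-bit build, so the same array of records needs roughly twice the memory (and cache) when compiled as 64-bit code.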
For additional details about migrating from 32-bit to 64-bit code, refer to "Converting 32-bit Applications Into 64-bit Applications: Things to Consider" and "64-bit x86 Migration, Debugging, and Tuning, With the Sun Studio 10 Toolset".
Specify the Target Platform and Architecture as Explicitly as Possible
The target platform specifies the processor that the application is expected to run on, the minimum processor that is required, and whether the application is 32-bit or 64-bit. Compiler versions prior to the Sun Studio 9 release targeted a pre-UltraSPARC processor by default; Sun Studio 9 and later compilers target an UltraSPARC processor for the SPARC architecture, and a generic x86-based processor for the x86 architecture. It is always a good idea to specify the target architecture explicitly, so that it does not change unexpectedly when the compiler defaults or other compiler flags change.
There are a number of compiler flags that work together to specify the target architecture. The flag -xtarget sets all the other flags (-xarch, -xchip, and -xcache) to appropriate default values for the given target processor. The flag -xarch sets the instruction set that the processor supports, and the flag -xchip specifies how the compiler should use these instructions. Finally, the flag -xcache specifies the structure of the caches for this target (however, this flag may not have any impact for many codes). As with all compiler flags, the order is important. Flags accumulate from left to right; if there are conflicting settings, the flag on the right overrides the values of flags specified earlier on the command line.
A point to be cautious of is that specifying a more recent hardware target may mean that older hardware is no longer able to run the application. In particular, specifying the target as being an UltraSPARC platform means that the application will no longer run on pre-UltraSPARC processors (however, UltraSPARC processors have been shipping for over 10 years). Similarly, specifying an Opteron processor will mean that the code no longer runs on x86-compatible processors that do not have the SSE2 instruction set extensions.
Using -xtarget=generic
The compiler supports the options -xtarget=generic
and -xtarget=generic64. These options tell
the compiler to produce code which runs well on as wide a range of
machines as possible. The compiler evolves the meaning of 'generic' as
new processors are introduced, so the flag is the best option if the
same binary has to be run over a range of processors.
One feature of these flags is that they will be interpreted appropriately on both the SPARC and x64 platforms -- so using them may mean fewer changes to Makefile flags. The following table shows how the compiler will interpret the -xtarget=generic flag on both the SPARC and x64 platforms:

Flag                 SPARC                 x64
-xtarget=generic     V8plus architecture   386 architecture
-xtarget=generic64   V9 architecture       AMD64 architecture
Specifying the target platform for the UltraSPARC-III family of processors
Because -xtarget=generic
favours code that runs well on a wide range of processors rather than
on a particular processor, there may be times when it does not produce
the best performance. Consequently it is worth comparing the
performance of the generic code with a build of the application
specifically targeted for a particular processor family.
For UltraSPARC processors, a
generally good option pair to use is -xtarget=ultra3
with -xarch=v8plusa. These options allow the
compiler to generate 32-bit code that can run on all the members of the
UltraSPARC family and their follow-ons (UltraSPARC I, UltraSPARC II,
UltraSPARC III, UltraSPARC IV). The compiler will also schedule the
code especially for the UltraSPARC III. These options represent a good
compromise, since code scheduled for the UltraSPARC III is better at
taking advantage of the new features of the UltraSPARC III
architecture, while still providing good performance on previous
generations of processors.
If the application requires 64-bit addressing, then the appropriate flags to use are -xtarget=ultra3 -xarch=v9a, which add 64-bit addressing whilst still targeting all the members of the UltraSPARC family of processors.
Recommended compiler flags for the UltraSPARC platform

32-bit code   -xtarget=ultra3 -xarch=v8plusa
64-bit code   -xtarget=ultra3 -xarch=v9a
Specifying the target processor for the x64 processor family
By default the compiler targets a 32-bit generic x86-based processor, so the code will run on any x86 processor from a Pentium Pro up to an AMD Opteron architecture. Whilst this produces code that can run over the widest range of processors, it does not take advantage of the extensions offered by the Opteron family of processors. Consequently, it is suggested that 32-bit code target the Opteron processor. This generates code that will run on processors (such as the Pentium 4 and Opteron) which support the SSE2 instruction set extensions.
To take advantage of the x64
processor family and the advantages of 64-bit code, the appropriate
compiler flags are -xtarget=opteron -xarch=amd64.
Recommended compiler flags for the x64 platform

32-bit code   -xtarget=opteron
64-bit code   -xtarget=opteron -xarch=amd64
Optimization and debug
The optimization flags chosen alter three important characteristics: the runtime of the compiled application, the length of time that the compilation takes, and the amount of debugging that is possible with the final binary. In general, the higher the level of optimization, the faster the application runs (and the longer it takes to compile), but the less debug information is available. The particular impact of optimization levels will vary from application to application.
The easiest way of thinking about this is to consider three
degrees of optimization, as outlined in the following table.
Purpose             Flags                         Comments
Full debug          [no optimization flags] -g    The application will have full debug capabilities, but almost no optimization will be performed on the application, leading to lower performance.
Optimized           -g -O [-g0 for C++]           The application will have good debug capabilities, and a reasonable set of optimizations will be performed on the application, typically leading to significantly better performance.
High optimization   -g -fast [-g0 for C++]        The application will have good debug capabilities, and a large set of optimizations will be performed on the application, typically leading to higher performance.
Note: For C++, the debug flag -g will inhibit some of the inlining of methods, whereas the flag -g0 will provide debug information without inhibiting the inlining of these methods. Consequently, it is recommended that -g0 be used instead of -g at higher levels of optimization.
Suggestion: In general, an optimization level of at least -O is suggested. The two situations where lower levels might be considered are (i) where more detailed debug information is required and (ii) where the semantics of the program require that variables are treated as volatile, in which case the optimization level should be lowered to -xO2.
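To illustrate what "treated as volatile" means in case (ii), consider a variable that is changed outside the normal flow of control, for example by a signal handler. The sketch below is not from the original article: the names `stop_requested`, `on_signal`, and `request_stop_via_signal` are hypothetical. Without the `volatile` qualifier, an optimizing compiler may keep such a value in a register and never see the update; where the source can be changed, qualifying the variable is an alternative to lowering the optimization level for the whole program.

```c
#include <signal.h>

/* Hypothetical flag that is written by a signal handler.
   volatile forces every read and write to go to memory. */
static volatile sig_atomic_t stop_requested = 0;

/* Hypothetical handler: records that a stop was requested. */
static void on_signal(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Installs the handler, raises SIGINT in this process, and returns
   the value of the flag afterwards (1 if the update was observed,
   -1 if the handler could not be installed). */
int request_stop_via_signal(void)
{
    stop_requested = 0;
    if (signal(SIGINT, on_signal) == SIG_ERR)
        return -1;
    raise(SIGINT);               /* the handler runs before raise() returns */
    return (int)stop_requested;
}
```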
More details on debug information
The compiler will generate information for the debugger if the -g flag is present. For lower levels of optimization, the -g flag disables some minor optimizations (to make the generated code easier to debug). At higher levels of optimization, the presence of the flag does not alter the code generated (or its performance) -- but be aware that at high levels of optimization it is not always possible for the debugger to relate the disassembled code to the exact line of source, or for it to determine the value of local variables held in registers rather than stored to memory.
As discussed earlier, the C++ compiler will disable some of the inlining performed by the compiler when the -g compiler flag is used. However, the flag -g0 will tell the compiler to do all the inlining that it would normally do as well as generating the debug information.
A very strong reason for compiling with the -g flag is that the Sun Studio Performance Analyzer can then attribute time spent in the code directly to lines of source code -- making the process of finding performance bottlenecks considerably easier.
Suggestion
- Always compile with -g since it should not make much (if any) difference to performance. Your program will be easier to debug and analyze.
- On x86 platforms, the -xregs=frameptr option allows the compiler to use the frame pointer as an unallocated callee-saves register, which can result in increased run-time performance. However, this compile option should not be used during debugging because the debugger and Performance Analyzer will be unable to correctly determine the contents of the stack.
Using the -fast Option
The compiler option -fast is a 'macro' option, meaning that it stands for a number of options that generally give good performance on a range of codes. But there are a number of pros and cons regarding -fast that you should be aware of.
Pros:
- -fast is a single option that generally gives good performance on a range of codes.
Cons:
- The -fast option lets the compiler assume that the target platform the code will run on is the same platform on which it was compiled (because it includes -xtarget=native). Therefore you may need to explicitly set the target platform. For example: -fast -xtarget=ultra3 -xarch=v8plusa, or -fast -xtarget=opteron -xarch=amd64.
- The meaning of the -fast option can change with compiler releases.
- -fast allows the compiler to make floating-point arithmetic simplifications (for example, reordering floating-point expressions), so the resulting code is not IEEE-754 compliant.
- While -fast gives good performance on most code, it might not be the best set of options for your particular application.
Notes
- Using -fast enables a number of optimizations. Be sure that you understand all the optimizations that it uses.
- Use the flags -# or -xdryrun for C, and -v for C++ and Fortran to tell the compiler to list the components of -fast.
Suggestion
- -fast is a good starting point when optimizing code. However, it may not necessarily be the set of optimizations you want for the finished program. It is a better idea to use the -#, -xdryrun, or -v options to print out the options that -fast includes, and to select the appropriate ones for your application from this list.
Refer to "Comparing the -fast Option Expansion on x86 Platforms and SPARC Platforms" for the expansion of -fast by the Sun Studio 10 C, C++, and Fortran compilers (cc, CC, and f95, respectively).
The implications for floating-point arithmetic when using the -fast option
One issue to be aware of is the inclusion of floating-point arithmetic simplifications in -fast. In particular, the options -fns and -fsimple=2 allow the compiler to do some optimizations that do not comply with the IEEE-754 floating-point arithmetic standard, and also allow the compiler to relax language standards regarding floating-point expression reordering.
With the flag -fns, subnormal numbers (that is, very small numbers that are too small to be represented in normal form) are flushed to zero.
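As a minimal sketch of what a subnormal number is, the C fragment below halves the smallest normal double. The function names are hypothetical, and the behavior described in the comments assumes IEEE-754 double precision with gradual underflow in effect (that is, no flush-to-zero mode).

```c
#include <float.h>

/* Returns half of the smallest normal double. With IEEE-754 gradual
   underflow this is a nonzero subnormal value; in a flush-to-zero
   mode the result would be 0.0 instead. */
double half_of_smallest_normal(void)
{
    volatile double smallest = DBL_MIN;  /* volatile: divide at run time */
    return smallest / 2.0;
}

/* Returns 1 if the value survived as a nonzero subnormal,
   0 if it was flushed to zero. */
int subnormal_preserved(void)
{
    double x = half_of_smallest_normal();
    return x > 0.0 && x < DBL_MIN;
}
```

A program whose results depend on values in this range is one that may behave differently once subnormals are flushed to zero.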
With -fsimple, the compiler can treat floating-point arithmetic as a mathematics textbook might present it. For example, the order in which additions are performed doesn't matter, and it is safe to replace a divide operation by multiplication by the reciprocal. These kinds of transformations seem perfectly acceptable when performed on paper, but they can result in a loss of precision when algebra becomes real numerical computation with numbers of limited precision.
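The effect of reordering additions can be shown with a short C sketch. The function names are hypothetical, the input values were chosen for illustration, and the results in the usage note assume IEEE-754 double precision with round-to-nearest; the volatile temporaries only force each intermediate result to be rounded to double.

```c
/* Computes (a + b) + c, rounding the intermediate sum to double. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Computes a + (b + c), rounding the intermediate sum to double. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

For a = 1e16, b = -1e16, and c = 1.0, `sum_left` returns 1.0, while `sum_right` returns 0.0, because -1e16 + 1.0 is not representable and rounds back to -1e16. On paper both expressions are equal; a compiler that is allowed to reorder the additions may turn one into the other.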
Also, -fsimple allows the compiler to make optimizations that assume that the data used in floating-point calculations will not be NaNs (Not a Number). Compiling with -fsimple is not recommended if you expect computation with NaNs.
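A common idiom that depends on NaN semantics is sketched below in C. The function names are hypothetical. Under IEEE-754 rules a NaN is the only value that compares unequal to itself; a compiler that has been told NaNs never occur may simplify such a test away, which is why code like this should not be built with that assumption.

```c
/* Returns 1 if x is a NaN, 0 otherwise. Relies on the IEEE-754 rule
   that NaN != NaN; a compiler assuming "no NaNs" may reduce this
   test to a constant 0. */
int is_nan_value(double x)
{
    return x != x;
}

/* Produces a NaN at run time (0.0 / 0.0 under IEEE-754 arithmetic). */
double make_nan(void)
{
    volatile double zero = 0.0;   /* volatile: divide at run time */
    return zero / zero;
}
```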
Notes
- The use of the flags -fns and -fsimple can result in significant performance gains. However, they may also result in a loss of precision. Before committing to using them in production code, it is best to evaluate the performance gain you get from using the flags, and whether there is any difference in the results of the application.
- Avoid using -fsimple with applications that perform calculations on NaNs.
- For more information on floating-point computation, see the Numerical Computation Guide.
Advanced compiler options: Data Prefetch
Often the largest source of processor stall time for a program is the time taken to fetch data from memory. The UltraSPARC and AMD architectures have powerful hardware and software prefetch mechanisms. To get the most out of this feature of the chip, the compiler needs to insert prefetch instructions in the code.
Since the release of the Sun Studio 9 compilers, prefetch insertion has been enabled by default. However, it is worth discussing the two flags that control it. -xprefetch tells the compiler to insert prefetch instructions whenever appropriate. -xprefetch_level suggests to the compiler how aggressively it should insert those prefetch instructions. In general, prefetch will help codes that do a lot of floating-point arithmetic, or where the data is fetched from memory in a predictable order.
Another flag that helps prefetch insertion is -xdepend.
This flag tells the compiler to analyze dependences between loop
iterations, and to determine the memory access pattern. This allows
the compiler to do a better job of analyzing which variables are
fetched from memory, and then more accurately predicting when
variables should be prefetched.
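The difference between a predictable and an unpredictable memory access pattern can be sketched in C as follows. The function and type names are hypothetical, and whether a particular compiler actually inserts prefetches for either loop depends on the compiler version, flags, and target.

```c
#include <stddef.h>

/* Sequential access: the address of element i+1 is known long before
   it is needed, so data can be prefetched ahead of the loop. */
double sum_array(const double *a, size_t n)
{
    double total = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Hypothetical linked-list node, used only for illustration. */
struct node {
    double       value;
    struct node *next;
};

/* Pointer chasing: the address of the next node is known only after
   the current node has been loaded, which is much harder to prefetch. */
double sum_list(const struct node *head)
{
    double total = 0.0;
    for (; head != NULL; head = head->next)
        total += head->value;
    return total;
}
```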
Advanced compiler options: Assertions about C/C++ pointers
There are two flags that you can use to make assertions about the use
of pointers in your program. These flags will tell the compiler
something that it can assume about the use of pointers in your source.
It does not check to see if the assertion is ever violated, so if your
code violates the assertion, then your program might not behave in the
way you intended it to.
Note that lint can help you do some validity checking of the code at a particular -xalias_level. (See Chapter 5 of the C User's Guide.)
The two assertions are:
- -xalias_level: Indicates what assumptions can be made about the degree of aliasing between two different pointers. -xalias_level can be considered a statement about coding style -- you are telling the compiler how you treat pointers in the coding style you use (for example, you can tell the compiler that an int* will never point to the same memory location as a float*).
A useful piece of terminology is the expression 'alias'. Two pointers alias if they point to the same location in memory. The flags -xrestrict and -xalias_level tell the compiler what degree of aliasing to assume in the code. For the compiler, aliasing means that stores to the memory addressed by one pointer may change the memory addressed by the other pointer -- this means that the compiler has to be very careful never to reorder stores and loads in expressions containing pointers, and it may also have to reload the values of memory accessed through pointers after new data is stored into memory.
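A minimal C sketch of why possible aliasing forces reloads is shown below. The function names are hypothetical. The second version uses the C99 `restrict` qualifier to make, in the source, the kind of promise that the text describes for pointer parameters; the compiler does not check it, so calling that version with overlapping arrays is undefined.

```c
/* Adds src[0] to every element of dst. Because dst and src may point
   into the same array, the compiler has to reload src[0] after each
   store to dst[i]. */
void add_first(int *dst, const int *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] += src[0];
}

/* Same loop, but the caller promises that dst and src do not overlap,
   so src[0] may be loaded once and kept in a register. */
void add_first_restrict(int *restrict dst, const int *restrict src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] += src[0];
}
```

Calling `add_first(a, a, 3)` on the array {1, 2, 3} gives {2, 4, 5}, because the first store changes `src[0]` from 1 to 2. If the compiler had assumed no aliasing and kept the original `src[0]` in a register, the result would have been {2, 3, 4}. This is the sense in which a wrong assertion about pointers changes program behavior.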
The following table summarizes the options for -xalias_level for C (cc).

cc -xalias_level=   Comment
any                 Any pointers can alias (default)
basic               Basic types do not alias each other (for example, int* and float*)
weak                Structure pointers alias by offset. Structure members of the same type at the same offset (in bytes) from the structure pointer may alias.
layout              Structure pointers alias by common fields. If the first few fields of two structure pointers have identical types, then they may potentially alias.
strict              Pointers to structures with different variable types in them do not alias
std                 Pointers to differently named structures do not alias (so even if all the elements in the structures have the same types, if they have different names, then the structures do not alias).
strong              There are no pointers to the interiors of structures and char* is considered a basic type (at lower levels char* is considered as potentially aliasing with any other pointers)
The following table summarizes the options for -xalias_level for C++ (CC).

CC -xalias_level=   Comment
any                 Any pointers can alias (default)
simple              Basic types do not alias (same as basic for C)
compatible          Corresponds to layout for C
Notes
- Specifying -xrestrict and -xalias_level can lead to significant performance gains. But if your code does not conform to the requirements of the flags, then the results of running the application may be unpredictable.
- For C, -xalias_level=std means that pointers behave in the same way as the 1999 ISO C standard suggests. Use it for standard-conforming code.
Advanced compiler options: Crossfile optimization
The -xipo option performs interprocedural optimizations over the whole program at link time. This means that the object files are examined again at link time to see if there are any further optimization opportunities. The most common opportunity is to inline code from one file into code from another file. The term inlining means that the compiler replaces a call to a routine with the actual code from that routine.
Inlining is good for two reasons, the most obvious being that it eliminates the overhead of calling another routine. A second, less obvious reason is that inlining may expose additional optimizations that can now be performed on the object code. For example, imagine that a routine calculates the color of a particular point in an image by taking the x and y position of the point and calculating the location of the point in the block of memory containing the image (image_offset = y * row_length + x). By inlining that code in the routine that works over all the pixels in the image, the compiler is able to generate code that just adds one to the current offset to get to the next point, instead of having to do a multiplication and an addition to calculate the address of each point, resulting in a performance gain.
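The image example can be sketched in C as follows. The function names other than `image_offset` are hypothetical, and the sketch assumes a row-major image with no padding between rows (row_length equal to the width). In a real crossfile case the helper and its caller would live in different source files; they are shown together here only to keep the sketch self-contained.

```c
#include <stddef.h>

/* Offset of pixel (x, y) in a row-major image: one multiply, one add. */
static size_t image_offset(size_t x, size_t y, size_t row_length)
{
    return y * row_length + x;
}

/* Straightforward version: calls image_offset() for every pixel. */
unsigned long sum_pixels(const unsigned char *image,
                         size_t width, size_t height)
{
    unsigned long total = 0;
    size_t x, y;
    for (y = 0; y < height; y++)
        for (x = 0; x < width; x++)
            total += image[image_offset(x, y, width)];
    return total;
}

/* Roughly what the compiler can produce once the call is inlined:
   the offset simply advances by one for each pixel. */
unsigned long sum_pixels_incremental(const unsigned char *image,
                                     size_t width, size_t height)
{
    unsigned long total = 0;
    size_t offset;
    size_t count = width * height;
    for (offset = 0; offset < count; offset++)
        total += image[offset];
    return total;
}
```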
The downside of using -xipo is that it can significantly increase the compile time of the application and may also increase the size of the executable.
Advanced compiler options: Profile feedback
When compiling a program, the compiler takes a best guess at how the flow of the program might go -- about which branches are taken and which branches are untaken. For floating-point intensive code, this generally gives good performance. But programs with many branching operations might not obtain the best performance.
Profile feedback assists the compiler in optimizing your
application by giving it real information about the paths actually
taken by your program. Knowing the critical routes through the code
allows the compiler to make sure these are the optimized ones.
Profile feedback requires that you compile a version of your application with -xprofile=collect and then run it with representative input data to collect a runtime performance profile. You then recompile with -xprofile=use and the performance profile data collected. The downside of doing this is that the compile cycle can be significantly longer (you are doing two compiles and a run of your application), but the compiler can produce better-optimized execution paths, which means a faster runtime.
A representative data set should be one that will exercise the
code in ways similar to the actual data that the application will see
in production; the program can be run multiple times with different
workloads to build up the representative data set. Of course if the
representative data manages to exercise the code in ways which are
not representative of the real workloads, then performance may not be
optimal. However, it is often the case that the code is always
executed through similar routes, and so regardless of whether the
data is representative or not, the performance will improve.
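As an illustration of the kind of branch that profile feedback can inform, consider the C sketch below. The function name and the assumption that negative samples are rare are hypothetical. The compiler cannot know from the source alone which side of the test is the common one; a training run made with a -xprofile=collect build records how often each side is taken, and the -xprofile=use build can then lay out the code to favor the common path.

```c
#include <stddef.h>

/* Counts the non-negative samples and reports how many negative
   (error) samples were seen. Which branch is "hot" depends entirely
   on the input data. */
size_t count_valid(const int *samples, size_t n, size_t *errors)
{
    size_t valid = 0;
    size_t i;
    *errors = 0;
    for (i = 0; i < n; i++) {
        if (samples[i] < 0)
            (*errors)++;     /* assumed rare in the real workload */
        else
            valid++;         /* assumed to be the common path */
    }
    return valid;
}
```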
Suggestion
- Try compiling with profile feedback and see whether the performance gain is worth the additional compile time.
- Try compiling with profile feedback and -xipo, because the profile information will also help the compiler make better choices about inlining.
Advanced compiler options: Large pages for data
If the program manipulates large data sets, then it may be the
case that it would benefit from using large pages to hold the data.
The idea of a 'page' is a region of contiguous physical memory; the
processor deals in virtual memory, which allows the processor the
freedom to move the data around in physical memory, or even store it
to and load it from disk. Since the processor deals with virtual
memory, it has to look up virtual addresses to find the physical
location of that data in memory; in order to do this it uses the
concept of pages. Every time the processor needs to access a
different page in memory, it has to look up the physical location of
that page. This takes a small amount of time, but if it happens
often the time can become significant. The default size of these
pages is 8KB, however the processor can use a range of page sizes.
The advantage of using a large page size is that the processor will
have to perform fewer lookups, but the disadvantage is that the
processor may not be able to find a sufficiently large chunk of
contiguous memory to allocate the large page on (in which case a set
of 8KB pages will be allocated instead).
The compiler option which controls page size is -xpagesize=size. The options for the size depend on the platform. On UltraSPARC processors, typical sizes are 8K, 64K, 512K, or 4M. For example, changing the page size from 8K (the default) to 64K reduces the number of lookups needed to cover the same amount of data by a factor of 8. On the Opteron platform, the choices for page size are 4K (the default) or 2M. Operating system support for large pages became available on SPARC platforms with the Solaris 9 OS release, and on x86/x64 platforms with the Solaris 10 OS release.
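The factor of 8 follows from simple arithmetic, sketched here in C. The helper name `pages_needed` is hypothetical, and the 1 MB data size used in the note is only an example.

```c
#include <stddef.h>

/* Number of pages needed to map `bytes` of data with the given page
   size, rounded up. page_size must be nonzero. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

For 1 MB of data, `pages_needed(1024 * 1024, 8 * 1024)` is 128, while `pages_needed(1024 * 1024, 64 * 1024)` is 16, so the processor has one eighth as many page translations to keep track of.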
A set of flags to try
The final thing to do is to pull all these points together to make a suggestion for a good set of flags. Remember that this set of flags may not actually be appropriate for your application, but it is hoped that they will give you a good starting point. (Use of the flags in square brackets, [..], depends on special circumstances.)
Flags                            Comment
-g                               Generate debugging information (may use -g0 for C++)
-fast                            Aggressive optimization
-xtarget=ultra3 -xarch=v8plusa   Specify target platform
-xprefetch                       Enable prefetch instructions (enabled by default in Sun Studio 9)
-xipo                            Enable interprocedural optimization
-xprofile=[collect|use]          Compile with profile feedback
[-fsimple=0 -fns=no]             No floating-point arithmetic optimizations. Use if IEEE-754 compliance is important.
[-xalias_level=val]              Set level of pointer aliasing (for C and C++). Use only if you know the option to be safe for your program.
[-xrestrict]                     Uses restricted pointers (for C). Use only if you know the option to be safe for your program.
-xpagesize=64K                   Change the page size for data
Final remarks
There are many other options that the compilers recognize. The
ones presented here probably give the most noticeable performance
gains for most programs and are relatively easy to use. When
selecting the compiler options for your program:
- It is important to be aware of just what you are telling the compiler to do. A program may have unpredictable results if it does not conform to the requirements of the flags.
- When using optimization you will often be trading increased compile time for improved runtime performance.
- This leads to the final suggestion: use only the flags which both give you a performance benefit and make acceptable assertions about the code.
For details on all these options, see the
compiler user guides and man pages.
About the Author
Darryl Gove is a senior staff engineer in Compiler
Performance Engineering at Sun Microsystems Inc., analyzing and
optimizing the performance of applications on current and future
UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational
Research from the University of Southampton in the UK. Before joining
Sun, Darryl held various software architecture and development roles
in the UK.
(Page last updated January 3, 2006)