C++ Compile-Time and Run-Time Performance
By Lawrence Crowl, Sun Studio Development Engineering, Revised March 14, 2006
The organization of your C++ software can have a significant impact on both
compile-time and run-time performance. We present a number of techniques for
improving performance. Topics include compilation, build environment, include
files, inline functions, class design, function design, catching errors, memory
allocation, and template design and instantiation. Because inconsiderate
pursuit of performance can lead to unmaintainable software, we also discuss the
engineering and maintenance implications of the techniques.
Introduction
The goal of any software development effort
should be to produce a quality product in a timely manner.
The run-time performance of the product contributes to its quality
by delivering results faster.
The compile-time performance of the product contributes to its timeliness
by shortening the edit-compile-debug cycle.
However, both run-time performance and compile-time performance
are secondary factors in achieving timely quality.
Therefore, you
should consider run-time and compile-time performance improvements
only when justified by improvements in overall product quality and timeliness.
This paper presents a number of techniques for improving the performance of
C++ programs ["Programming languages -- C++", ISO/IEC 14882:2003]
using the Sun C++ compiler. Most of these techniques are not specific to the
Sun C++ compiler and may be more generally applied. However, these techniques
are not a complete list, so also see the references. The techniques presented
are tactical in nature, providing generally local and incremental improvements
in performance. You can often gain significantly more performance by
improvements
in data structures, algorithms, abstractions, goals, and management.
Many of the techniques presented in this paper are not always helpful.
For instance, some run-time performance improvements
will degrade compile time and coding time.
Likewise, some compile-time performance improvements will degrade run time.
Fortunately, while all of a program contributes to compile time and coding time,
usually only a small fraction of the program
contributes most of the execution time.
You can concentrate run-time performance improvements
on the fraction of the program
that contributes significantly to execution time,
and apply compile-time performance improvements to the rest of the program.
The fraction of the program that contributes most of the execution time
is usually not obvious.
Furthermore, evidence indicates that developers are usually
very poor judges of where the time is spent in programs.
Therefore, you should apply performance improvements
only after the performance is measured,
and the hot spots are known, not suspected.
There are a number of tools available to measure performance. We recommend
the Sun Studio Performance Analyzer
[http://docs.sun.com/doc/819-3687],
which is part of our product. For more system-level effects,
truss
and especially
dtrace,
can also be useful tools.
Compilation
We are continually improving our compilers,
both in their compile time and the run time of generated programs.
Therefore, the most cost-effective technique
for improving run-time and compile-time performance
is often to simply use the newest compiler.
Turnaround
To achieve the fastest build times,
compile without -g and without optimization.
Optimization
The compiler flag -O expands to -xO3,
which has good balance between compile time and run time.
For faster compilation, use -xO1 or -xO2.
For faster execution, there are a number of options,
including -xO4, -xO5, -xipo, -xprofile,
-xlinkopt, and -xalias_level.
Finally, a note of caution about the -fast flag.
This flag is carefully interpreted to provide
maximum performance on a single machine.
It is not good choice for programs
that are to run on a wide variety of machines.
Build Environment
The build environment can have a significant affect on product build times.
Build times can easily be reduced by an order of magnitude
by paying careful attention to system-level effects.
The following rules are usually helpful,
but their significance to overall build times can vary quite a bit.
Use file systems that are local to the process. NFS file access is
significantly slower than local file access, and the compiler may
need to read and write many files.
Use make, not scripts fired by make.
Taking the time to write
explicit dependencies on directory creation, etc, will generally
yield faster build times.
Use flat make structures and efficient makefiles. The cost of yet
another invocation of make can be quite large, particularly when the
makefiles use macros that resolve to executing a program.
[Miller 1997]
Use dmake,
which is part of our product. Even on a single-processor machine, dmake
can sometimes halve build times by overlapping computation and I/O.
Install a local compiler. For single compilations, the cost of loading
the compiler from an NFS file system will never be amortized.
Libraries
Most C++ libraries are designed for good performance
over a wide range of uses.
For example,
the C++ standard library embodies good algorithms
with a significant number of services.
However, their general applicability means that such libraries
are not optimal for any given application.
You can achieve better run-time performance,
and often better compile-time performance, with your own custom library.
However,
designing, documenting, coding, and debugging a custom library is detrimental
to timeliness.
Make your own custom library only when you need product quality
more than you need product timeliness.
Include Files
A significant cost in compilation of C++ programs
is simply reading what can be a very large number of include files.
Program organizations that reduce the number of files included
can significantly reduce compile times.
Abstraction
The most effective technique for reducing file inclusion
is to reduce interface complexity.
Reducing interface complexity generally improves
the understandability of the product
and hence increases quality and timeliness,
so good interfaces generally improve both compilation time and debugging time.
You should understand abstractions
that can help organize your program effectively [Gamma 1995].
However, unbridled abstraction can be detrimental to the product.
For example,
when a project group must use an external interface,
the group often wraps the external interface by an internal interface.
This internal interface
is often just a simple renaming of the external interface,
providing none of the decoupling benefits that abstraction normally provides,
but consuming all of the costs in development time, compile time, and run time.
So do not create abstractions that do not provide recognizable services.
As an example of how abstractions can improve compile times,
consider the case of a service provided through
an abstract base class and factory,
rather than through full classes and the new operator.
For example:
class service {
public:
virtual answer * query() = 0;
virtual void statement( data * ) = 0;
static service * generate();
protected:
service();
};
...
service * instance = service::generate;
instance->statement( data1 );
answer * result = instance->query();
In this example, the entire structure of the actual service
is hidden from the clients of the service,
and those clients need not spend the time to compile that structure,
or include other structures implied by the internals of the service.
Just as importantly,
clients need not be recompiled when the internals of the service change,
which can dramatically reduce incremental product build times.
Multiple Classes
Often several class definitions are needed to provide a service.
These classes should all be defined in a single include file,
rather than one-class per include file.
This technique reduces the number of files opened,
which can have a significant effect on compilation time.
Incomplete Classes
Many classes only get used indirectly within an include file.
For instance, the answer and data classes in the above example
are used only through pointers.
In these cases, the compiler does not need the full definition of the classes,
and incomplete definitions suffice.
So, instead of
#include "answer.h"
use
class answer;
Guarded Includes
Because managing the proper inclusion of class definitions
can often become difficult,
programs tend to rely on idempotent, or self-protecting, include files.
For example,
#ifndef SERVICE_H
#define SERVICE_H
class service { ... };
#endif
If the file has not been previously read, the class is defined,
otherwise, the file is essentially empty.
We can avoid the unnecessary read of the file by testing the guard
on the outside of the include as well.
#ifndef SERVICE_H
#include "service.h"
#endif
This technique is most effective
when the compiler is hitting the limits of available main memory
and as a result the file cache is ineffective.
Many compilers, including Sun Studio 11,
will recognize the typical include guards
and never open the file a second time.
Thus, the additional guard at the include is redundant.
So, this technique is only useful with some compilers.
Inline Functions
Inline function expansion can significantly increase run time,
but the cost is significantly increased compilation time.
Do not use inline functions when you anticipate changes to the function
definition and recompiling all callers is expensive.
In particular, an inline function definition
commits you to a much larger ABI.
Do not commit yourself without due consideration.
Cost/Benefit
Calls to small and quick functions can be smaller and quicker when
expanded inline than when called normally. Conversely, calls to large
or slow functions can be larger and slower when expanded inline than
when branched to. Furthermore, all calls to an inline function must be
recompiled whenever the function definition changes. Consequently, the
decision to use inline functions requires considerable care.
The general advice is to use inline functions
when the inlined function results in fewer instructions
than the call to the function,
or when the application performs significantly faster with the function inline.
Usually this advice means that inline functions are either trivial,
or used extensively in the hot areas of the program.
Un-Inlinable Functions
The compiler cannot inline all function calls,
so making the most effective use of function inlining
may require some source changes.
Use the +w option to learn when function inlining does not occur.
The compiler currently will not inline a function below -xO4
under the following circumstances:
- The function contains difficult control constructs, such as loops,
switch statements, and try/catch statements.
- Many times these
functions execute the difficult control constructs infrequently. To
inline such a function, split the function into two parts, an inner
part that contains the difficult control constructs and an outer
part that decides whether or not to call the inner part. This
technique of separating the infrequent part from the frequent part
of a function can improve performance even when the compiler can
inline the full function.
- The inline function body is large or complicated.
- Apparently
simple function bodies may be complicated because of calls to other
inline functions within the body, or because of implicit
constructor and destructor calls (as often occurs in constructors
and destructors for derived classes). For such functions, inline
expansion rarely provides significant performance improvement, and
the function is best left un-inlined. As a general rule of thumb,
constructors should be out-of-line until proven effective inline.
- The arguments to an inline function call are large or complicated.
- The compiler is particularly sensitive when the object for an
inline member function call is itself the result of an inline
function call. To inline functions with complicated arguments,
simply compute the function arguments into local variables and then
pass the variables to the function. This rewrite is particularly
effective when the parameters are references.
- The function contains static local variables.
However, our compiler's inlining effectiveness is growing,
so what is true this release may not be true next release.
Conditional Inlining
Inlining reduces run time, but increases compile time.
Unfortunately, the value of run time versus compile time
is not fixed throughout program development.
During initial development and debugging,
compile time is significantly more important than run time.
Conversely, during final program build and benchmarking,
run time is more important than compile time.
You can control the amount of inlining by making inline definitions
conditional on a preprocessor variable.
The technique has the general structure:
- service.h:
#ifndef SERVICE_H
#define SERVICE_H
// Include files needed by the class definitions.
// Declare incomplete classes needed by the class definitions.
// Define the classes.
// Declare functions that are never inlined.
#ifdef INLINING
#include "service.ii"
#else // INLINING
// Declare other functions when not inlined.
#endif // INLINING
#endif // SERVICE_H
- service.ii
// Include other files needed by inline function definitions.
// Declare incomplete classes needed by inline function definitions.
// For each function definition:
#ifdef INLINING
inline
#endif INLINING
// Provide the function header and body.
- service.cc
#include "service.h"
// Include other files needed for _all_ function definitions.
#ifndef INLINING
#include "service.ii"
#endif // INLINING
// Define the functions that are never inlined.
The following example illustrates the above technique,
as well as some techniques described earlier.
(The code itself is meaningless.)
store.hstore.cc
#ifndef STORE_H
#define STORE_H
class store {
public:
store();
private:
long variable;
};
#endif // STORE_H
#include "store.h"
store::store() { }
type.htype.ii
#ifndef TYPE_H
#define TYPE_H
#include "store.h"
class sym;
class type : store {
public:
sym* mine();
type* strip();
private:
sym* my_var;
};
#ifdef INLINING
#include "type.ii"
#endif // INLINING
#endif // TYPE_H
#ifdef INLINING
inline
#endif
sym* type::mine() {
return my_var;
};
type.cc
#include "type.h"
#include "sym.h"
#ifndef INLINING
#include "type.ii"
#endif // INLINING
type* type::strip() {
return mine()->mine();
}
sym.hsym.ii
#ifndef SYM_H
#define SYM_H
#include "store.h"
class type;
class sym : store {
public:
sym* up();
type* mine();
private:
sym* up_var;
type* my_var;
};
extern sym* inner( sym* );
#ifdef INLINING
#include "sym.ii"
#else
extern sym* outer( sym* );
#endif // INLINING
#endif // SYM_H
#include "type.h"
#ifdef INLINING
inline
#endif // INLINING
type* sym::mine() {
return my_var;
};
#ifdef INLINING
inline
#endif // INLINING
sym* outer( sym* arg ) {
return inner( arg->up() );
}
sym.cc
#include "sym.h"
#include "type.h"
#ifndef INLINING
#include "sym.ii"
#endif // INLINING
sym* sym::up() {
return up_var;
}
sym* inner( sym* arg ) {
return arg->mine()->mine();
}
main.cc
#include "type.h"
#include "sym.h"
int main() {
sym sv;
sym* sp = sv.mine()->strip()->mine();
return &sv == outer( sp );
}
This technique is particularly valuable during debugging
because the compiler does not inline when the -g switch is present.
As a consequence, during debugging,
inline functions consume object space and compile time
without any performance gain.
Likewise, the reduced number of file inclusions
needed by inlining also improves compile time.
However, when inlining is enabled,
compilations will generally include more files
than they would without the technique.
However, because inlining is generally most useful during optimization,
and optimization itself is slow,
the relative cost in compile time due to extra includes is low.
So, for debuggable builds, use -g and do not define INLINING.
For fast builds without debugging,
do not use -g and do not define INLINING.
For run-time optimized builds,
use one of the optimization flags and define INLINING.
There is one strong note of caution in using this technique.
All objects including files testing INLINING
must be compiled with the same definition of INLINING.
In practice, this restriction means that
only application programmers can use the technique.
Class Design
The design of classes and class hierarchies
can have a significant impact on run-time performance,
through increased operation efficiency,
reduced data size, and reduced code size.
Bit-Fields
Convert boolean and small integer values into bit-fields,
and then place these fields adjacent to each other.
This technique can substantially reduce data size,
though it generally requires more instructions to implement.
The cost of reading bit-fields depends primarily on
how much instruction parallelism is available,
while the cost of writing bit-fields
is usually double the cost of writing a non-bit-field.
Field Order
Order data fields to limit the size of unused padding
between fields.
In particular, order fields by their types in the order
long double, double, long long int, pointers, long int,
float, int, short, and char.
This transformation may not be feasible or reliable
when the program uses typedefs that change with the build
environment.
Devirtualizing Methods
Convert a virtual method to a non-virtual method
when the virtualness of the method is not used.
This technique will improve the speed of calls,
and enable inlining for further speed improvements.
In general, this technique should be applied
towards the end of program development,
when the abstractions used by the program have become stable
and program run time is more important than development
flexibility.
Converter Methods
The standard mechanism for dynamic cast is very general, and
consequently is more expensive than most specific needs warrant.
You can achieve similar functionality by providing dynamic converter
methods rather than using dynamic_cast.
For example, instead of
car* car_ptr = dynamic_cast<car*>( vehicle_ptr )
use
car* car_ptr vehicle_ptr->to_car()
where
class vehicle {
virtual car* to_car() { return (car*)0; }
};
class car : vehicle {
virtual car* to_car() { return this; }
};
Duplicate Methods
Rather than duplicate methods in different derived classes,
define them in their least common base class.
This transformation reduces the working set, which improves run time,
and reduces the number of functions, which improves compile time.
Default Operators
Use the default operators. If a class definition does not declare a
parameterless constructor, a copy constructor, a copy assignment
operator, or a destructor, the compiler will implicitly declare
them. These are called default operators. A C-like struct has these
default operators. When the compiler builds a default operator, it
knows a great deal about the work that needs to be done and can
produce very good code. This code is often much faster than
user-written code because the compiler can take advantage of
assembly-level facilities while the programmer usually cannot. So,
when the default operators do what is needed, the program should not
declare user-defined versions of these operators.
Default operators are inline functions, so do not use default
operators when inline functions are inappropriate (see the previous
section). Otherwise, default operators are appropriate when:
-
The user-written parameterless constructor would only call
parameterless constructors for its base objects and member
variables. Primitive types effectively have "do nothing"
parameterless constructors.
-
The user-written copy constructor would simply copy all base
objects and member variables.
-
The user-written copy assignment operator would simply copy all
base objects and member variables.
-
The user-written destructor would be empty.
Some C++ programming texts suggest that class programmers always define
all operators so that any reader of the code will know that the class
programmer did not forget to consider the semantics of the default
operators. Obviously, this advice interferes with the optimization
discussed above. The resolution of the conflict is to place a comment
in the code stating that the class is using the default operator.
Value Classes
C++ classes, including structures and unions, are passed and returned
by value. For Plain-Old-Data (POD) classes, the C++ compiler is
required to pass the struct as would the C compiler. Objects of these
classes are passed directly. For objects of classes with user-defined
copy constructors, the compiler is effectively required to construct a
copy of the object, pass a pointer to the copy, and destruct the copy
after the return. Objects of these classes are passed indirectly. For
classes that fall between these two requirements, the compiler can
choose. However, this choice affects binary compatibility, so the
compiler must choose consistently for every class.
For most compilers, passing objects directly can result in faster
execution. This execution improvement is particularly noticeable with
small value classes, such as complex numbers or probability values. You
can sometimes improve program efficiency by designing classes that are
more likely to be passed directly than indirectly.
In compatibility mode (-compat[=4]),
a class is passed indirectly if it
has any one of the following:
- A user-defined constructor
- A virtual function
- A virtual base class
- A base that is passed indirectly
- A non-static data member that is passed indirectly
Otherwise, the class is passed directly.
In standard mode (the default mode), a class is passed indirectly if it
has any one of the following:
- A user-defined copy constructor
- A user-defined destructor
- A base that is passed indirectly
- A non-static data member that is passed indirectly
Otherwise, the class is passed directly.
To maximize the chance that a class will be passed directly:
- Use default constructors, especially the default copy constructor,
where possible.
- Use the default destructor where possible. The default destructor
is not virtual, therefore a class with a default destructor should
generally not be a base class.
- Avoid virtual functions and virtual bases.
Classes (and unions) that are passed directly by the C++ compiler are
passed exactly as the C compiler would pass a struct (or union).
However, C++ structs and unions are passed differently on different
architectures.
- SPARC V8 (32-bit addresses)
- Structs and unions are passed and returned by allocating storage
within the caller and passing a pointer to that storage. (That is,
all structs and unions are passed by reference.)
- SPARC V9 (64-bit addresses)
- Structs with a size no greater than 16 bytes (32 bytes) are passed
(returned) in registers. Unions and all other structs are passed and
returned by allocating storage within the caller and passing a pointer
to that storage. (That is, small structs are passed in registers;
unions and large structs are passed by reference.) As a consequence,
small value classes are passed as efficiently as primitive types.
- IA32 (32-bit addresses)
- Structs and unions are passed by allocating space on the stack and
copying the argument onto the stack. Structs and unions are returned
by allocating a temporary object in the caller's frame and passing
the address of the temporary object as an implicit first parameter.
Because all parameters are passed on the stack,
small value classes are passed as efficiently as primitive types.
- AMD64 (64-bit addresses)
- Structs with a size no greater than 16 bytes are passed
and returned in registers.
Unions and other structs are passed by allocating space on the stack and
copying the argument onto the stack.
Unions and other structs are returned
by allocating a temporary object in the caller's frame and passing
the address of the temporary object as an implicit first parameter.
(That is, small structs are passed in registers;
unions and large structs are passed on the stack.)
As a consequence,
small value classes are passed as efficiently as primitive types.
Function Design
Function efficiency is the core of good run-time performance,
and C++ functions often have subtle inefficiencies because the
language often hides significant work behind very terse syntax.
Reference Parameters
Programmers often code functions with reference parameters
rather than value parameters
because "it is more efficient",
rather than because the semantics are appropriate.
However, as we saw above,
value parameters may be more efficient,
and even when they are not directly more efficient,
the compiler knows that a value parameter cannot be aliased,
and so can better optimize access to the parameter.
However, switching between reference parameters and value parameters
is not neutral with respect to semantics,
so care must be taken that the switch is appropriate.
In particular, if the class has virtual functions,
passing the class as a value parameter is probably not appropriate.
Temporary Objects
C++ functions often produce implicit temporary objects, each of which
must be created and destroyed. For non-trivial classes, the creation
and destruction of temporary objects can be expensive in terms of
processing time and memory usage. The C++ compiler does eliminate some
temporary objects, but it cannot eliminate all of them.
Write functions to minimize the number of temporary objects as long as
your programs remain comprehensible. Techniques include using explicit
variables rather than implicit temporary objects and using reference
parameters rather than value parameters. Another technique is to
implement and use operations such as += rather than implementing and
using only + and =. For example, the first line below introduces a
temporary object for the result of a + b, while the second line does
not.
T x = a + b;
T x( a ); x += b;
Cache Member Variables
Accessing member variables is a common operation in C++ member functions.
The compiler must often load member variables from memory through the
this pointer. Because values are being loaded through a pointer, the
compiler sometimes cannot determine when a second load must be
performed or whether the value loaded before is still valid. In these
cases, the compiler must choose the safe, but slow, approach and reload
the member variable each time it is accessed.
You can avoid unnecessary memory reloads by explicitly caching the
values of member variables in local variables, as follows:
- Declare a local variable and initialize it with the value of the
member variable.
- Use the local variable in place of the member variable throughout
the function.
- If the local variable changes, assign the final value of the local
variable to the member variable. However, this optimization may
yield undesired results if the member function calls another member
function on that object.
This optimization is most productive when the values can reside in
registers, as is the case with primitive types. The optimization may
also be productive for memory-based values because the reduced aliasing
gives the compiler more opportunity to optimize.
This optimization may be counter-productive if the member variable is
often passed by reference, either explicitly or implicitly.
On occasion, the desired semantics of a class requires explicit caching
of member variables, for instance when there is a potential alias
between the current object and one of the member function's arguments.
For example:
void complex::operator *= ( const complex& right ) {
real = real * right.real + imag * right.imag;
imag = real * right.imag + imag * right.real;
}
will yield unintended results when called with:
x *= x;
The cached version does not exhibit the problem.
void complex::operator *=(const complex& right){
double this_real = real, this_imag = image;
double that_real = right.real, that_imag = right.image;
real = this_real *that_real + this_imag *that_image;
imag = this_real *that_imag + this_imag *that_real;
}
Common Call Elimination
C++ programs usually rely heavily on procedure calls,
and a function often calls the same function with the same data more than once.
While compilers are generally effective
at sharing the results of common subexpressions,
they are not effective at sharing the results of common calls.
For example,
person* grandfather = me->father()->father();
person* grandmother = me->father()->mother();
has a common call, and will generally be more efficient written as
person* dad = me->father();
person* grandfather = dad->father();
person* grandmother = dad->mother();
While this technique may seem obvious, and indeed it is,
it often requires special attention
because the extremely concise nature of C++ calls
often provides insufficient reminders to consider efficiency.
Const Declarations
Many programmers use const reference parameters with
the intent to inform the compiler that the parameter is read-only, and
hence should be better optimized. Unfortunately, this intent is at
odds with the C++ language definition.
The const keyword says that the storage
may not be modified through the given name.
What it does not say is that the storage
cannot be modified through some other name.
With the exception of variables directly declared const,
which means you can only initialize them,
const is basically ineffective a improving run-time performance.
It does, however, catch errors in the programming process.
Catching Run-Time Errors
While errors are generally uncommon,
their very possibility can have a significant effect
on program development time and on program performance.
Exit
In general, you should use assert, abort, or exit
(or an application-specific variant)
when program termination is acceptable.
This technique yields the smallest intrusion on program performance.
Exceptions
When program termination is unacceptable and errors are rare,
use exceptions.
The C++ exception mechanism is designed for effective handling of rare events.
The Sun C++ compiler has an extremely efficient implementation
of exceptions when they are not thrown,
and it is getting faster with each release.
Furthermore, exceptions limit the coding
needed to deal with the problem to the point of detection
and the point of handling,
and avoid handling by intermediates.
However, exceptions have a high cost when actually thrown,
and are thus not suitable when an error is common.
Return Codes
For common errors, the traditional function return codes
provide the most efficient run-time solution.
Memory Allocation
Memory allocation can consume a very large amount of time.
Reducing the amount of allocation
is usually the most effective technique for improving performance.
Deallocating Memory
Once unnecessary allocations are eliminated,
the next most effective technique for improving performance
is to deallocate memory when it is no longer needed.
This is best accomplished with explicit calls to delete.
However, applications may lose control of their memory,
and a conservative garbage collector can sometimes be a critical tool.
The C++ compiler provides a conservative garbage collector
with the option -library=gc.
Allocating Efficiently
Once memory is deallocated,
you must turn to improving the efficiency of allocation.
The option -library=gc provides a faster malloc and free.
If that is not enough,
use class-specific new and delete operators.
These are commonly called pool allocators.
Finally, when processing character strings,
consider using realloc when adjusting string size
rather than separate new and delete.
Templates
Programs that use templates heavily
can benefit quite a bit from careful template design.
Factoring Common Code
The simple intent of templates
is to adapt a general class or function design to a particular type.
This is readily accomplished by a simple template design.
Unfortunately, these simple designs often result in template instances
with large amounts of identical code
because the entire class or function is specialized,
not just the part that must be specialized.
template< typename T >
struct list {
T* chain;
T* next() { return chain; }
void method();
};
You can reduce the amount of code generated,
and hence the working set size of the application,
by defining a common non-template base class
and a derived template class.
The base class contains the common code,
and the derived class contains the part that must be specialized.
struct list_base {
void* chain;
void method();
};
template< typename T >
struct list : list_base {
T* chain;
T* next() { return (T*)chain; }
};
However, there are some costs associated with this technique.
First, debugging may be hindered when the base class uses void*
rather than a pointer to the actual type.
Second, use of a base class will introduce another layer of function calls,
which can cost more than the lower working set size justifies.
Third, some type-based optimizations may no longer be possible,
which is a concern only at very high levels of optimization.
Definitions
You can organize your template definitions in two ways:
with definitions included and with definitions separated.
The definitions-included organization
places the template definitions within the template header file.
- whatever.h
template< typename T >
int function( T ) { return 0; }
The definitions-separated organization
places the template definitions in a separate file,
which is read as-needed during template instantiation.
- whatever.h
template< typename T >
int function( T );
- whatever.cc
template< typename T >
int function( T ) { return 0; }
It is important to note that the template definition file
is essentially an automatic include file.
Do not compile this file.
The definitions-included organization
requires the compiler to store template definitions,
thus increasing compiler data size,
even when no templates are instantiated,
but it reads the least number of files.
The definitions-separated organization
avoids saving templates when none are instantiated,
but requires searching directories and reading additional files
when templates are instantiated.
If a particular compilation usually does not instantiate a template,
the definitions-separated organization is most efficient.
Otherwise, the definitions-included organization is most efficient.
Instantiation
Template instantiation can have a major impact on compile time.
The default instantiation mode of the compiler is -instances=global,
which provides full support of the language
with minimal programming effort,
and good compilation time.
However, it produces the largest object files.
The fastest template instantiation mode is -instances=explicit,
but it requires manual instantiation of all templates needed,
which can require as substantial amount of programmer effort
to identify all the needed instantiations.
A slightly easier mode is -instances=semiexplicit,
which requires manual instantiation of the directly used instances,
but permits the compiler to automatically produce dependent instances.
However, this mode restricts the kinds of templates one can use,
and produces larger programs than -instances=explicit.
See compiler manual for a discussion of the restrictions.
In summary, use the default instantiation mode
when object size and compilation constraints are not severe.
Use -instances=explicit or -instances=semiexplicit
when both are important and you are willing to work for the benefits.
Summary
First and foremost, design your program well.
Well-designed programs will often have good compile-time performance.
If not, local changes will usually bring good compile-time performance.
When the run-time performance is inadequate,
explore compilation options and alternate libraries.
When the run-time performance is still inadequate,
reprogram the small portion of the code
that contributes most to run time.
Further Reading
- C++ User's Guide
(Detailed information on the Sun Studio 11 C++ compiler,
including command-line options.)
- Performance Analyzer
(This guide describes the performance analysis tools
available with Sun Studio 11 software.)
- Jon Louis Bentley, "Writing Efficient Programs", Prentice-Hall, 1982
- Dov Bulka and David Mayhew,
"Efficient C++: Performance Programming Techniques",
Addison-Wesley, 2000
- Tom Cargill, "C++ Programming Style", Addison-Wesley, 1992
- Erich Gamma, et al., "Design Patterns: Elements of Reusable Object-Oriented
Software", Addison-Wesley, 1995
- "Programming languages
-- C++", ISO/IEC 14882:2003
- Scott Meyers, "Effective C++", Addison-Wesley, 1992
- Scott Meyers, "More Effective C++", Addison-Wesley, 1996
- Peter Miller, "Recursive Make Considered Harmful", AUUGN'97