Building Enterprise Applications with Sun Studio Profile Feedback
By Giri Mandalika, Market Development Engineering, Sun Microsystems, April 11, 2006
Large, CPU-intensive applications may perform better when built
with profile feedback. Profile feedback optimization requires that the
application be built twice: once to collect the profile data, and
again to use the profile to generate optimized code. This
requirement may prevent many software vendors from building their
applications with profile feedback, because generating a profile for each
software release can be impractical. However, it is
possible to reuse old profiles to minimize the overhead of profile
feedback builds in a development environment without compromising the
advantages of feedback-directed optimization.
This article introduces all the stages of profile feedback with
examples, and offers some tips for making profile feedback builds
feasible.
0. Introduction
In general, compilers generate object code based on
pre-defined heuristics and the optimization flags supplied during
compilation. However, since compilers cannot predict the dynamic
behavior of the code, they have to rely on heuristics for the best
possible guess. As a result, the generated code may or may not perform
well with typical workloads.
Processor stalls are one of the problems that can occur with
large applications that execute a very large number of instructions. Since the processor
cannot hold all the instructions on chip at any given time, it has to
wait while some instructions are being fetched from memory. So it is
up to the developer to lay out the high-level code carefully
to reduce processor stalls for better performance. As
developers are not the end users in most cases, it is a
cumbersome exercise for them to gather the application run-time data,
to identify the hot code (where the application spends most of
its time), and to rewrite or rearrange blocks of code to improve
the run-time performance. Programmers can be relieved of such tasks
by using the feedback-based optimization technique supported
by Sun Studio compilers. When the run-time behavior of the code is
available in the form of a profile, the compiler can lay out the
object code in a way that uses the on-chip (Level-1 or L1)
cache and the memory efficiently at run time.
Note that the Sun Studio C, C++ & Fortran compilers have the
ability to generate optimal code using profile feedback data. Even
though the examples in this article were written in C, the execution
methodology is the same for all applications, regardless of the
high-level language used to develop them. Also, the steps outlined in
this article can be used to build any kind of application, and
not just enterprise applications as the title suggests.
1. Feedback-Based Optimization
In some situations, the desired code improvements may not be
achieved directly with the compiler's classical optimization flags. For
example, a hot routine may not be auto-inlined by the compiler at
optimization level 4 (-xO4) or higher if its
inclusion violates the threshold heuristic defined in the compiler.
In this case, using profile feedback data may help inline the hot
code.
Feedback-based optimization (FBO) is the term used to describe any
technique that alters a program based on information gathered at
run time. It is also widely known as feedback-directed
optimization (FDO) and profile feedback optimization
(PFO). The idea behind this technique is to supply the compiler with
information about the run-time behavior of the program. After the
compiler instruments the object code, the program is
profiled, and the compiler then uses this profile data to generate
optimized code that runs faster.
When the profile data is available, the compiler's front end reads
the execution counts of each block from the profile feedback file,
and attaches them to the program's intermediate representation (IR).
This happens before any optimization is performed. Many
compiler optimizations subsequently use the execution counts from the IR.
Based on the profile data, the compiler can perform optimizations of the
following types:
Code Layout: Arrange code in a way that the frequently
executed code in a routine is grouped together. Its goal is to
reduce instruction cache (I$) misses, and to improve instruction
fetch by using profile information to guide the layout of code in
memory. The article Improving
Code Layout Can Improve Application Performance explains code
reordering using profile feedback.
Inlining: Inline routines that are frequently called.
Rarely executed functions may not be inlined even if they are
eligible for auto-inlining. Inlining eliminates the cost of the call
to the routine; and exposes further opportunities for optimization.
Register allocation: utilizes the block counts from
profile data, to determine register spills.
Loop transformations: Loop invariant code motion, loop
fusion, loop interchange, loop peeling etc.
Branch optimizations: Tail splitting, branch
interchange, loop unswitching etc.
Block straightening or outlining: Moves
infrequently executed blocks to separate sections.
Switch case code generation: Most frequent cases
tested early, to avoid branching.
Global instruction scheduling: Groups instructions with no
dependencies together to avoid processor stalls in the pipeline.
Delay slot scheduling: On processors with branch/call delay
slots, such as SPARC, the instruction placed immediately after a
branch or call (in its delay slot) is executed before the control
transfer takes effect. With profile data, the scheduler can reduce
branch penalties by filling the delay slots of normal branches with
instructions from the more frequently executed blocks.
sethi hoisting: Hoists sethi instructions to
reduce register pressure.
Copy loop detection: uses profile data to select frequently
executed copy loops.
Branch prediction: using profile data, the compiler
can minimize pipeline stalls for processors that support
branch prediction (such as SPARC) by setting the branch prediction
bit in the opcodes for conditional branch instructions.
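The typical steps involved in using the profile feedback mechanism are as
follows:
Compile the application for profile data collection.
Run the instrumented application with representative workloads to collect the profile data.
Re-build the application with the collected profile feedback.
You can find a quick introduction to these steps in Use
Profile Feedback To Improve Performance. The following sections detail
these steps with examples.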
1.1. Compile for profile data collection
Build the application with the -xprofile=collect compiler
option. In this step, the object code is instrumented to gather
profile data. That is, counters are inserted into the object code to
record the number of times each piece of code is executed.
Instrumented objects are also referred to as profiled objects.
Instrumented code runs slower than non-instrumented code. Use
instrumented code only to collect profile data.
When the instrumented binaries are run, the application may appear
unchanged to the end user, but profile data is collected as a side
effect of execution. The compiler uses this data in the use
phase of FBO to generate highly optimized binaries.
FBO requires the code to be compiled at optimization level 2 or
above. If no optimization level is specified on the compile line,
the compiler uses level 2 optimization (that is, -xO2) by
default. The compiler may suppress certain optimizations with the
-xprofile=collect option, to record accurate information
about the run-time behavior of the code. It is nevertheless recommended to
specify exactly the same compiler flags, except for the value of -xprofile, in both
phases of feedback-based optimization.
The following example shows the steps involved in generating
instrumented binaries with the Sun Studio C compiler.
% cat bubblesort.c
#include <stdio.h>
#include <stdlib.h>

#define COUNT 50

void swap (int *Array, int i, int j)
{
    int temp;

    temp = Array[i];
    Array[i] = Array[j];
    Array[j] = temp;
}

void bubblesort(int *Array, int length)
{
    int i, j;

    for (i = 0; i < length; ++i)
    {
        for (j = (i + 1); j < length; ++j)
        {
            if (Array[i] > Array[j])
            {
                swap (Array, i, j);
            }
        }
    }
}

int main()
{
    int i, *Array;

    Array = (int *) malloc (sizeof (int) * COUNT);
    for (i = COUNT; i > 0; --i)
        Array[COUNT - i] = i;
    bubblesort(Array, COUNT);
    for (i = 0; i < COUNT; ++i)
        printf("\nArray[%d] = %d", i, Array[i]);
    return (0);
}
To enable profile data collection, compile this code with
-xprofile=collect and -xO2 options.
% cc -o bubblesort -xO2 -xprofile=collect bubblesort.c
Use -xprofile_ircache[=path] with the
-xprofile=collect|use option, to improve compilation
time during the use phase by reusing compilation data
(Intermediate Representation or IR) saved from the collect
phase. Be aware that the saved data could increase disk space
requirements considerably.
For more information, please have a look at -xprofile=p
and -xprofile_ircache options, under C
compiler options reference in the C User's Guide.
1.2. Training run/profile feedback data collection run
Run the instrumented binary (that is, the binary compiled with
-xprofile=collect), with one or more representative
workloads. If the workload is representative, then the branches that
are normally taken in the training run are normally taken in the real
workload.
In general, if your program is only ever run with a single input file,
running it with that input file is enough to collect good
profile data. However, if you are creating a general-purpose
application that can receive a variety of inputs, each exercising
different parts of the program, choose several kinds
of representative sample inputs.
Using only certain kinds of input biases the compiler toward
the executed paths of the program at the expense of the non-executed paths.
So it is important to find one training workload, or a combination of
workloads, that gives the best possible results in almost all
scenarios.
In this phase, the compiler-instrumented code collects the branch
frequencies for all branches, and the counts for all basic blocks. As
a side effect of the execution, a directory named after the program,
with a .profile extension, is created. The feedbin
file under the <program>.profile directory holds the
execution frequencies of the various blocks, for later use by the
optimizer when the source code is compiled again with the -xprofile=use
option. The feedbin file is also referred to as the profile feedback
file.
Profile data collection is additive. That is, if you
run the profiled executable more than once, with similar or different
inputs, the data from the most recent run is added to the data
collected from previous runs. Therefore, the profile data is an
aggregate of all your runs with the profiled executable.
Be careful, however. If you have profile data
from earlier training runs, and you recompile the program with
-xprofile=collect and re-run it, the compiler-instrumented
code that writes out the profile data detects it as
a different program, and overwrites the old data.
By default, the <program>.profile directory
is created in the directory from which the executable is
run. If you wish to change the directory in which the profile
data resides, set the SUN_PROFDATA_DIR
environment variable, as shown in the following example.
Let's continue with the bubble sort example:
% ./bubblesort
Array[0] = 1
Array[1] = 2
..
..
Array[48] = 49
Array[49] = 50
% ls -dF *.profile
bubblesort.profile/
% ls -dF /tmp/*.profile
No match
% setenv SUN_PROFDATA_DIR /tmp
% ./bubblesort
Array[0] = 1
Array[1] = 2
..
..
% ls -dF /tmp/*.profile
/tmp/bubblesort.profile/
1.2.1 Single feedbin for all profiled processes
By default, the profiler thread creates one profile feedback file
(that is, feedbin) for each profiled executable. The default
behavior is good enough for small programs or applications with very
few executables. However, for large applications with tens of
executables, having many profile feedback files is inconvenient
in the use phase, where each feedback file is specified
on the compile line with a -xprofile=use:<path_to_profdir>
option to produce optimized binaries.
For example, if the application consists of twenty executables, we
need to have twenty -xprofile=use flags on the compile
line, as shown below:
% cc -xO2 -xprofile=use:feedback1 -xprofile=use:feedback2 [...] \
-xprofile=use:feedback19 -xprofile=use:feedback20 -o optimalbin <sourcefile>.c
There are two inconveniences with this:
If the makefile takes all compiler options from environment
variables such as CFLAGS, it may not be possible to
specify all instances of -xprofile=use in a single
CFLAGS, due to the underlying shell's restrictions on the
number of characters per variable.
The compile line may become too long and hard to read with too
many instances of -xprofile=use.
To get around these inconveniences, use the
compiler-supported environment variables SUN_PROFDATA_DIR
and SUN_PROFDATA in the profile data collection phase to
request that the profiler write all the profile data from different
profiled processes into a single feedbin file, instead
of creating one per executable. If these environment variables are
set, the profiler writes the profile data into the file named by
SUN_PROFDATA, under the directory SUN_PROFDATA_DIR.
That is, the profile data from all processes is written into
$SUN_PROFDATA_DIR/$SUN_PROFDATA.
The following trivial example illustrates the default behavior, as
well as the behavior with the SUN_PROFDATA and
SUN_PROFDATA_DIR environment variables set:
% cat a.c
#include <stdio.h>
int main()
{
    printf("In a.c\n");
    return (0);
}
% cat b.c
#include <stdio.h>
int main()
{
    printf("In b.c\n");
    return (0);
}
% cc -xO2 -xprofile=collect -o a a.c
% cc -xO2 -xprofile=collect -o b b.c
1.2.1.1 Default behavior
Note that in the following example there are two profiles, one
per executable. a.profile holds the profile data for
the executable "a"; and b.profile
holds the profile data for the executable "b".
% setenv SUN_PROFDATA_DIR /tmp/default
% ./a
In a.c
% ./b
In b.c
% ls /tmp/default
a.profile/ b.profile/
To use the feedback data, the programs have to be compiled as
follows:
% cc -xO2 -xprofile=use:/tmp/default/a -o a a.c
% cc -xO2 -xprofile=use:/tmp/default/b -o b b.c
1.2.1.2 Requesting a single feedback file
A single feedback file can be requested:
% setenv SUN_PROFDATA_DIR /tmp/consolidate
% setenv SUN_PROFDATA singlefeedbin.profile
During run-time, the profiler thread reads the values of SUN_PROFDATA
and SUN_PROFDATA_DIR and writes all profile feedback
data from different profiled processes into a single feedbin file
under /tmp/consolidate/singlefeedbin.profile directory.
Here's an example:
% mkdir /tmp/consolidate
% ./a
In a.c
% ./b
In b.c
% ls /tmp/consolidate
singlefeedbin.profile/
Observe that singlefeedbin.profile holds the feedback
data for both executables "a" and "b".
If there are more profiled processes, their profile data will be
appended to this feedbin file.
To use this profile, simply run:
% cc -xO2 -xprofile=use:/tmp/consolidate/singlefeedbin -o a a.c
% cc -xO2 -xprofile=use:/tmp/consolidate/singlefeedbin -o b b.c
Note, however, that writing to a single profile feedback file helps only
when several instrumented objects serve as dependencies for several
profiled processes. The purpose of the above example is only to
show how to request that the profiler write the profile data into a
single feedback file.
1.2.2 Asynchronous profile data collection
By default, profile data collection is synchronous. The profiler
thread waits for shared library finalization (if any), and
for the process to call exit(), before writing all the
profile data to the feedback file. In effect, the process must
exit normally for the profile data to be written. As a result, multi-threaded
applications may lose some profile data because of
race conditions among multiple threads.
Also, there is no guarantee that all applications, especially
multi-threaded applications, are designed to terminate
gracefully. If a profiled process does not call exit() but
is terminated in some other way, for example with
SIGKILL, it is unlikely that a usable profile can
be obtained from that process. If the profiled process dynamically
loads and unloads libraries with dlopen() and
dlclose(), indirect call
profiling comes into play, with its own share of problems in collecting the profile
data.
To alleviate the problems described above, we need a mechanism
to collect the profile data from a running process without requiring
it to terminate gracefully. An asynchronous profile data collection
feature was added in the Sun Studio 11 compiler release. It was then
backported to the Sun Studio 9 and 10 releases. Applying patch
115983-06 (or later) to Studio 9, and 117832-06 (or later) to Studio
10, gives you the ability to control the way the profile data is
collected. As a result, the chances of getting a good profile
from many single-threaded and multi-threaded applications are high, irrespective
of how the profiled processes exit.
1.2.2.1 Enabling asynchronous profile data collection
Asynchronous profile collection is not enabled by default. To
enable it, set SUN_PROFDATA_ASYNC_INTERVAL environment
variable before running the application. If
SUN_PROFDATA_ASYNC_INTERVAL has been set to a positive
integer value n at the start up of an application, the
profiler thread collects periodic profile data, every n
seconds, and subsequently updates the corresponding feedbin file. n
is the time interval between periodic profile snapshots, in seconds.
When data for a snapshot is collected, the profiler updates a
single profile directory whose name is of the form:
<procname>.<hostname>.<pid>[.profile]
where:
<procname> is the name of the process being
profiled
<hostname> is the host name of the machine
executing the profiled process
<pid> is the process id of
the profiled process
.profile will be appended to the name of the profile
directory unless <dir_name> is specified using the value of
the environment variable SUN_PROFDATA.
Note that the profiler thread collects profile snapshots only for
the process in which it was initiated. Forked processes will not
inherit the profiler thread.
The collected profile data can be used in the use phase of
profile feedback by specifying the compiler option:
-xprofile=use:<procname>.<hostname>.<pid>.
The profile directory can be renamed as you wish before specifying
it in -xprofile=use option.
1.2.2.2 Multiple profile snapshots per process
Asynchronous profile collection also enables the collection of
profile data more than once per process. If the environment variable
SUN_PROFDATA_ASYNC_SEQUENCE is defined, and set to an
integer value, num_snapshots ≥ 1, the profiler generates a
sequence of distinct profile snapshots whose names are of the form:
<procname>.<hostname>.<pid>.<n>[.profile]
where:
<n> is a positive integer in the range
[1..num_snapshots].
Subsequent profile snapshots are applied to update the
<procname>.<hostname>.<pid>[.profile]
directory for the remaining life time of the process.
The time sequence of profile snapshots generated by setting
SUN_PROFDATA_ASYNC_SEQUENCE might be used to determine
how long profile data should be collected from a given application
in order to obtain good performance with -xprofile=use.
Here's an example:
Let's assume that the program mtserver is compiled
with -xprofile=collect. The async profile data
collection can be done as follows:
% uname -n
v890appserv
% setenv SUN_PROFDATA_ASYNC_INTERVAL 30
% setenv SUN_PROFDATA_ASYNC_SEQUENCE 3
% setenv SUN_PROFDATA_VERBOSE
% setenv SUN_PROFDATA_DIR /tmp/profile
% ./mtserver &
[1] 8529
This example collects a snapshot of profile data from process 8529
every 30 seconds, as long as it runs. The first 3 snapshots will be
saved in their own profile directories:
/tmp/profile/mtserver.v890appserv.8529.1.profile,
/tmp/profile/mtserver.v890appserv.8529.2.profile and
/tmp/profile/mtserver.v890appserv.8529.3.profile. Then
the subsequent snapshots will update the feedback directory:
/tmp/profile/mtserver.v890appserv.8529.profile.
To get any warning messages during profile data collection,
define the environment variable SUN_PROFDATA_VERBOSE.
For multi-threaded programs, observe that the thread count increases
by one if the program is compiled with -xprofile=collect.
The extra thread that you didn't create is the profiler thread –
the compiler adds necessary code to create this thread as part of
its instrumentation.
1.3. Re-build the application with profile feedback
Once you gather the profile data from the profiled process, feed
it to the compiler with the flag -xprofile=use:<path_to_profdir>.
The compiler uses this data to do a better job of optimizing the
application code. Make sure to name the profile data directory. If
you specify only -xprofile=use, the compiler does not
know what the profile data directory is called, and therefore looks
for a.out.profile by default. Note that it is not
necessary to add .profile when specifying the profile
data directory name in -xprofile=use. In the bubble
sort example, it is valid to specify either
-xprofile=use:bubblesort.profile or
-xprofile=use:bubblesort on the compile line.
Except for the -xprofile option which changes from
-xprofile=collect to -xprofile=use, the
source files and other compiler options must be exactly the same as
those used for the compilation of profiled objects. The same version
of the compiler must be used for both collect and use builds.
If both -xprofile=collect and -xprofile=use
are specified on the same compile line, the rightmost -xprofile
option in the compile line is applied.
If you are compiling the object file with -xprofile=use
in a directory that is different from the directory in which the
object file was previously compiled with -xprofile=collect,
make sure to add the -xprofile_pathmap=<collect_prefix>:<use_prefix>
option on the compile line, so the compiler can find the profile data for
the object file. <collect_prefix> is the prefix of the
pathname of the directory in which the object file was compiled using
-xprofile=collect, and <use_prefix> is the
prefix of the pathname of the directory in which the object file is to
be compiled using -xprofile=use. Refer to the C compiler
options reference for detailed information about the -xprofile_pathmap
compiler option.
Continuing with the bubble sort example:
% cc -o bubblesort -xO2 -xprofile=use:bubblesort.profile bubblesort.c
% ./bubblesort
Array[0] = 1
Array[1] = 2
..
..
Important Note:
Measure the application performance with
profile feedback, and compare it with baseline numbers before you
put this into a build environment. Because it requires compiling the
entire application code twice, it is intended to be used only after
other debugging and tuning is finished, and as one of the last steps
before putting the application into production or releasing it to
the customers.
1.3.1 Compiling with multiple profiles
Sun Studio compilers accept multiple profiles on the compile
line, each given with its own -xprofile=use:<path_to_profdir>
option. For example:
% cc -xO2 -xprofile=use:/tmp/prof1.profile -xprofile=use:/tmp/prof2.profile
Combining several directories in a single option, as in
-xprofile=use:<path_to_profdir1>:<path_to_profdir2>..<path_to_profdirn>,
results in a compilation error.
When the compiler encounters multiple profiles on the compile
line, it merges all the profile data before performing any code
transformations based on the profile feedback data.
1.3.2 Extracting execution counts
If you are curious about the compiler code transformations
performed based on the profile feedback data, use the following code
generator (cg) options, to dump the execution count of each basic
block in an assembly listing.
C (cc):
-xprofile=use:<path_to_profdir>
-Wc,-assembly,-Qcg-V
C++ (CC) and Fortran (f95):
-xprofile=use:<path_to_profdir>
-Qoption cg -assembly,-Qcg-V
The -assembly option generates a .s
file with the same basename and dirname as the
object file (for example, bubblesort.o is accompanied by
bubblesort.s in the same directory). The -Qcg-V
option adds more information, as assembler comments, to the
generated .s file. If -xprofile=use has
been specified, this information includes execution counts derived
from <path_to_profdir>.
Here's an example:
% cc -xO2 -xprofile=use:bubblesort.profile -Wc,-assembly,-Qcg-V bubblesort.c
% cat bubblesort.s
...
...
! 15 !void bubblesort(int *Array, int length)
! 16 !{
!
! SUBROUTINE bubblesort
!
! OFFSET SOURCE LINE LABEL INSTRUCTION (ISSUE TIME) (COMPLETION TIME)
.global bubblesort
bubblesort: /* frequency 1.0 confidence 1.0 */
/* 000000 16 ( 0 1) */ save %sp,-96,%sp
/* 0x0004 ( 1 2) */ orcc %g0,%i1,%i1
! 17 ! int i, j;
! 19 ! for (i = 0; i < length; ++i)
/* 0x0008 19 ( 1 2) */ ble,pn %icc,.L77000015 ! tprob=0.00
/* 0x000c ( 1 2) */ or %g0,0,%l6 ! const ! hoisted
! Registers live out of bubblesort:
! o2 sp l6 i0 i1 i4 fp i7 gsr
!
! predecessor blocks : bubblesort
.L77000031: /* frequency 1.0 confidence 1.0 */
/* 0x0010 19 ( 0 1) */ or %g0,%i0,%l7
! 20 ! {
! 21 ! for (j = (i + 1); j < length; ++j)
/* 0x0014 21 ( 0 1) */ add %l6,1,%l5 ! no_overflow
! Registers live out of .L77000031:
! o2 sp l5 l6 l7 i0 i1 i4 fp i7 gsr
!
! predecessor blocks : .L77000031 .L900000205
.L900000206: /* frequency 50.0 confidence 1.0 */
/* 0x0018 21 ( 0 1) */ cmp %l5,%i1
/* 0x001c ( 0 1) */ bge,pn %icc,.L77000032 ! tprob=0.02
/* 0x0020 ( 0 1) */ add %l7,4,%i4 ! hoisted
! Registers live out of .L900000206:
! o2 sp l5 l6 l7 i0 i1 i4 fp i7 gsr
!
...
...
1.3.2.1 Alternatives
The code coverage analysis tool, tcov, can be used
to find the frequency of execution of blocks, and instructions. If
the source code is compiled with -g or -g0
debug options, the Sun Studio er_src utility can be
used to read the compiler inserted commentary about the code
transformations.
Please refer to Sun Studio's Performance Analyzer documentation,
for more detailed information about these tools.
2. Building Patches For An Enterprise Application
A frequently asked question about using the
profile feedback mechanism to build applications is: Is it necessary to
go through the entire profile feedback life cycle whenever changes are
made to the source code? The simple answer is no. The following
explains a simple way to avoid building the entire application with -xprofile=collect
when there aren't many changes in the code base.
If the application is very big and only a few objects have
changed, profile only those objects that will be re-built for the
patch. However, in order to collect a meaningful profile, there
must be -xprofile=collect versions of all the object files
that make up a re-built executable or shared library. For example,
if the executable mtserver is built by linking the object files a.o and b.o,
re-compile those objects with -xprofile=collect, and
re-link to build a new copy of mtserver. Then: (i) replace the
old binaries in the previously saved collect build with the newly built
binaries; (ii) repeat the training run, and collect the profile data
for the entire build; (iii) finally, re-compile all the object files
that make up the binary (executable or library) with -xprofile=use,
and then re-link to build the actual binary to be shipped to the
customer as a patch.
Here's an example:
Assume that a shared library libABC.so was built with profile
feedback, by linking the objects A.o, B.o and C.o. If the objects A
& B were modified/enhanced later, re-build libABC.so with
profile feedback, as outlined below:
Step 1: Compile the objects A and B with -xprofile=collect.
Step 2: Link the objects A.o (new), B.o (new) and C.o (old) to build
libABC.so. Make sure to specify the -xprofile=collect compiler flag on
the link line.
Step 3: Replace libABC.so in the previous full collect build with
the newly built libABC.so. The assumption here is that the full
collect build of the application that was used for collecting the
profile data when building the previous version of the application
is still available.
Step 4: Collect profile data for the entire application with the
training run, preferably with the workload used in the previous
training run(s).
Step 5: Compile the objects A and B again with the -xprofile=use
compiler flag, and with the new profile data from step #4.
Step 6: Re-link the objects A.o (new), B.o (new) and C.o (old) to
build libABC.so. Make sure to specify the -xprofile=use compiler flag
on the link line, along with the new profile data from step #4.
Step 7: Release libABC.so as a patch to the customers.
Repeat the above steps for all binaries (executables or shared
libraries) that will be released as a patch. Note that step #4
needs to be done only once, even if there are multiple binaries that
need to be re-built and released as part of a patch. If
several binaries need to be re-built because of the changes
in the source code, consider building the whole application with
-xprofile=collect, instead of building only those binaries (as
explained in the above example) that go into the patch.
In general, it is desirable to collect profile data whenever
there are changes in the code base. However, doing so may not be
feasible when very large applications are built with profile
feedback. So when the source code changes are limited to very few lines,
it is suggested to skip the profile data collection
and use the existing profile data, to reduce the overhead to some
extent.
Be aware that the gains from profile feedback may diminish over
time if the previously collected profile data keeps being used while
a large number of changes accumulate in the code base. So for optimal
performance, collect the profile data again for the whole
application when the number of source code changes becomes large
enough to warrant a bigger patch, that is, when distributing a large
number of modified binaries.
3. Compiling Modified Source With Old Profile Data
It is important to know how a simple change in source code
affects the feedback based optimization in the presence of old
profile data. Assume that a program was linked with a library
libstrimpl.so, that implements string comparison, __strcmp, and
string length calculation, __strlen.
Example:
% cat strimpl.h
int __strcmp(const char *, const char *);
int __strlen(const char *);
% cat strimpl.c
#include <stdlib.h>
#include "strimpl.h"
int __strcmp(const char *str1, const char *str2)
{
    int rc = 0;
    for(;;)
    {
        rc = *str1 - *str2;
        if(rc != 0 || *str1 == 0)
        {
            return (rc);
        }
        ++str1;
        ++str2;
    }
}
int __strlen(const char *str)
{
    int length = 0;
    for(;;)
    {
        if (*str == 0)
        {
            return (length);
        }
        else
        {
            ++length;
            ++str;
        }
    }
}
% cat driver.c
#include <stdio.h>
#include "strimpl.h"
int main()
{
    printf("\nstrcmp(pod, podcast) = %d", __strcmp("pod", "podcast"));
    printf("\nstrlen(Solaris10) = %d", __strlen("Solaris10"));
    return (0);
}
Assume that the shared library, libstrimpl.so, was built with
profile feedback, as shown below:
% cc -xO2 -xprofile=collect -G -o libstrimpl.so strimpl.c
% cc -xO2 -xprofile=collect -lstrimpl -o driver driver.c
% ./driver
% cc -xO2 -xprofile=use:driver -G -o libstrimpl.so strimpl.c
% cc -xO2 -xprofile=use:driver -lstrimpl -o driver driver.c
The library was extended with a new routine for string reversal, __strreverse,
for its next release. Let's see what happens if we skip the profile data
collection for this library after integrating the code for __strreverse routine.
Since the programmer may not care much about the organization of the independent
routines within the source file, the new routine can be placed anywhere (top,
middle or at the end) in the source file.
Case 1: The routine was added at the bottom of the file, that is,
after all existing routines
% cat strimpl.c
#include <stdlib.h>
#include "strimpl.h"
int __strcmp(const char *str1, const char *str2) { ... }
int __strlen(const char *str) { ... }
char *__strreverse(const char *str)
{
    int i, length = 0;
    char *revstr = NULL;
    length = __strlen(str);
    /* one extra byte for the terminating null character */
    revstr = (char *) malloc (sizeof (char) * (length + 1));
    if (revstr == NULL)
    {
        return (NULL);
    }
    for (i = length; i > 0; --i)
    {
        *(revstr + i - 1) = *(str + length - i);
    }
    revstr[length] = '\0';
    return (revstr);
}
% cc -xO2 -xprofile=use:driver -G -o libstrimpl.so strimpl.c
warning: Profile feedback data for function __strreverse is inconsistent. Ignored.
If you do not want to collect profile data for newly added code,
the recommended approach is to append the new code at the bottom of
the source file. The existing profile data then remains
consistent, and the compiler can use it to optimize the
untouched (existing) code, as before. Since no profile
feedback data is available for the new routine, the compiler simply
performs its other optimizations, as it usually does without the -xprofile compiler option.
Case 2: The routine was added somewhere in the middle of the
source file
% cat strimpl.c
#include <stdlib.h>
#include "strimpl.h"
int __strcmp(const char *str1, const char *str2) { ... }
char *__strreverse(const char *str) { ... }
int __strlen(const char *str) { ... }
% cc -xO2 -xprofile=use:driver -G -o libstrimpl.so strimpl.c
warning: Profile feedback data for function __strreverse is inconsistent. Ignored.
warning: Profile feedback data for function __strlen is inconsistent. Ignored.
The compiler reads the line numbers of the blocks and their execution
counts from the feedback (feedbin) file. As a result, introducing
new code in a routine makes its profile data inconsistent. Also,
since the position of every routine below the
newly introduced code may change, the profile data for those routines becomes
inconsistent as well. The compiler therefore ignores the profile data of such
routines to avoid introducing functional errors.
The same explanation holds when the new
code is added at the top of the source file, above all existing
routines. Doing so leaves all the profile data for this
object in an unusable (inconsistent) state. Observe the warnings in
the following example.
Case 3: The routine was added at the top of the source file
% cat strimpl.c
#include <stdlib.h>
#include "strimpl.h"
char *__strreverse(const char *str) { ... }
int __strcmp(const char *str1, const char *str2) { ... }
int __strlen(const char *str) { ... }
% cc -xO2 -xprofile=use:driver -G -o libstrimpl.so strimpl.c
warning: Profile feedback data for function __strreverse is inconsistent. Ignored.
warning: Profile feedback data for function __strcmp is inconsistent. Ignored.
warning: Profile feedback data for function __strlen is inconsistent. Ignored.
The bottom line: If the plan is to skip profile data collection in
favor of using old profile data from previous training run(s),
always add the new code at the bottom of the source file (unless it
needs to be placed somewhere else to avoid compilation errors), to
keep the data consistent for at least the majority of the existing code.
4. Other Compiler Options That Could Use Profile Data
The compiler option -xipo performs cross-file optimization, that is, optimization
that extends across multiple source files. One example of this kind of optimization
is inlining a routine from one source file into code from another source
file. In the presence of profile feedback, the compiler has a much better
model of the set of routines that are worth inlining.
Option -xlinkopt causes the compiler to perform link time
optimization. This final phase of compilation uses all the knowledge
of the generated code in order to do some final tweaking of the code
layout. This is useful for large codes where performance can be
gained by laying out the code to keep all the frequently executed
code together.
Refer to the technical article
Improving Code Layout Can Improve Application Performance for more information
about link time optimization.
5. Profile Data Portability Across Different Platforms
To reduce the build time overhead of profile feedback,
it would be desirable to use the profile data collected on one platform
when building the application on other
platforms, provided the application code is portable. At the time of this
writing, however, profile data collected with Sun Studio 11 compilers on SPARC
platforms is not compatible with profile data collected on x86/x64
platforms. That is, profile data collected on one platform cannot be
used on another platform.
6. Alternatives To Feedback-Based Optimization
Sun introduced a static optimizer, binopt, as part of the Sun Studio
11 compiler suite. binopt works directly on binaries. If
feedback-based optimization is not feasible, or did not help
much because the workloads used in the training
run(s) were not representative, binopt can be used as an alternative to improve the
performance of the application.
Refer to the Sun Studio Binary Code Optimizer technical article for further
information about using binopt.
7. Summary
The discussion of this article can be summarized as follows:
Depending on the version of your compiler, make sure the
most recent version of the corresponding patch is installed:
115983 for Studio 9, 117832 for Studio 10, and 120760 for Studio 11.
Build the application with the compiler flags
-xprofile=collect, -xO2 or higher, -xipo, -xlinkopt,
and other optimization flags.
For example:
% cc -xO2 -xprofile=collect -xipo -xlinkopt -o application application.c
Collect the profile feedback data asynchronously, by running
the application with one or more representative workloads.
% mkdir /tmp/myapp
% setenv SUN_PROFDATA_ASYNC_INTERVAL 30
% setenv SUN_PROFDATA_DIR /tmp/myapp
% ./application args
Re-build the application with -xprofile=use, optimization level
-xO2 or above, -xipo, -xlinkopt, and other optimization flags.
% cc -xO2 -xprofile=use:/tmp/myapp.profile -xipo -xlinkopt -o application application.c
References and Further Reading
Acknowledgements
The techniques described in this article are derived from earlier
work done by Vinod Grover and Chris Aoki, and the author wishes to acknowledge
their input.
About The Author
Giri Mandalika is a software engineer in the Sun Microsystems Market Development Engineering group, working with independent software vendors to make sure their products run well on Sun platforms. He holds a Master's degree in Computer Science from The University of Texas at Dallas.