Improving Code Layout Can Improve Application Performance
By Darryl Gove, Senior Performance Engineer, June 22, 2005
Large applications have a particular problem: they have a lot of instructions, and the processor does not have the capacity to hold the entire application on-chip at any one time. As a consequence, larger applications spend some of their run time stalled with the processor waiting to fetch new instructions from memory.
This paper discusses several techniques that help the processor hold more useful instructions on-chip, consequently reducing the time wasted fetching instructions from memory.
Before doing that, it is important to realise that the split between
heavily used and rarely used code does not just occur at the level of
instructions. Whole routines are often either heavily used or rarely
used. Similarly, a library might be full of frequently used routines,
or might be required only because of a single library call that
almost never happens.
Since the compiler has the ability
to change the way the code is laid out in memory, it is possible for
the compiler to use memory more efficiently, but it will need more
information to do this. The remainder of this article covers three
different approaches that can be taken to improve the layout of the
application in memory.
Reordering routines using mapfiles
One approach to improving the situation is to use mapfiles. A mapfile
tells the linker how to lay out routines in memory. To improve the
layout of the code with a mapfile, it is necessary to order the
routines from the most frequently used to the least frequently used.
Drawing 2 shows our original program from drawing 1 laid out from hot
routines to cold using a mapfile.

It is possible to generate mapfiles manually, but an easier approach
is to use the Performance Analyzer:
1. Build the program using the flag -xF
2. Run the program with a representative workload under collect
3. Generate the mapfile using er_print -mapfile <app> <mapfilename>
4. Rebuild the application with the flags -xF -M <mapfile>
Once a mapfile is generated for an application, the same mapfile can
be used on subsequent compiles until the profile of the application
changes, routines are renamed, or additional routines are added.
$ cc -O -xF -o app *.c
$ collect app < test_data
Creating experiment test.1.er ...
$ er_print -mapfile app app.map test.1.er
$ cc -O -xF -M app.map -o app *.c
Table 1 - Creating a mapfile using the Performance Analyzer tools
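The ordering that a mapfile encodes, hottest routine first, can be sketched in C. This is only an illustration of the idea and not how er_print works internally: the struct, the function names, and the sample counts used with it are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical profile entry: a routine name and the number of
   profile samples attributed to it. Not the format of any real tool. */
struct routine {
    const char *name;
    unsigned long samples;
};

/* qsort comparator: routines with more samples sort first; ties are
   broken by name so the resulting order is deterministic. */
static int hotter_first(const void *a, const void *b)
{
    const struct routine *ra = a;
    const struct routine *rb = b;

    if (ra->samples != rb->samples)
        return (ra->samples < rb->samples) ? 1 : -1;
    return strcmp(ra->name, rb->name);
}

/* Sort the routines in place, from hottest to coldest. This is the
   order in which a mapfile would ask the linker to place them. */
void order_hot_to_cold(struct routine *routines, size_t count)
{
    assert(routines != NULL || count == 0);
    qsort(routines, count, sizeof routines[0], hotter_first);
}
```

Given routines with 3, 900, 0, and 450 samples, the sorted order places the 900-sample routine first and the 0-sample routine last, so the routines that are executed most often end up adjacent in memory.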
Improving the layout of instructions using profile feedback
Mapfiles work very well at the
routine level to separate frequently executed routines from
infrequently executed routines. However, much of the time is spent at
the instruction level, where the processor has to jump over blocks of
unexecuted code. Profile feedback is a compiler technique for
improving this situation.
The idea with profile feedback is to give the compiler information
about how the code is typically run. Based on this information, it
can perform optimisations of the following types:
- Arrange code so that the frequently executed code in a routine is
  grouped together.
- Inline routines that are frequently called, both to remove the cost
  of calling the routine and potentially to enable further
  optimisation of the inlined code.
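As an illustration of the first point, the routine below contains a common path and a rarely taken error path. It is a minimal sketch: the routine, the struct, and the hand-written counters are hypothetical and only stand in for the kind of information a profile-collecting build gathers automatically.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters standing in for profile data: how many times
   each side of the branch was executed. */
struct branch_profile {
    unsigned long common_path;
    unsigned long rare_path;
};

/* Sum the non-negative values; a negative value is treated as a rare
   error case and skipped. */
long sum_valid(const int *values, size_t count,
               struct branch_profile *profile)
{
    long total = 0;

    assert(profile != NULL);
    for (size_t i = 0; i < count; i++) {
        if (values[i] < 0) {
            /* Cold block: if the profile shows this is rarely run,
               the compiler can place it away from the loop body. */
            profile->rare_path++;
            continue;
        }
        /* Hot block: executed on almost every iteration, so it is
           worth keeping contiguous with the loop control code. */
        profile->common_path++;
        total += values[i];
    }
    return total;
}
```

Without profile data the compiler has to guess which side of the branch is the common one. With counts like these it can keep the hot block on the straight-line path, so the processor does not have to jump over the error handling on each iteration.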
Profile feedback works best with crossfile optimisation (controlled
by the flag -xipo), since this allows the compiler to look for
potential optimisations across all source files.
Drawing 3 shows how profile
feedback can rearrange code within a routine to put the frequently
executed code together.

Profile feedback is relatively straightforward to use:
1. Build the application with -xprofile=collect -xipo
2. Run the application with one or more representative workloads
3. Rebuild the application with -xprofile=use -xipo
Notice the inclusion of the -xipo
flag to enable the compiler to do optimisations across the source
files.
$ cc -O -xprofile=collect:app.profile -xipo -o app *.c
$ app < test_data
$ cc -O -xprofile=use:app.profile -xipo -o app *.c
Table 2 - Using profile feedback to optimise an application
Link-time optimisation
Mapfiles work at the routine level,
and profile feedback works within routines; it would seem to be a
simple progression to do both optimisations at the same time. This is
possible with link-time optimisation (also called post-optimisation).
The principle of link-time
optimisation is that the compiler has done its work, the code exists,
and all that is necessary is to lay it out appropriately. In laying
the code out appropriately, the link-time optimiser will sort the
routines so that hot routines are placed together (in a similar way
to mapfiles), and also lay out the code within those routines so that
hot instructions are placed together. However, it is possible at
link-time to go beyond this:
- Since the hot code has been identified, it is possible to place all
  the hot code together, and then place all the cold code together.
  The idea is to remove all cold code from the hot region, placing
  code from different routines into the same region of memory.
- It is also possible to do further optimisations since the addresses
  of variables and routines can be calculated exactly. Hence the
  link-time optimiser can simplify expressions which calculate the
  address of variables or routines -- this further reduces the
  instruction count.
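The first of these points, gathering hot code from different routines into one region, can be sketched as a partition over blocks of code. This is an illustrative model under stated assumptions, not the optimiser's actual algorithm: the struct, the function, and the execution-count threshold are hypothetical, and a real link-time optimiser must also fix up the branches between the blocks it moves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one block of code: the routine it
   belongs to, a label, and how often the profile saw it execute. */
struct code_block {
    const char *routine;
    const char *label;
    unsigned long exec_count;
};

/* Copy the blocks from in[] to out[] so that every hot block comes
   before every cold block, preserving the original order within each
   group. A block is hot if it ran at least hot_threshold times.
   Returns the number of hot blocks, which is the index in out[] at
   which the cold region starts. */
size_t split_hot_cold(const struct code_block *in, size_t count,
                      unsigned long hot_threshold,
                      struct code_block *out)
{
    size_t next = 0;
    size_t hot_count;

    assert(count == 0 || (in != NULL && out != NULL));

    /* First pass: hot blocks from all routines, placed together. */
    for (size_t i = 0; i < count; i++)
        if (in[i].exec_count >= hot_threshold)
            out[next++] = in[i];
    hot_count = next;

    /* Second pass: the cold blocks, placed after the hot region. */
    for (size_t i = 0; i < count; i++)
        if (in[i].exec_count < hot_threshold)
            out[next++] = in[i];

    return hot_count;
}
```

The result interleaves routines: the hot blocks of two different routines become neighbours, while the error handling and cleanup code from both moves to the cold region.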
Drawing 4 shows what an application
will look like after it has been link-time optimised. The hot code
will have been grouped together in one part of the binary, and the
cold code in a separate part.

The link-time optimisation step requires profile feedback data to
work, so the necessary steps are as follows:
1. Build the application with the flags -xprofile=collect -xipo
2. Run the application with one or more representative workloads
3. Rebuild the application with -xprofile=use -xipo -xlinkopt
$ cc -O -xprofile=collect:app.profile -xipo -o app *.c
$ app < test_data
$ cc -O -xprofile=use:app.profile -xipo -o app *.c -xlinkopt
Table 3 - Combining link-time optimisation with profile feedback
Concluding remarks
Using these techniques on larger
applications can yield significant performance gains. It should be
noted that there is a cost in terms of increased build times, and
increased build complexity; consequently the techniques should be
evaluated as to whether the gain is worth the additional effort in
the build. It should also be observed that not all builds of the
application need to go through the process of optimising the code
layout: development builds can be performed without this process, and
the process applied only to the final product build.
About the Author
Darryl Gove is a senior staff engineer in Compiler
Performance Engineering at Sun Microsystems Inc., analyzing and
optimizing the performance of applications on current and future
UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational
Research from the University of Southampton in the UK. Before joining
Sun, Darryl held various software architecture and development roles
in the UK.
(Last updated June 22, 2005)