Using Inline Templates to Improve Application Performance
By Darryl Gove, Compiler Performance Engineering Team, Sun Microsystems
Inline templates are a mechanism for directly
inserting assembly code into an executable. Typically, this approach
is used to obtain the best performance for a given function, or to
implement an algorithm in a specific way.
Introduction
In general, you should never need to use inline
templates. It is normally possible to do all the coding in a
high-level language, and the compiler is able to do an excellent job
of optimising this. However, in some cases you may know more
about the target hardware or about the behaviour of the code, or
you may want to do something that the compiler doesn't readily
support. In these rare situations you will find inline templates to
be helpful.
The following are examples where inline templates
are particularly useful:
- User coded mutex locks. If you want to code a
mutex lock, then you will probably want to use the atomic instructions.
- Hardware-level access. If you are coding for a
hardware device, or perhaps just accessing the registers already
present on the system, then you may end up wanting to use inline
templates.
- Precise implementation of algorithms. If you
have a short algorithm which can be implemented optimally using
hand-coding tricks that the compiler is unable to replicate, then you
may wish to use inline templates.
To use an inline template, place a regular function call
in the source code and write an inline template
with the same name as the function. At compile time the source file
and the file containing the inline template are compiled together,
and the compiler inserts the code from the inline template into
the code generated from the source code in place of the call.
The documentation for inlining using .il files can
be found in the inline(1) man page. This paper is based on that documentation.
Figure 1 - inline man page
man -M /opt/SUNWspro/man inline
The inline man page is also available in HTML from the documentation index of man pages.
Compiling with Inline Templates
You compile inline templates by placing them on
the same compile line as the file which uses them. The code is
inlined by the code-generator stage of compilation.
Figure 2 - Compiling with an inline template file
cc -O prog.c code.il
The above example will compile prog.c and inline
the code from code.il into the appropriate points.
Layout of Code in Inline Templates
The inline template file can contain a number of
inline templates. Each template starts with a declaration, and ends
with an end statement:
Figure 3 - Layout of an inline template
.inline identifier,argument_size
...instructions...
.end
The identifier is the name of the template, and
the argument_size is the size of the arguments in bytes (this is not
required for the latest compiler versions). Multiple templates of the
same name can be placed in the file, but the compiler will pick the
first one.
There is no need for a return instruction since
your template will be inlined directly into your code without a call.
Note that you must prototype the template in your
high-level source code to ensure that the compiler assigns correct
types for all the parameters.
Figure 4 - Example of a prototype for an inline template
void do_nothing();
Figure 5 - Example of a template
/* The following template does nothing*/
.inline do_nothing,0
nop
.end
Figure 4 shows the prototype as it might end up in
code.h. Figure 5 shows the inline template code as it might end up in
a separate code.il file. Inline templates are always in files
with the suffix .il. In the following examples, the prototype has
been included in the same box as the inline template code. This is only to
make the paper more readable; in practice they must go into different files.
Guidelines for Coding Inline Templates
The inline code can use only the integer registers %o0
to %o5 and the floating point registers %f0 to %f31 for temporary values;
other registers should not be used. These registers are referred to
as the 'caller-saved' registers. Calls can be made to other routines
from the inline template, but these calls are subject to the same
constraint.
The compiler will handle most of the SPARC
instruction set. If the template contains only instructions which the
compiler normally generates, then it will be early inlined (see
below), and the code will be scheduled optimally. If the template
contains instructions that the compiler understands, but does not
typically generate (such as VIS instructions or atomics), then the
code may be late inlined, and consequently the code may not be
optimally scheduled - resulting in a slight loss of performance.
Parameter Passing
Parameter passing follows the convention
defined for the target architecture, so it is different for 32-bit
and 64-bit code. It is described by the SPARC ABI, which can be
found at http://www.sparc.org/standards.html:
SCD 2.3 describes v8 (32-bit code) and SCD 2.4.1 describes v9 (64-bit
code).
On entering the template, arguments will be passed
in %o0-%o5, and will continue on the stack. For 32-bit code, the
seventh argument is at [%sp+0x5c] and %sp is guaranteed to be 8-byte aligned; for
64-bit code it is at [%sp+0x8af] (note that %sp+2047 is aligned
to a 16-byte boundary).
Figure 6 - Example of 32-bit parameter passing using the stack
int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);
/*Add up 7 integer parameters - last one will be passed on stack*/
.inline add_up,28
add %o0,%o1,%o0
ld [%sp+0x5c],%o1
add %o2,%o3,%o2
add %o4,%o5,%o4
add %o0,%o1,%o0
add %o2,%o4,%o2
add %o0,%o2,%o0
.end
The following is the same example for 64-bit code. Note that when a 32-bit
int register is passed on the stack, the full 64 bits of the register
are saved:
Figure 7 - Example of 64-bit parameter passing using the stack
int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);
/*Add up 7 integer parameters - last one will be passed on stack*/
.inline add_up,56
add %o0,%o1,%o0
ldx [%sp+0x8af],%o1
add %o2,%o3,%o2
add %o4,%o5,%o4
add %o0,%o1,%o0
add %o2,%o4,%o2
add %o0,%o2,%o0
.end
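The two stack offsets used in Figures 6 and 7 can be checked with simple arithmetic. The sketch below assumes the usual SPARC stack layouts: in 32-bit code a 64-byte register save area and a 4-byte hidden word come before the 4-byte argument slots, and in 64-bit code the stack bias is 2047, the save area is 128 bytes, and each argument slot is 8 bytes. The function names are hypothetical and exist only for this illustration.

```c
#include <stdio.h>

/* Offset from %sp of argument n (counting from 0) in 32-bit (v8) code.
   Assumes a 64-byte register save area, a 4-byte hidden word,
   and one 4-byte slot per argument. */
int v8_arg_offset(int n)
{
    return 64 + 4 + 4 * n;
}

/* Offset from %sp of argument n (counting from 0) in 64-bit (v9) code.
   Assumes a stack bias of 2047, a 128-byte register save area,
   and one 8-byte slot per argument. */
int v9_arg_offset(int n)
{
    return 2047 + 128 + 8 * n;
}

/* Print the offsets of the seventh argument (n == 6), which is the
   first argument that does not fit in %o0-%o5. */
void print_seventh_arg_offsets(void)
{
    printf("v8: %%sp+0x%x\n", v8_arg_offset(6));
    printf("v9: %%sp+0x%x\n", v9_arg_offset(6));
}
```

Under these assumptions the seventh argument lands at %sp+0x5c in 32-bit code and at %sp+0x8af in 64-bit code, which are the addresses loaded in the two templates.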
For 32-bit code, floating point values will be
passed in the integer registers; for 64-bit code they will be passed
in the floating point registers.
Figure 8 - Example of 32-bit parameter passing by value
double sum_val(double a, double b);
/*sum of two doubles by value*/
.inline sum_val,16
st %o0,[%sp+0x48]
st %o1,[%sp+0x4c]
ldd [%sp+0x48],%f0
st %o2,[%sp+0x48]
st %o3,[%sp+0x4c]
ldd [%sp+0x48],%f2
faddd %f0,%f2,%f0
.end
Figure 9 - Example of 64-bit floating point parameter passing
double sum(double a, double b);
/*sum of two doubles 64-bit calling convention*/
.inline sum,16
faddd %f0,%f2,%f0
.end
For values passed in memory, single precision
floating point values and integers are guaranteed to be 4-byte
aligned. Double precision floating point values will be 8-byte
aligned if their offset in the parameters is a multiple of 8 bytes.
Integer return values are passed in %o0. Floating
point return values are passed in %f0/%f1 (single precision values in
%f0, double precision values in the register pair %f0,%f1).
For 32-bit code there are two ways of passing
floating point values: by value and by reference. Either way, the compiler
will do its best to optimise out the load and store instructions, but it
is often more successful at doing this if the floating point
parameters are passed by reference.
Example of 32-bit by reference parameter passing:
Figure 10 - Example of 32-bit parameter passing by reference
double sum_ref(double *a, double *b);
/*sum of two doubles by reference*/
.inline sum_ref,16
ldd [%o0],%f0
ldd [%o1],%f2
faddd %f0,%f2,%f0
.end
Stack Space
Sometimes it is necessary to store variables to
the stack in order to load them back later - this is the case for
moving between the int and fp registers. The best way of doing this
is to use the space which is already set aside for the parameters
which are passed into the function.
For example, in the v8 code shown in Figure 8, the location %sp+0x48 is
8-byte aligned (%sp is 8-byte aligned), and it corresponds to the
place where the 2nd and 3rd 4-byte integer
parameters would be stored if they were passed on the stack (note
that the first parameter would be stored at a non-8-byte boundary).
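The store, store, load-double sequence in Figure 8 reassembles a double from the two 32-bit halves that arrive in a pair of integer registers. The following C sketch models the same idea at the value level. It assumes a 64-bit IEEE 754 double, and that the first register of the pair (%o0 in Figure 8) holds the most significant half, as it does on big-endian SPARC. The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Rebuild a double from two 32-bit words. hi models the first register
   of the pair (for example %o0) and lo the second (for example %o1). */
double words_to_double(uint32_t hi, uint32_t lo)
{
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double d;
    memcpy(&d, &bits, sizeof d);   /* reinterpret the bits without aliasing issues */
    return d;
}

/* Split a double into the two 32-bit words that would be passed
   in a pair of integer registers. */
void double_to_words(double d, uint32_t *hi, uint32_t *lo)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    *hi = (uint32_t)(bits >> 32);
    *lo = (uint32_t)bits;
}

/* Split a value and put it back together, as the caller and the
   template do between them. */
double round_trip(double d)
{
    uint32_t hi, lo;
    double_to_words(d, &hi, &lo);
    return words_to_double(hi, lo);
}
```

The template has to go through memory because it moves the data between the integer and floating point register files; the C version only needs memcpy to reinterpret the bits.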
Branches and Calls
Branches and calls are supported.
Every branch or call must be followed by a nop instruction
to fill the branch delay slot. It is possible to put
instructions in the delay slot of branches, which can be useful if
you wish to use the processor support for annulled instructions, but
doing so will cause the code to be late-inlined (described below), and may
result in sub-optimal performance.
Call instructions must have an extra last argument
which indicates the number of registers used to pass arguments in
the call parameters. In general you should avoid inlining call
instructions.
The destinations of branches must be indicated
with a number, and the branch instructions should use this number to
indicate the appropriate destination together with an f for a forward
branch or a b for a backward branch.
Example:
Figure 11 - Example of using branches in an inline template
int is_true(int i);
/*return whether true*/
.inline is_true,4
cmp %o0,%g0
bne 1f
nop
mov 1,%o0
ba 2f
nop
1:
mov 0,%o0
2:
.end
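To make the control flow of the branches easier to follow, here is a C model of the template in Figure 11, written with goto so that the labels line up with the numeric labels 1: and 2:. It is only a reading aid under the assumption that the template behaves exactly as written; the name is_true_model is hypothetical.

```c
/* C model of the is_true template. The variable o0 stands for the
   register %o0, which holds the argument on entry and the result on exit. */
int is_true_model(int i)
{
    int o0 = i;
    if (o0 != 0)        /* cmp %o0,%g0 ; bne 1f ; nop */
        goto label1;
    o0 = 1;             /* mov 1,%o0 */
    goto label2;        /* ba 2f ; nop */
label1:
    o0 = 0;             /* mov 0,%o0 */
label2:
    return o0;          /* the result is left in %o0 */
}
```

Read this way, the template yields 1 when its argument is zero and 0 otherwise, which agrees with the is_true line in the output shown in Figure 21.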
Late and Early Inlining
Inlining of templates is done by the code
generator part of the compiler. There are two opportunities for
inlining: before and after optimisation. If the inline template is
'complex' then it will be inlined after optimisation (that is,
late inlined), which means that the code will appear more or less
exactly as it is written in the template. If the code is inlined before
optimisation (early inlining), then it will be merged with the other
code around the call site.
Early inlining will lead to better performance.
Things that will cause late inlining include instructions that the
compiler understands but does not typically generate (such as VIS
instructions or atomics), instructions placed in the delay slot of a
branch, and references to the %fp register.
You will get information on inlining in the compiler
commentary when the code is compiled with -g. This
information will tell you if a routine is late inlined; if there is
no comment, then the routine will have been early inlined. An example
of this is attempting to inline the following (incorrect) template:
Figure 12 - Incorrect inline template
.inline sum_val,16
st %o0,[%fp+0x48]
st %o1,[%fp+0x4c]
ldd [%fp+0x48],%f0
st %o2,[%fp+0x48]
st %o3,[%fp+0x4c]
ldd [%fp+0x48],%f2
faddd %f0,%f2,%f0
.end
The template in Figure 12 is incorrect because the
code uses the frame pointer (%fp) rather than the stack pointer
(%sp). The compiler will still inline the code, but because of this
error it is unable to early inline the code and has to late
inline it instead.
Figure 13 - Compiling with -g to generate debug information
cc -g -O inline32.il driver32.c
Figure 13 shows the compile line used to generate
a 32-bit executable with debug information. Note that the debug
information is stored in the .o files by default, so it is necessary
to keep these files available.
Figure 14 - Using er_src to output compiler commentary
er_src a.out main
Source file: /home/dg83945/book_code/inline/driver32.c
Object file: /home/dg83945/book_code/inline/driver32.o
Load Object: a.out
1. #include <stdio.h>
2.
3. void do_nothing();
4. int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);
5. double sum_val(double a, double b);
6. double sum_ref(double *a, double *b);
7. int is_true(int i);
8.
9.
10. void main()
11. {
12. double a=3.11,b=7.22;
13. do_nothing();
14. printf("add_up %i\n",add_up(1,2,3,4,5,6,7));
Template could not be early inlined because it references the register %fp
Template could not be early inlined because it references the register %fp
Template could not be early inlined because it references the register %fp
Template could not be early inlined because it references the register %fp
Template could not be early inlined because it references the register %fp
Template could not be early inlined because it references the register %fp
15. printf("sum_val %f\n",sum_val(a,b));
16. printf("sum_ref %f\n",sum_ref(&a,&b));
17. printf("is_true 0=%i,1=%i\n", is_true(0),is_true(1));
18. }
The utility er_src can be used to examine the
compiler commentary for a particular file. It takes two parameters:
the name of the executable and the name of the function which you
wish to examine. In this case the template which cannot be early
inlined is sum_val. Each time the compiler comes across the %fp
register it inserts a message, so you can tell that there are
six references to %fp in the template.
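The count of six messages can be cross-checked against the template text itself. The sketch below is a hypothetical helper, not part of the Sun tools: it does a plain substring count of a register name in a template held as a string, and it embeds the text of Figure 12 as test data.

```c
#include <string.h>

/* Count how many times a register name such as "%fp" appears in the
   text of a template. This is a plain substring match. */
int count_register_refs(const char *template_text, const char *reg)
{
    int count = 0;
    size_t len = strlen(reg);
    const char *p = template_text;

    if (len == 0)
        return 0;
    while ((p = strstr(p, reg)) != NULL) {
        count++;
        p += len;   /* continue searching after this match */
    }
    return count;
}

/* Text of the incorrect template from Figure 12. */
const char figure12_template[] =
    ".inline sum_val,16\n"
    "st %o0,[%fp+0x48]\n"
    "st %o1,[%fp+0x4c]\n"
    "ldd [%fp+0x48],%f0\n"
    "st %o2,[%fp+0x48]\n"
    "st %o3,[%fp+0x4c]\n"
    "ldd [%fp+0x48],%f2\n"
    "faddd %f0,%f2,%f0\n"
    ".end\n";
```

Calling count_register_refs(figure12_template, "%fp") gives one hit for each of the six load and store instructions, matching the six commentary lines in Figure 14.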
Decoding the Calling Convention
The calling convention for the architecture can be
a bit tricky to master. The easiest way of dealing with this is to
write a test function, and see how that gets converted into assembly
language.
Figure 15 - Examining the 32-bit calling convention
# more fptest.c
double sum(double d1,double d2, double d3, double d4)
{
return d1 + d2 + d3 + d4;
}
# cc -O -xarch=v8plusa -S fptest.c
# more fptest.s
....
.global sum
sum:
/* 000000 2 */ st %o0,[%sp+68]
/* 0x0004 */ st %o2,[%sp+76]
/* 0x0008 */ st %o1,[%sp+72]
/* 0x000c */ st %o3,[%sp+80]
/* 0x0010 */ st %o4,[%sp+84]
/* 0x0014 */ st %o5,[%sp+88]
! 3 ! return d1 + d2 + d3 + d4;
/* 0x0018 3 */ ld [%sp+68],%f2
/* 0x001c */ ld [%sp+72],%f3
/* 0x0020 */ ld [%sp+76],%f10
/* 0x0024 */ ld [%sp+80],%f11
/* 0x0028 */ ld [%sp+84],%f4
/* 0x002c */ faddd %f2,%f10,%f12
/* 0x0030 */ ld [%sp+88],%f5
/* 0x0034 */ ld [%sp+92],%f6
/* 0x0038 */ ld [%sp+96],%f7
/* 0x003c */ faddd %f12,%f4,%f14
/* 0x0040 */ retl ! Result = %f0
/* 0x0044 */ faddd %f14,%f6,%f0
....
In the example code you can see that the first
three fp parameters are passed in %o0-%o5, and that the fourth fp
parameter is passed on the stack at locations %sp+92 and %sp+96. Note
that this location is only 4-byte aligned, so it is not possible to use a
single floating point load double instruction to load it.
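The layout seen in Figure 15 can be reproduced with a little arithmetic. The sketch below assumes that every argument is a double occupying two consecutive 4-byte argument words, that words 0 to 5 travel in %o0-%o5, that argument word w has its slot at %sp+68+4*w (the offsets used by the stores in Figure 15), and that %sp is 8-byte aligned. The function names are hypothetical.

```c
/* Offset from %sp of 4-byte argument word w (counting from 0) in 32-bit code. */
int v8_word_offset(int w)
{
    return 68 + 4 * w;
}

/* Nonzero if both words of double argument k (counting from 0)
   are passed in the registers %o0-%o5. */
int v8_double_in_registers(int k)
{
    return 2 * k + 1 <= 5;
}

/* Nonzero if the slot for double argument k starts on an 8-byte boundary,
   so that a single ldd could load it (given that %sp is 8-byte aligned). */
int v8_double_ldd_ok(int k)
{
    return v8_word_offset(2 * k) % 8 == 0;
}
```

For the fourth double (k equal to 3) this gives words 6 and 7 at %sp+92 and %sp+96, outside the registers and not on an 8-byte boundary, which is why the compiler output uses pairs of single-word loads.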
Example for 64-bit code:
Figure 16 - Examining the 64-bit calling convention
# more inttest.c
long sum(long v1,long v2, long v3, long v4, long v5, long v6, long v7)
{
return v1 + v2 + v3 + v4 + v5 + v6 + v7;
}
# cc -O -xarch=v9 -S inttest.c
# more inttest.s
...
/* 000000 2 */ ldx [%sp+2223],%g2
/* 0x0004 3 */ add %o0,%o1,%g1
/* 0x0008 */ add %o3,%o2,%g3
/* 0x000c */ add %g3,%g1,%g4
/* 0x0010 */ add %o5,%o4,%g5
/* 0x0014 */ add %g5,%g4,%o1
/* 0x0018 */ retl ! Result = %o0
/* 0x001c */ add %o1,%g2,%o0
...
In the above code you can see that the first
action is to load the seventh integer parameter from the stack.
Other Examples of Templates
Templates are used in libm.il, the inline math
library, and in vis.il, the Visual Instruction Set inline library.
These two files can be found in /opt/SUNWspro/prod/lib/. The compiler
uses them when the flags -xlibmil (for the math
templates) or -xvis (for the VIS templates) are specified. The
include files which prototype the functions in the template libraries
are math.h and vis.h.
Complete Source Code for 32-Bit Examples
Figure 17 - inline32.il file for 32-bit inline template examples
/* The following template does nothing*/
.inline do_nothing,0
nop
.end
/*Add up 7 integer parameters - last one will be passed on stack*/
.inline add_up,28
add %o0,%o1,%o0
ld [%sp+0x5c],%o1
add %o2,%o3,%o2
add %o4,%o5,%o4
add %o0,%o1,%o0
add %o2,%o4,%o2
add %o0,%o2,%o0
.end
/*sum of two doubles by value*/
.inline sum_val,16
st %o0,[%sp+0x48]
st %o1,[%sp+0x4c]
ldd [%sp+0x48],%f0
st %o2,[%sp+0x48]
st %o3,[%sp+0x4c]
ldd [%sp+0x48],%f2
faddd %f0,%f2,%f0
.end
/*sum of two doubles by reference*/
.inline sum_ref,16
ldd [%o0],%f0
ldd [%o1],%f2
faddd %f0,%f2,%f0
.end
/*return whether true*/
.inline is_true,4
cmp %o0,%g0
bne 1f
nop
mov 1,%o0
ba 2f
nop
1:
mov 0,%o0
2:
.end
Figure 18 - driver32.c source file for 32-bit examples
#include <stdio.h>
void do_nothing();
int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);
double sum_val(double a, double b);
double sum_ref(double *a, double *b);
int is_true(int i);
void main()
{
double a=3.11,b=7.22;
do_nothing();
printf("add_up %i\n",add_up(1,2,3,4,5,6,7));
printf("sum_val %f\n",sum_val(a,b));
printf("sum_ref %f\n",sum_ref(&a,&b));
printf("is_true 0=%i,1=%i\n", is_true(0),is_true(1));
}
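When checking a template it can help to have a plain C version of the same routine to compare results against. The following is a minimal sketch of such reference versions for the three arithmetic templates. The _ref names are hypothetical and were chosen so that they do not clash with the template names.

```c
/* Plain C reference version of the add_up template:
   the sum of seven integer parameters. */
int add_up_ref(int v1, int v2, int v3, int v4, int v5, int v6, int v7)
{
    return v1 + v2 + v3 + v4 + v5 + v6 + v7;
}

/* Plain C reference version of the sum_val template:
   the sum of two doubles passed by value. */
double sum_val_ref(double a, double b)
{
    return a + b;
}

/* Plain C reference version of the sum_ref template:
   the sum of two doubles passed by reference. */
double sum_ref_ref(const double *a, const double *b)
{
    return *a + *b;
}
```

With the arguments used in the driver, add_up_ref(1,2,3,4,5,6,7) returns 28, the same value the template version prints in Figure 21.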
Complete Source Code for 64-Bit Examples
Figure 19 - inline64.il template file for 64-bit template examples
/* The following template does nothing*/
.inline do_nothing,0
nop
.end
/*Add up 7 integer parameters - last one will be passed on stack*/
.inline add_up,56
add %o0,%o1,%o0
ldx [%sp+0x8af],%o1
add %o2,%o3,%o2
add %o4,%o5,%o4
add %o0,%o1,%o0
add %o2,%o4,%o2
add %o0,%o2,%o0
.end
/*sum of two doubles 64-bit calling convention*/
.inline sum,16
faddd %f0,%f2,%f0
.end
/*return whether true*/
.inline is_true,4
cmp %o0,%g0
bne 1f
nop
mov 1,%o0
ba 2f
nop
1:
mov 0,%o0
2:
.end
Figure 20 - driver64.c source file for 64-bit examples
#include <stdio.h>
void do_nothing();
int add_up(int v1,int v2, int v3, int v4, int v5, int v6, int v7);
double sum(double a, double b);
int is_true(int i);
void main()
{
double a=3.11,b=7.22;
int v1=1, v2=2, v3=3, v4=4, v5=5, v6=6, v7=7;
do_nothing();
printf("add_up %i\n",add_up(v1,v2,v3,v4,v5,v6,v7));
printf("sum %f\n",sum(a,b));
printf("is_true 0=%i,1=%i\n", is_true(0),is_true(1));
}
Running Examples
Figure 21 - Compile and run sequence for the examples
% cc -O driver32.c inline32.il
% a.out
add_up 28
sum_val 10.330000
sum_ref 10.330000
is_true 0=1,1=0
% cc -O -xarch=v9 driver64.c inline64.il
% a.out
add_up 28
sum 10.330000
is_true 0=1,1=0
About the Author
Darryl Gove is a staff engineer in the Compiler Performance Engineering group at Sun Microsystems Inc.,
analyzing and optimizing the performance of applications on current and future UltraSPARC systems.
Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK.
Before joining Sun, Darryl held various software architecture and development roles in the UK.