Using UltraSPARC-IIICu Performance Counters to Improve Application Performance
By Darryl Gove, Compiler Performance Engineering Group, Sun Microsystems
The UltraSPARC-IIICu implements a number of very
useful hardware performance counters. These either count processor events (such
as the number of times that a floating point operation completed),
or they count the number of cycles for which something was true (such as
the number of cycles that the processor was waiting for data
from memory). This article introduces you to the performance
counters, indicating which ones are of most interest, and
demonstrates how you might use the Sun ONE Studio Performance Tools to
identify where in your application these events are happening.
You can use this information to improve the
performance of your application.
This is an overview of the
topic and the tools you can use. The commands and
compiler options in these examples may not apply to your code as shown,
and it may be necessary to use different options to get the best results.
Contents
- Performance Counters of Interest
- Reading the Performance Counters
- Using Performance Counters with the Sun ONE Studio
Performance Tools
- Interpreting Analyzer Output
- Common Solutions
- Conclusions
- Links to More Information
Performance Counters of Interest
There are a variety of performance counters
available on the UltraSPARC-IIICu processor. See the processor User's Guide for a
complete overview. The following table shows which performance counters
may be of interest. For clarity the table is broken into
three groups of events.
Interesting UltraSPARC-IIICu Performance Counters
Data layout
DTLB_miss
Number of times that the conversion of a
virtual data address to physical data address was not immediately
available. Each occurrence costs about 100 cycles.
Re_DC_miss
The number of cycles which were lost due to
data not being in the Level-1 cache.
Re_EC_miss
The number of cycles which were lost due to
data not being in the Level-2 cache (these cycles are included in the
Level-1 cache miss cycles, i.e. Re_EC_miss < Re_DC_miss).
Rstall_storeQ
Number of cycles where the processor was
stalled waiting for store operations to complete.
Code layout
ITLB_miss
Number of times that the conversion of a
virtual instruction address to physical instruction address was not
immediately available. Each occurrence costs about 100 cycles.
Dispatch0_IC_miss
Number of cycles where no instructions were
dispatched because of an instruction cache miss.
Re_RAW_miss
Number of cycles stalled due to Read after
Write of data.
Dispatch0_mispredict
Number of cycles where no instructions were
dispatched because of a mispredicted branch.
Load-use stalls
Rstall_FP_use
Number of cycles stalled because the processor
was waiting for a floating point value to be generated.
Rstall_IU_use
Number of cycles stalled because the processor
was waiting for an integer value to be generated.
Reading the Performance Counters
There are three possible tools that you might use to read and interpret
the hardware performance counters.
-
Starting with the Solaris 8 operating environment,
there are two very useful
tools for reading the performance counters - cpustat and cputrack.
cpustat is run as root, and reports the performance counter
statistics on a system-wide basis. cputrack is run on a single
application and reports only those events which occur to that
application. (Links to their man pages appear below.)
-
The tool har is available to provide you
with synthetic system-wide performance metrics (like flops).
(See links below.)
-
The Sun ONE Studio Performance Tools can read
performance counters and attribute the performance counter events to
the location in the code where the events occurred.
My own approach is to use cputrack to
gather high-level summary stats of the application, and then to
collect detailed information on just the top scoring events using the
Sun ONE Studio Performance Tools. I will run through an example of
doing this on a fictitious code:
Our Code Example
$ more sumtest.c
int main()
{
  double d[20000];
  double total,total1;
  int count, rpt;

  for (count=0; count<20000; count++)
    d[count]=0.01;

  total=1;
  total1=0.5;
  for (rpt=0; rpt<50000; rpt++)
  {
    for (count=0;count<20000;count++)
      total+=total1*d[count];
    total1=total/1.776;
  }
  if (total==0.5) return 1 ; else return 0;
}
The above code was compiled in the following way:
cc -g -O -o sumtest sumtest.c
The -g flag is necessary to include debug information,
which will greatly help with the analysis later. Note the
following:
-
At levels of optimisation higher than -O3,
the compiler will perform some transformations which make the code
harder to understand under the debugger. At these levels of
optimisation, the code generated (and the performance) is the same
whether or not debug information is present. At lower levels of
optimisation there is some performance impact.
-
At levels of optimisation -O3 and below, the
compiler will favour clarity of the generated code over performance.
With C++, -g tells the compiler
to avoid doing some inlining of routines. Because C++ code often has many
short methods that benefit from inlining, this can have a significant
performance impact. In order to get the best performance while still
generating debug information, use the C++ compiler flag -g0,
which generates the debug information and does the inlining - however
it will make the resulting code harder to debug.
Both cpustat and cputrack can either
report on just a single pair of performance counters, or they can
rotate through a selection of counters. If your application executes for a
sufficiently long time, and you're happy that the application's behaviour
is reasonably homogeneous (i.e. the events are evenly spread through
the entire runtime, and not bunched up), then rotating through the
performance counters is a reasonable approach. If you are not sure,
then running the application a number of times to collect all the
data is always possible.
Invoking cputrack in the following way will
collect information on the most useful counters.
Running cputrack
cputrack -nfe -T 1 \
-c pic0=Cycle_cnt,pic1=DTLB_miss,sys \
-c pic0=Instr_cnt,pic1=Re_DC_miss,sys \
-c pic0=Dispatch0_IC_miss,pic1=Re_EC_miss,sys \
-c pic0=Rstall_storeQ,pic1=Rstall_FP_use,sys \
-c pic0=Rstall_IU_use,pic1=Re_RAW_miss,sys \
<app> <params>
Every second, cputrack will report the statistics
for every thread in the program - these are reported in rows labelled
'tick'. At the end of the application's run, cputrack will report
summary data for each pair of performance counters - these rows are
labelled 'exit'. The following is example data.
Output from cputrack

# cputrack -nfe -T 1 \
-c pic0=Cycle_cnt,pic1=DTLB_miss,sys \
-c pic0=Instr_cnt,pic1=Re_DC_miss,sys \
-c pic0=Dispatch0_IC_miss,pic1=Re_EC_miss,sys \
-c pic0=Rstall_storeQ,pic1=Rstall_FP_use,sys \
-c pic0=Rstall_IU_use,pic1=Re_RAW_miss,sys \
sumtest
1.018 14135 1 tick 1065399307 649 # pic0=Cycle_cnt,pic1=DTLB_miss,sys
2.168 14135 1 tick 752041715 3709955 # pic0=Instr_cnt,pic1=Re_DC_miss,sys
3.128 14135 1 tick 41193 212699 # pic0=Dispatch0_IC_miss,pic1=Re_EC_miss,sys
4.058 14135 1 tick 58242 542371011 # pic0=Rstall_storeQ,pic1=Rstall_FP_use,sys
5.048 14135 1 tick 6621 3203 # pic0=Rstall_IU_use,pic1=Re_RAW_miss,sys
6.058 14135 1 tick 1064502279 13 # pic0=Cycle_cnt,pic1=DTLB_miss,sys
7.058 14135 1 tick 932935082 4051475 # pic0=Instr_cnt,pic1=Re_DC_miss,sys
8.058 14135 1 tick 29433 210045 # pic0=Dispatch0_IC_miss,pic1=Re_EC_miss,sys
6.058 14135 1 exit 2129901586 662 # pic0=Cycle_cnt,pic1=DTLB_miss,sys
7.058 14135 1 exit 1684976797 7761430 # pic0=Instr_cnt,pic1=Re_DC_miss,sys
8.058 14135 1 exit 70626 422744 # pic0=Dispatch0_IC_miss,pic1=Re_EC_miss,sys
8.932 14135 1 exit 122575 1052317277 # pic0=Rstall_storeQ,pic1=Rstall_FP_use,sys
5.048 14135 1 exit 6621 3203 # pic0=Rstall_IU_use,pic1=Re_RAW_miss,sys
The lines of interest are the totals - labelled as
'exit'. The columns show:
-
The time that the sample was collected
-
The pid of the process (in this case 14135)
-
The ID of the LWP that the counts refer to (in
this case there is only 1 LWP, and so it is identified as 1).
-
A label which is either tick or exit - tick
indicates the completion of a single second of data collection (notice
that the default interval is a second, but this can be overridden),
exit indicates that the numbers on the line are totals for the counter
observed over the entire run of the application.
-
The event count for the first counter
-
The event count for the second counter
-
A comment indicating what the two counters
were counting.
The next thing to do is to present this data in a
readable way, as shown in the following table.
Data Summary
Counter            Events         Cycles         % of total cycles
Cycle_cnt          2,129,901,586  2,129,901,586  N/A
Instr_cnt          1,684,976,797  0              N/A
DTLB_miss          662            66,200         0.00%
Re_DC_miss         7,761,430      7,761,430      0.36%
Re_EC_miss         422,744        422,744        0.02%
Dispatch0_IC_miss  70,626         70,626         0.00%
Rstall_storeQ      122,575        122,575        0.01%
Rstall_FP_use      1,052,317,277  1,052,317,277  49.41%
Rstall_IU_use      6,621          6,621          0.00%
Re_RAW_miss        3,203          3,203          0.00%
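Most of these events are already recorded in terms
of the number of cycles. The exception is the TLB misses, which are
counted as events. Recall that each TLB miss is estimated to cost about
100 cycles, so the 662 DTLB misses account for roughly 66,200 cycles.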
Calculating Instructions Per Clock (IPC)
One of the often quoted metrics is IPC - or
instructions per clock. Using the above table we can calculate the IPC
as 0.79 (1,684,976,797 instructions divided by 2,129,901,586 cycles).
This means that just under one instruction is completed every
cycle. IPC is sometimes used as a measure of performance - the higher
the IPC the better. Unfortunately, it's not a good metric for this task.
IPC can be made higher by decreasing the
number of cycles that the application takes to run - this is the
situation that you would prefer. Alternatively, it can be made higher by
increasing the number of instructions issued while keeping the number of
cycles the same, or even making it worse - this is not such a good outcome.
Looking at the example data it is apparent that
this application suffers from FP-use stalls - half the total number
of cycles is spent waiting for FP data. We can use this information
to collect some detailed data using the Sun ONE Studio Analyzer.
Using Performance Counters with the Sun ONE Studio
Performance Tools
The vital component of the Sun ONE Studio
Performance Tools is the Analyzer (see the links at the end of this article).
Here is a quick summary of the Analyzer's features.
-
The Analyzer looks at what your application is
doing one hundred times per second (by default), plus it can also look
every few hundred thousand performance counter events. Whenever the
Analyzer looks, it records where in its code the application was executing.
Consequently, at the end of a run, you can determine which routines were hot, and where the events
occurred.
-
The Analyzer can attribute time to lines of
source code - so long as the application was compiled with -g (or -g0
for C++). Compiling with -g will have no effect on performance if the
application was compiled at high levels of optimization (-xO4, -xO5),
but if the application was compiled at low levels of optimisation (-O,
-xO2, -xO3) then some minor optimizations are disabled to make the
output clearer. A suggestion is to compile with -g whenever possible as
it will really make a difference to both debugging your application and to
investigating its performance.
-
The Analyzer has two parts - the tool collect
which collects the data, and the GUI analyzer which displays
the results. You can invoke collect multiple times on the same
application with the same or different parameters, and load all the
experiments into the GUI at the same time - this allows you to get
really good code coverage on your application. If you prefer there is
also a command line version of the Analyzer called er_print.
The examples on this page use er_print to generate the
appropriate output.
Run your application under collect as follows.
Running With collect
collect -p on -h <performance counter 1>,on,<performance counter 2>,on <app> <params>
The flag "-p on" tells collect to use
time based profiling, and selects the default interval (10ms) between
samples. The flag -h tells collect to collect counter overflow events
on the specified performance counters.
Picking the instruction
count and FP-use stalls counters on the application above we can do the following.
Collecting Data with Two Counters
$ collect -p on -h Instr_cnt,on,Rstall_FP_use,on sumtest
Creating experiment database test.1.er ...
Here we have collected the first experiment (test.1.er)
and will look at the results. Rather than showing screen
shots of the GUI, I'll use the command line version of the tool
(er_print) to display text output.
Our first display shows which functions are hot.
Displaying Test Results with er_print
$ er_print -limit 20 -metrics e.user -func test.1.er
current: e.user:name
Functions sorted by metric: Exclusive User CPU Time
Excl. Name
User CPU
sec.
8.830 <Total>
8.830 main
0. <Unknown>
0. __collector_open_experiment
0. __open
0. _audit_objclose
0. _exithandle
....
Here we ask er_print to display the following.
-
Only show the first 20 functions (-limit 20).
Often applications can have hundreds of routines, and telling er_print
to only display the first ones means that the hottest routines have not
scrolled off the screen.
-
Only show the exclusive user time metric
(-metrics e.user). There are a multitude of metrics collected; the most
useful ones are often user time, system time, and wall time. The
metrics are typically available in two flavours - exclusive and
inclusive. Exclusive means attributable to this routine and this
routine alone. Inclusive means attributable to this routine and all the
routines that it calls.
-
Show a list of the time attributed to functions
(-func)
-
Use the experiment test.1.er
Unsurprisingly, we can see that the function main
is hot. There are various other functions that also appear,
but basically no time was spent in them. We need to look at main
in a bit more detail.
Order of Parameters for er_print
er_print, like most applications,
interprets the command line from left to right, which means that
options on the left will apply to those on the right, and that options
on the right will override those on the left. The output
would be very different if the positions of the -func flag and the
-metrics <metrics> flag were swapped on the command line.
Interpreting Analyzer Output
There are two basic rules for interpreting the
output of the Analyzer:
-
The time attributed to an instruction is the
time that the instruction spent waiting to be executed - not the time
that the instruction spent executing. So if you see an instruction
which has a lot of time attributed to it - look at the previous
instructions to determine which one is really to blame for the time.
(Note that this means that the source code attribution of timing may
not be totally accurate.)
-
The processor hesitates before attributing time
to floating point instructions. This means that in a sequence of
floating point instructions, it is sometimes the case that the first
couple do not get any time attributed to them - even when they really
should.
Let's have a look at a couple of examples:
0. 10980: inc 9, %l6
0.070 10984: ld [%i4 + %g2], %f16
2.590 10988: nop
0. 1098c: cmp %l4, %l3
In the above example, the load instruction is
actually taking all the time, but the 'nop' following it gets
attributed with the 2.59 seconds of time.
0.010 5699f0: ldd [%l0], %f0
0. 5699f4: fsubd %f6, %f4, %f26
2.630 5699f8: ldd [%l0 + %o1], %f2
In the above example, the first floating point
load double actually causes the delay, but the processor hesitates in
reporting the delay while it completes the floating point
subtraction, and it finally reports the delay on the second load.
Interpreting the Analyzer output is something of a
skill. However, in most cases it is not too hard if you remember the
above two rules. Note that there are other situations which can be
confusing - for example a branch target might accumulate a large
amount of time. The Analyzer documentation contains more details.
Returning to the example, let's have a look at how the attribution of time
to the source code works:
$ er_print -metrics e.user:e.Rstall_FP_use -src main test.1.er
current: e.user:e.Rstall_FP_use:name
Source file: ./sumtest.c
Object file: ./sumtest.o
Load Object: ./sumtest
Excl. Excl.
User CPU Rstall_FP_use
sec. Events sec.
1. int main()
0. 0. 2. {
3. double d[20000];
4. double total,total1;
5. int count, rpt;
6.
0. 0. 7. for (count=0; count<20000; count++)
0. 0. 8. d[count]=0.01;
9.
0. 0. 10. total=1;
0. 0. 11. total1=0.5;
0. 0. 12. for (rpt=0; rpt<50000; rpt++)
13. {
0. 0. 14. for (count=0;count<20000;count++)
## 8.830 4.742 15. total+=total1*d[count];
0. 0. 16. total1=total/1.776;
17. }
0. 0. 18. if (total==0.5) return 1 ; else return 0;
19. }
In this case I have asked er_print to:
-
Use the metrics exclusive user CPU time, and
exclusive Rstall_FP_use events (-metrics e.user:e.Rstall_FP_use).
Rstall_FP_use is the hot performance counter that we collected the data
for. It is quite useful to limit the number of columns of counter data
reported, because interpreting the output can be tricky, and it's
almost impossible to do if the lines wrap around.
-
Show the source for the routine main (-src
main). To show the source code you need to have compiled your
application with debug information.
So immediately you can see (as you would expect)
that the bulk of the time is spent on line 15 which does the
summation part of the code. So we need to drill down a bit further to
see what's going on at the assembly code level. Here's the
disassembly code for the hottest routine; most of the code has been
omitted for clarity.
$ er_print -metrics e.user:e.Rstall_FP_use -dis main test.1.er
current: e.user:e.Rstall_FP_use:name
Source file: ./sumtest.c
Object file: ./sumtest.o
Load Object: ./sumtest
Excl. Excl.
User CPU Rstall_FP_use
sec. Events sec.
....
0. 0. [12] 106d8: add %i2, 848, %l6
13. {
14. for (count=0;count<20000;count++)
0. 0. [14] 106dc: clr %i1
0. 0. [14] 106e0: mov %l4, %l7
15. total+=total1*d[count];
0.400 0.903 [15] 106e4: ld [%l7], %f0
0.550 0.016 [15] 106e8: inc %i1
0.460 0.943 [15] 106ec: ld [%l7 + 4], %f1
0.720 0. [15] 106f0: cmp %i1, %l5
0. 0. [15] 106f4: fmuld %f6, %f0, %f2
0. 0. [15] 106f8: faddd %f8, %f2, %f8
## 6.240 2.879 [15] 106fc: bl 0x106e4
0.460 0. [15] 10700: inc 8, %l7
16. total1=total/1.776;
0. 0. [16] 10704: inc %i3
...
In this case I have asked er_print to:
-
Use the metrics exclusive user CPU time and
exclusive number of cycles spent waiting for FP data (-metrics
e.user:e.Rstall_FP_use).
-
Disassemble the routine main (-dis main). Note
that for real codes it is probably useful to use the option -outfile
<filename> which will send all the following output to the
specified file. The disassembly can get quite long, and it is probably
best to either view it in an editor, or run the analyzer GUI.
You can see that the bulk of the time appears to
be spent on the branch instruction. However, from the discussion
above, you'll be looking to see if the time is really caused by other
instructions.
You can see nearly 3 seconds of FP use stalls on
the branch. So we can be sure that much of the time spent
on that branch instruction is due to the preceding floating point
instructions.
So let's talk through this snippet of code.
Notice that there are two single-precision floating
point load instructions, one to load %f0 and one to load %f1. What is
happening here is that the compiler is assuming that the data is
4-byte aligned and therefore requires two four-byte loads rather than
a single eight byte load. If you look at the fmuld instruction it
consumes the double-precision floating point number %f0 (the double
precision floating point register %f0 is made up of the two single
precision registers %f0 and %f1). The compiler flag -dalign would
help the compiler use a single floating point load double rather than
the two floating point load singles.
The time attributable to the load instructions
appears on the instructions following them. As expected there is
little time here - the data is cache resident. However, if the arrays
were larger, it would be useful for the compiler to insert prefetch
instructions - the compiler will do this if the flags -xprefetch
-xdepend -xtarget=ultra3 -xarch=v8plusa are used on the compile line.
Now we have the two floating point instructions.
First the fmuld. Unfortunately the data that this instruction
requires is provided by the second load instruction - this
instruction will still be completing when the multiply gets issued -
so there will be a delay at this point. Then there's the add. The add
requires the data from the multiply - and it takes a couple of cycles
for the multiply to complete - so once again there's another FP use
stall.
Finally there is a branch instruction.
What can we do about this code? We can
try to recompile with -fast to let the compiler be more aggressive
with its optimizations. This may not be suitable for all codes, but
we can give it a try for this one.
Trying -fast
$ cc -g -fast -o sumtest sumtest.c
$ collect sumtest
Creating experiment database test.2.er ...
$ er_print -metrics e.user -func test.2.er
current: e.user:name
Functions sorted by metric: Exclusive User CPU Time
Excl. Name
User CPU
sec.
4.990 <Total>
4.990 main
0. __collector_open_experiment
0. __open
0. _init
0. _open
0. _private_close
0. _rt_boot
Now you can see that by increasing the optimization
level from -O to -fast we went from nearly 9 seconds runtime down to
about 5 seconds. Let's have a look at what it did to the hot bit of
code to get this performance gain.
$ er_print -metrics e.user -dis main test.2.er
current: e.user:name
Source file: ./sumtest.c
Object file: ./sumtest.o
Load Object: ./sumtest
Excl.
User CPU
sec.
...
Loop below pipelined with steady-state cycle count = 1 before unrolling
Loop below unrolled 6 times
Loop below has 1 loads, 0 stores, 0 prefetches, 1 FPadds, 1 FPmuls, and 0 FPdivs per iteration
14. for (count=0;count<20000;count++)
15. total+=total1*d[count];
0. [15] 106d0: sethi %hi(0x27000), %g1
....
0. [14] 10764: add %g1, %fp, %l5
0. [15] 10768: fmuld %f6, %f24, %f2
0. [15] 1076c: faddd %f8, %f16, %f4
0.210 [15] 10770: inc 6, %i5
0.080 [15] 10774: ldd [%l5], %f24
0. [15] 10778: fmuld %f6, %f60, %f30
0. [15] 1077c: faddd %f10, %f18, %f0
## 1.450 [15] 10780: cmp %i5, %i0
0. [15] 10784: ldd [%l5 + 8], %f60
0. [15] 10788: fmuld %f6, %f26, %f16
0. [15] 1078c: faddd %f12, %f20, %f8
0.130 [15] 10790: ldd [%l5 + 16], %f26
## 1.470 [15] 10794: inc 48, %l5
0. [15] 10798: fmuld %f6, %f24, %f18
0. [15] 1079c: faddd %f14, %f22, %f10
0.080 [15] 107a0: ldd [%l5 - 24], %f24
0. [15] 107a4: fmuld %f6, %f60, %f20
0. [15] 107a8: faddd %f4, %f2, %f12
0.150 [15] 107ac: ldd [%l5 - 16], %f60
0. [15] 107b0: fmuld %f6, %f26, %f22
0. [15] 107b4: faddd %f0, %f30, %f14
## 1.440 [15] 107b8: ble,pt %icc,0x10768
0.050 [15] 107bc: ldd [%l5 - 8], %f26
...
Let's discuss the commentary that
the compiler inserts into the output. Here it is telling us:
-
That the loop was unrolled six times. This means
that six iterations of the loop were done back-to-back. This
optimisation reduces the number of times that the processor has to take
a branch.
-
The loop was pipelined. This is a more complex
optimisation where the next iteration of the loop is started before the
current one completes - think of it as taking the six unrolled
iterations of the loop and interleaving them so that one instruction is
done from the first, then an instruction from the second etc. You can
see this looking at the load at 0x10774, this loads %f24 which is used
by the multiply at 0x10798. The multiply generates %f18 which is used
by the add at 0x1077c.
-
The compiler commentary also tells us of the
instruction make up of the loop - the number of each type of
instruction per iteration.
In this case the big gain comes from pipelining
the loop: there's now sufficient time between one instruction
generating a result and the instruction that requires that result.
There are a few other things that it is worth
pointing out about the loop:
-
-fast includes the -dalign flag, so the two
loads of single precision values have now become a single load of a
double precision value.
-
The flag which allows the compiler freedom to do
the floating point optimisations is -fsimple=2, which is included in
-fast. This flag tells the compiler that it is not necessary to do the
floating point calculations in the exact same order that they are
specified in the code. So if you look at the floating point adds in the
above disassembly, you can identify that what was in the source a
simple summation, has been split into two separate summations (which
will be added together after the loop completes). %f4, %f8, and %f12
are one group, and %f0, %f10, and %f14 are another group.
Common Solutions
While the causes for a high number of events for
a given counter are very dependent on the characteristics of the
application, the following table outlines some suggestions for what
you might try depending on the performance counter events.
Suggested Solutions
Data Layout
DTLB_miss
You are using a lot of data in your
application - so the processor needs to be able to map a large amount
of memory. There is the facility in Solaris 9 to use large pages (the
compiler flag -xpagesize is available to assist in doing this).
It may also be the case that you are using
data structures which have a low data-density. For example, you may
only be accessing a single element from a large data structure. Check
the data structures to determine whether they are being used
efficiently or not.
Re_DC_miss
Re_EC_miss
L1 and L2 cache misses indicate that the
application is spending time having to go to memory. The compiler can
do a good job at reducing the time spent in cache misses if the
application is recompiled with -xprefetch enabled - you should also
include the flags -xdepend -xtarget=ultra3 and -xarch=v8plusa.
If the problem persists, take a look at the
data structures in the application and see if they are using memory
efficiently. Can the accesses be made such that adjacent memory
locations are used? This can be done by reordering the elements in
structures so that the fields which are accessed frequently are placed
close together, or by going through arrays one adjacent element at a
time.
Rstall_storeQ
This indicates that stores are being put into
the store queue faster than they can drain to memory. Recompiling with
prefetch can improve this. Changing the way data is stored so that
stores to adjacent memory locations appear together in the code will also help.
Code Layout
ITLB_miss
Instruction TLB misses probably indicate that
the application has a lot of code in it. Using large pages to map the
application into memory will help. It is also possible that the
compiler can generate a more optimal code layout through one of the
following - profile feedback (-xprofile=[collect:|use:]),
interprocedural optimisation (-xipo), use the link time optimiser
(-xlinkopt), or mapfiles (-xmapfile=). Mapfiles can be generated from
the Analyzer, and tell the compiler the best order in which to lay out
the routines in memory.
Dispatch0_IC_miss
Instruction cache misses are similar to ITLB
misses, but less severe. The solutions are broadly similar. Use
mapfiles to organise the layout of the hot routines in memory. Use
profile feedback to optimise branches in order to make the normal case
the one with the linear code-path. Add -xipo to get crossfile
optimisations, which will inline short routines. In Sun ONE Studio 8, use -xlinkopt
to invoke the link time optimiser to improve instruction cache
utilisation.
Re_RAW_miss
This means that some locations in memory are
being accessed in such a way that the processor is finding it hard to
determine whether the stored data should be passed directly to the
following load. The easiest solution to this is to recompile with the
latest compiler and specify -xtarget=ultra3 to get scheduling for
UltraSPARC-IIICu. If the condition persists, alter the source code to avoid
stores and loads from locations that are very close in memory.
Dispatch0_mispredict
If this counter is high, it means that there
are many mispredicted branches. Use profile feedback to improve the
scheduling of the branches so as to improve branch predictability.
Load-Use Stalls
Rstall_FP_use
This counter indicates that there are floating
point instructions that are being delayed whilst they wait for the
results of previous floating point operations. Recompile the
application with -fsimple=2 (if possible) to substitute simpler (but
non-IEEE 754 compliant) code sequences. Or if this is not appropriate,
locate the hot spots in the code and see if there is an alternative way
of coding them to avoid the problem. In particular watch out for FP
divides and square roots which are time consuming operations.
Rstall_IU_use
This counter indicates that there are integer
instructions waiting for the completion of previous integer operations.
The simplest solution is to recompile with the latest compiler, specifying
-xchip=ultra3 to get appropriate instruction latencies for the
UltraSPARC-IIICu platform. However for C and C++ applications it may be
possible to recompile with aliasing information (-xalias_level or
-xrestrict) if appropriate - these flags are quite complex, and you
should be sure that you understand what you are telling the compiler
before you use them.
Conclusions
You should have a much better idea about how to use the
performance counters to highlight performance opportunities in your
application. You should also understand how to use this knowledge to
drill down in more detail using the Sun ONE Studio Analyzer. If you do
follow this procedure, you should not be surprised to find that you
can make significant performance improvements for your application.
Obviously the performance gains you make will depend on how well
optimised the original code was, how recent the compiler you are
using is, and what your application is trying to do. Having said that,
we often find that performance gains ranging from 10% to orders of magnitude
are possible just by looking at what the counters and the Analyzer
are telling us.
Links to More Information
-
cputrack Command
-
cpustat Command
-
Sun ONE Studio Performance Analysis Tools
-
Technical Article:
Performance Analysis and Monitoring Using Hardware Counters in Solaris.
-
UltraSPARC-IIICu User's Guide
-
Compiling for the UltraSPARC-IIICu Processor
About the Author
Darryl Gove
is a staff engineer in the Compiler Performance Engineering group at Sun Microsystems Inc.,
analyzing and optimizing the performance of applications on current and future UltraSPARC systems.
Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK.
Before joining Sun, Darryl held various software architecture and development roles in the UK.