Using Intel VTune

Compile your code with the -g flag added, for example "-O3 -g".

The command-line interface to VTune is vtl. For more information on its options and usage, run:
vtl -help

To start VTune (make sure vtl is in your PATH):

vtl activity act1 -c sampling \
    -o "-ec en='CPU_CYCLES',en='IA64_INST_RETIRED-THIS',en='L3_MISSES'" \
    -d 360 0 -app /home/$USER/Source/a.out run


  1. act1 is the name of this activity.
  2. -c indicates the collector type, here 'sampling'.
  3. -o indicates the options passed to VTune. By default (no options given), VTune counts instructions retired and CPU cycles. This example adds an extra event to count L3 cache misses as well.
  4. -d indicates the duration (CPU time) for which VTune runs and collects samples.
  5. -app is the application name (include its path if you are working in a different directory from the one where the object files are).
  6. run tells VTune to run this activity; without it, the activity is not executed.

After the activity has been run you can check its name:
vtl show

Next, find the process ID of your application:
vtl view a1::r1 -processes > OUT-proc

Look at the file OUT-proc to find the process ID, then find out which modules are associated with this PID:
vtl view a1::r1 -modules -pid 7023 > OUT-modules

Once you have selected the module, you can find out where the hotspots in your code are:
vtl view a1::r1 -hf -mn a.out > OUT-hotspots


  1. -hf indicates hot functions
  2. -mn indicates module name (path)
  3. a1::r1 indicates activity-1, result-1

Now you can look at the output of the previous step and find the subroutines or functions where your code spends most of its time, or where most of the L3 misses take place, depending on your options:
vtl view a1::r1 -code -mn a.out -fn sub1_ > OUT-sub1
vtl view a1::r1 -code -mn a.out -fn sub2_ > OUT-sub2
vtl view a1::r1 -code -mn a.out -fn sub3_ > OUT-sub3
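
When several functions need the same report, the per-function commands can be generated by a loop. The following is a minimal sketch, assuming a POSIX shell. The helper name view_code and the VTL variable are hypothetical conveniences, not part of vtl; only the vtl options already shown above are used.

```shell
# Hypothetical helper (not part of vtl): write one source-level report per function.
# VTL defaults to the real vtl binary; set VTL=echo for a dry run that only
# records the command line that would have been executed.
VTL=${VTL:-vtl}

view_code() {
    # $1 = activity::result id (e.g. a1::r1), $2 = module name, rest = function names
    result=$1
    module=$2
    shift 2
    for fn in "$@"; do
        # The function names in this example end in an underscore (sub1 -> sub1_),
        # so it is appended here; drop it if your symbols are named differently.
        "$VTL" view "$result" -code -mn "$module" -fn "${fn}_" > "OUT-$fn"
    done
}
```

For the example above, this would be called as: view_code a1::r1 a.out sub1 sub2 sub3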

The output files show the events and the lines of code where the hotspots are. Users should be able to analyze this information and test possible solutions to improve performance. For example, for this particular code I found that the L3_MISSES can be reduced by adding the compiler flags -pad (pad arrays) and -unroll8 (unroll do loops).
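
As a sketch of how those flags might be applied, the rebuild step could look like the following. The compiler driver name ifort, the source file name prog.f90, and the helper name rebuild are assumptions for illustration; only the flags themselves come from the text above.

```shell
# Assumption: the Intel Fortran driver is called ifort; override FC if yours differs.
FC=${FC:-ifort}
BASE_FLAGS="-O3 -g"          # optimization plus debug symbols, as used for profiling
TUNE_FLAGS="-pad -unroll8"   # array padding and loop unrolling, from the text above

rebuild() {
    # $1 = source file (hypothetical name, e.g. prog.f90), $2 = output executable
    # The flag variables are left unquoted on purpose so they split into words.
    $FC $BASE_FLAGS $TUNE_FLAGS -o "$2" "$1"
}
```

After a call such as rebuild prog.f90 a.out, rerun the sampling activity and compare the L3_MISSES counts with the earlier result to check whether the flags actually helped.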

Likewise, the main do loops in subroutines sub1 and sub2, where most of the time is consumed, can be parallelized with a simple OpenMP parallel reduction directive, as seen in OUT-sub1 and OUT-sub2.

©2010 NSCEE