Compile your code with the -g flag added to your usual options, for example "-O3 -g". The debugging symbols let VTune map samples back to source lines.
The command to use VTune is vtl. Please look at this web page to find out more information on options and usage, or run:
vtl -help
To start VTune (make sure vtl is in your path):
vtl activity act1 -c sampling \
  -o "-ec en='CPU_CYCLES',en='IA64_INST_RETIRED-THIS',en='L3_MISSES'" \
  -d 360 0 -app /home/$USER/Source/a.out run
After the activity has been run you can check its name:
Next find out the process-id number for your application.
vtl view a1::r1 -processes > OUT-proc
Look at the file OUT-proc to find the process number (7023 in this example), then find out
what modules are associated with this pid.
vtl view a1::r1 -modules -pid 7023 > OUT-modules
Once you select the module, you can find out where the hotspots
in your code are.
vtl view a1::r1 -hf -mn a.out > OUT-hotspots
Now you can look at the output for the previous step and find out
the subroutines or functions where your code spends most of its time,
or where most of the L3 misses take place according to your options.
vtl view a1::r1 -code -mn a.out -fn sub1_ > OUT-sub1
vtl view a1::r1 -code -mn a.out -fn sub2_ > OUT-sub2
vtl view a1::r1 -code -mn a.out -fn sub3_ > OUT-sub3
The output files show the events and the lines in the code where there are hotspots. Users should be able to analyze this information and test possible solutions to improve performance. For example, for this particular code I found that the L3_MISSES can be reduced by adding the compiler flags -pad (pad arrays) and -unroll8 (unroll do loops).
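To make the unrolling option concrete, here is a minimal sketch of what unrolling a loop eight times means, written by hand in C. The loop body and the function names axpy_rolled and axpy_unrolled8 are hypothetical and are not taken from the profiled code; the compiler flag does this transformation for you, and whether it helps a given loop has to be checked by profiling again.

```c
#include <stddef.h>

/* Reference loop: y[i] += a * x[i] for every element. */
void axpy_rolled(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* The same loop unrolled eight times by hand (illustrative only). */
void axpy_unrolled8(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    size_t limit = n - (n % 8);   /* largest multiple of 8 not exceeding n */

    /* Main body: eight elements per iteration, one loop test per eight updates. */
    for (; i < limit; i += 8) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
        y[i + 4] += a * x[i + 4];
        y[i + 5] += a * x[i + 5];
        y[i + 6] += a * x[i + 6];
        y[i + 7] += a * x[i + 7];
    }

    /* Remainder loop: handles the last n % 8 elements. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

Both functions produce the same result for any n; the unrolled version only changes how many loop tests and branches are executed per element.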
Likewise, the main do loops where most of the time is consumed in subroutines sub1 and sub2 can be parallelized with a simple OpenMP parallel reduction directive, as seen in OUT-sub1 and OUT-sub2.
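As an illustration of such a reduction directive, the sketch below parallelizes a dot product. The actual loops in sub1 and sub2 are not shown here, so the function dot_reduction and its loop body are hypothetical stand-ins. The example is in C; in Fortran the corresponding directive is !$omp parallel do reduction(+:sum) placed before the do loop.

```c
#include <stddef.h>

/* Hypothetical stand-in for a hot loop: a sum accumulated over all iterations. */
double dot_reduction(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    long i;

    /* Each thread accumulates into a private copy of sum; the copies are
       added together when the loop ends. Without OpenMP enabled at compile
       time the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; ++i)
        sum += x[i] * y[i];

    return sum;
}
```

The reduction clause is what makes the loop safe to run in parallel: without it, all threads would update the shared variable sum at the same time. Because floating-point addition is not associative, the parallel result can differ from the serial one in the last digits.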