Hi,
We are trying to profile an application with VTune (Intel(R) VTune(TM) Amplifier XE 2013, build 353306). It is an MPI application, and for now we are running it as a single-process MPI job.
We tried the snb-access-contention analysis both with call stacks (11 GB of collected data) and without call stacks (18 GB).
I ran them as follows:
Without call stack: amplxe-cl -r snb-access-contention -collect snb-access-contention -data-limit=0
With call stack: amplxe-cl -r snb-access-contention_cs -collect snb-access-contention -knob enable-stack-collection=true -data-limit=0
The log shows that it uses the following performance counters, with the sampling interval (sample-after value) in parentheses:
CPU_CLK_UNHALTED.REF_TSC(2000003)
CPU_CLK_UNHALTED.THREAD(2000003)
INST_RETIRED.ANY(2000003)
MEM_UOPS_RETIRED.ALL_STORES_PS(2000003)
MEM_UOPS_RETIRED.LOCK_LOADS_PS(100007)
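As a rough sanity check on how many samples these values imply, here is a minimal sketch. It assumes a sample is recorded once per sample-after count of events. The clock frequency and run length are hypothetical placeholders, not measurements from our machine, and only the two clock events are included because their event rate can be approximated by the core frequency.

```python
# Rough estimate of how many samples a given sample-after value implies.
# ASSUMED_CLOCK_HZ and ASSUMED_RUN_SECONDS are hypothetical placeholders.

def estimated_samples(event_rate_hz, sample_after, seconds):
    """Assume one sample is recorded every `sample_after` events."""
    return event_rate_hz / sample_after * seconds

ASSUMED_CLOCK_HZ = 2.6e9    # hypothetical core frequency, not measured
ASSUMED_RUN_SECONDS = 600   # hypothetical run length, not measured

# Sample-after values copied from the collection log above.
SAMPLE_AFTER = {
    "CPU_CLK_UNHALTED.REF_TSC": 2000003,
    "CPU_CLK_UNHALTED.THREAD": 2000003,
}

if __name__ == "__main__":
    for event, sav in SAMPLE_AFTER.items():
        # Clock events tick at roughly the core frequency while the core is busy.
        n = estimated_samples(ASSUMED_CLOCK_HZ, sav, ASSUMED_RUN_SECONDS)
        print(f"{event}: ~{n:,.0f} samples per busy core")
```

Dividing the size of a result directory by an estimate like this would give an approximate bytes-per-sample figure to compare between tools.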
I also used HPCToolkit (http://hpctoolkit.org/) with similar sampling intervals, e.g. CPU_CLK_UNHALTED:REF_P(2000000).
With HPCToolkit the collected data is only around 1.5 MB, and if I enable tracing, which gives a timeline view, it grows to 15 MB.
I then need to create a program structure file, which is around 25 MB, but it can be kept as a common file for different counters.
Is there any hint as to why the data collected by VTune is so much larger in comparison? Ours is a Sandy Bridge machine.
I cannot use the pause/resume API because I cannot change the source code: https://software.intel.com/en-us/articles/how-to-call-resume-and-pause-api-from-fortran-code
Thank you
Sriraj