Hi all,
I am trying to use VTune Amplifier (Linux version) to profile memory access latency. To get familiar with the tool, I am profiling a toy program that just loads a big array of data. I use the command-line version like this:
amplxe-cl -collect-with runsa -knob event-config=MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32,MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64 ./load
The result I get is the following.
============================================================================
CPU
---
Parameter          r000runsa
-----------------  -------------------------------
Name               Intel(R) Xeon(R) E5v2 processor
Frequency          2394229995
Logical CPU Count  48
Summary
-------
Elapsed Time: 7.757
CPU Usage: 1.000
Event summary
-------------
Hardware Event Type                   Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.REF_TSC              18538027807                9269                              2000003
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32  0                          0                                 100007
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64  24036                      6                                 2003
amplxe: Executing actions 100 % done
=======================================================================
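From the description of the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_* events, the count of *_GT_32 should be greater than or equal to the count of *_GT_64, since every load that takes more than 64 cycles also takes more than 32 cycles. In this case it is not, and this behavior is reproducible.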
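To make the inconsistency concrete, here is a small Python sketch that checks the numbers from the event summary. The rows are copied from the output above; the helper names (`sample_products`, `nesting_holds`) are my own, and the idea that the event count should roughly equal sample count times events per sample is my assumption, not something I found in the VTune documentation.

```python
# Rows copied from the event summary:
# (event name, reported event count, sample count, events per sample)
rows = [
    ("CPU_CLK_UNHALTED.REF_TSC", 18538027807, 9269, 2000003),
    ("MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", 0, 0, 100007),
    ("MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", 24036, 6, 2003),
]


def sample_products(rows):
    """Return (name, reported count, samples * events per sample) per row.

    Assumption: the reported count is derived from the number of samples
    multiplied by the sampling interval ("Events Per Sample").
    """
    return [
        (name, count, samples * per_sample)
        for name, count, samples, per_sample in rows
    ]


def nesting_holds(rows):
    """Check the expected relation count(GT_32) >= count(GT_64)."""
    counts = {name: count for name, count, _, _ in rows}
    gt32 = counts["MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32"]
    gt64 = counts["MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64"]
    return gt32 >= gt64


for name, reported, product in sample_products(rows):
    print(f"{name}: reported={reported}, samples*interval={product}")

print("GT_32 >= GT_64:", nesting_holds(rows))
```

With these numbers the product matches the reported count for CPU_CLK_UNHALTED.REF_TSC, but for *_GT_64 the product is half of the reported count, so some additional scaling that this sketch does not model seems to be applied there. The expected relation between *_GT_32 and *_GT_64 does not hold.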
I checked the errata published in the specification update and found erratum BT241, which states that "The affected events may undercount, resulting in inaccurate memory profiles"; its list of affected events includes MEM_TRANS_RETIRED.LOAD_LATENCY.
Can somebody please explain why the count of MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32 is less than the count of MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64?
Thank you,
Best Regards, ARam