
VTune microarchitecture analysis metrics vs. varying CPU frequency


Hi,
I ran a code at different frequencies and collected VTune data (on an Intel Xeon Platinum 8280 processor, RHEL 7) using the Microarchitecture Exploration analysis. I understand that VTune (v2020) can be used to identify the portions of code that underutilize the given hardware resources on a processor. I did this experiment to see how the application responds to variation of a particular hardware component, i.e. which hardware component limits the scaling of this application (for example, memory frequency, CPU frequency, etc.).

So, I gathered the data at various frequencies (using acpi-cpufreq) and followed the breakdown trail of the metrics shown in red in the VTune GUI:
  1: Back-End Bound --> 2: (Memory Bound, Core Bound) --> 3: DRAM Bound --> 4: (Memory Bandwidth, Memory Latency) --> 5: Local DRAM.
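For reference, a minimal sketch of how the sweep and the collections could be driven from a script (the application name, result directories and governor setup below are placeholders/assumptions, not my exact commands):

```python
#!/usr/bin/env python3
# Sketch of the frequency sweep (run as root). Assumes the acpi-cpufreq
# "userspace" governor is available and the VTune 2020 command-line tool
# "vtune" is on PATH; adjust for your setup.
import subprocess

FREQS = ["1.0GHz", "1.4GHz", "1.8GHz", "2.0GHz", "2.6GHz", "2.7GHz"]
APP = ["./my_app"]  # placeholder for the real workload and its arguments

# Switch to the userspace governor so a fixed frequency can be requested.
subprocess.run(["cpupower", "frequency-set", "-g", "userspace"], check=True)

for freq in FREQS:
    # Pin the cores to the requested frequency.
    subprocess.run(["cpupower", "frequency-set", "-f", freq], check=True)
    # Collect the Microarchitecture Exploration analysis at this frequency.
    result_dir = "uarch_" + freq.replace(".", "_")
    subprocess.run(["vtune", "-collect", "uarch-exploration",
                    "-result-dir", result_dir, "--"] + APP, check=True)
```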

I noticed that:
a) Back-End Bound = Memory Bound + Core Bound, e.g. (62% of Clockticks = 42% + 20%)
b) Memory Bound ~= L1 Bound + L2 Bound + L3 Bound + DRAM Bound + Store Bound (42% ~= 8% + 3% + 2% + 20% + 6%)
c) DRAM Bound < Memory Bandwidth + Memory Latency (20% < 28% + 10%)
d) Memory Latency << Local DRAM + Remote DRAM + Remote Cache (10% << 97% + 2% + 1%)

Q1: What could be the reason behind the subcategory totals exceeding the category value for c and d?
     For c and d I was expecting something like DRAM Bound = Memory Bandwidth + Memory Latency.

Q2: On increasing the CPU frequency, I got the following from VTune for DRAM Memory Bandwidth:
    1 GHz            - 28 % of Clockticks
    1.4 GHz          - 37 %
    1.8 GHz          - 42 %
    2 GHz            - 42.5 %
    2.6 GHz          - 42.8 %
    2.7 GHz          - 42.9 %
    2.7 GHz + boost  - 41.7 %
- The number of CPU stalls (for DRAM) stops increasing once the frequency exceeds 1.8 GHz; I am now looking for the reason behind this behaviour.
I expected that with higher frequency the stalls would grow, as more CPU cycles / pipeline slots would be wasted due to data unavailability.
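My rough reasoning for that expectation, treating the DRAM access latency as an (approximately) fixed time in nanoseconds, was:

$$
\frac{\text{stall cycles}}{\text{DRAM access}} \approx t_{\text{DRAM}}\,[\text{ns}] \times f_{\text{core}}, \qquad \text{compute cycles} \approx \text{constant},
$$

so a higher core frequency should mean more wasted cycles per access and therefore a larger stalled fraction of Clockticks.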
I am focusing on the metrics highlighted in red. As the cache-bound clockticks were almost constant (a 0.2-0.4% increase in each of L1, L2, L3 and Store Bound) across all the frequencies mentioned above, could I say that a larger cache will not help here? That would be contrary to what is mentioned here.

 

Q3: I noted that on varying the frequency, the Vector Capacity Usage (FPU) stays constant at around 70%, which, from the explanation here, means that 70% of my floating-point computations executed on vector (VPU) units and the rest were scalar.
Also, here I can see that there are different types of execution units which can process 256-bit data. Is it possible to see a breakup of the floating-point operations, e.g. how many used the 256-FP MUL unit, how many used the 256-FP Add unit, etc.?

 

Q4: Are 256-FP Add / 256-FP MUL and FMA different units? If yes, then on which port does the front-end unit dispatch the uOPs for FMA? I can't see an FMA unit in the block diagram.
 
Please let me know if more information is required from my end, or if any of the questions mentioned above are vague / unclear.

