Hi,
I ran a code at different CPU frequencies and collected VTune data (on an 8280 processor, RHEL 7) using the Microarchitecture Exploration analysis. I understand that VTune (v2020) can be used to identify the portions of code that underutilize the given hardware resources on a processor. I did this experiment to see how the application responds to varying a particular hardware component, i.e. which hardware component limits the scaling of this application (for example, memory frequency, CPU frequency, etc.).
So, I gathered the data at various frequencies (set via acpi-cpufreq) and followed the metric breakdown trail of the numbers shown in red in the VTune GUI (a rough sketch of the collection loop follows the trail below) -
1: Back End Bound --> 2: (Memory Bound, Core Bound) --> 3: DRAM Bound --> 4: (Memory Bandwidth, Memory Latency) --> 5: Local DRAM.
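For reference, the collection loop was roughly like this - a minimal sketch, assuming the userspace governor of acpi-cpufreq and the VTune command-line tool; "./my_app" and the result-dir names are placeholders, not the real workload:

```python
# Sketch of the sweep: pin the CPU frequency with cpupower, then collect a
# Microarchitecture Exploration result with the VTune CLI (needs root for
# cpupower; "./my_app" is a placeholder for the real application).
import subprocess

FREQS = ["1.0GHz", "1.4GHz", "1.8GHz", "2.0GHz", "2.6GHz", "2.7GHz"]
APP = ["./my_app"]

# Use the userspace governor so a fixed frequency can be requested.
subprocess.run(["cpupower", "frequency-set", "-g", "userspace"], check=True)

for freq in FREQS:
    subprocess.run(["cpupower", "frequency-set", "-f", freq], check=True)
    # Same analysis type as the GUI's Microarchitecture Exploration.
    subprocess.run(
        ["vtune", "-collect", "uarch-exploration",
         "-result-dir", f"uarch_{freq}", "--"] + APP,
        check=True,
    )
```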
I noticed that -
a) Back-End Bound = Memory Bound + Core Bound, e.g. (62% of clockticks = 42% + 20%)
b) Memory Bound ~= L1 Bound + L2 Bound + L3 Bound + DRAM Bound + Store Bound (42% ~= 8% + 3% + 2% + 20% + 6%)
c) DRAM Bound < Memory Bandwidth + Memory Latency (20% < 28% + 10%)
d) Memory Latency << Local DRAM + Remote DRAM + Remote Cache (10% << 97% + 2% + 1%)
Q1: What could be the reason behind the subcategory totals exceeding the category value in c) and d)?
For c) and d) I was expecting something like DRAM Bound = Memory Bandwidth + Memory Latency.
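For completeness, the check behind a) - d) is just this arithmetic over the values shown in the GUI (the numbers are the same ones quoted above):

```python
# Parent vs. sum-of-children check over the values displayed in the GUI.
metrics = {
    "Back-End Bound": 62.0,
    "Memory Bound": 42.0, "Core Bound": 20.0,
    "L1 Bound": 8.0, "L2 Bound": 3.0, "L3 Bound": 2.0,
    "DRAM Bound": 20.0, "Store Bound": 6.0,
    "Memory Bandwidth": 28.0, "Memory Latency": 10.0,
    "Local DRAM": 97.0, "Remote DRAM": 2.0, "Remote Cache": 1.0,
}

def check(parent, children):
    total = sum(metrics[c] for c in children)
    print(f"{parent}: {metrics[parent]} vs sum(children) = {total}")

check("Back-End Bound", ["Memory Bound", "Core Bound"])                  # 62 vs 62
check("Memory Bound", ["L1 Bound", "L2 Bound", "L3 Bound",
                       "DRAM Bound", "Store Bound"])                     # 42 vs 39
check("DRAM Bound", ["Memory Bandwidth", "Memory Latency"])              # 20 vs 38
check("Memory Latency", ["Local DRAM", "Remote DRAM", "Remote Cache"])   # 10 vs 100
```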
Q2: On increasing the CPU frequency, I got the following from VTune for DRAM Memory Bandwidth:
1 GHz - 28% of clockticks
1.4 GHz - 37%
1.8 GHz - 42%
2 GHz - 42.5%
2.6 GHz - 42.8%
2.7 GHz - 42.9%
2.7 GHz + boost enabled - 41.7%
- The DRAM-related stall percentage stops increasing once the frequency exceeds 1.8 GHz, and I am looking for the reason behind this behaviour.
I expected that at higher frequencies the stalls would grow, since more CPU cycles / pipeline slots would be wasted due to data unavailability (a naive sketch of this expectation is below).
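To spell that expectation out, this is the naive model I had in mind - all the numbers in it are made up purely for illustration, not measured:

```python
# Naive model of the expectation: the compute part of the program needs a
# fixed number of core cycles, while each DRAM access costs a fixed number of
# nanoseconds, i.e. more core cycles at a higher frequency. All constants
# below are illustrative placeholders, not measurements.
COMPUTE_CYCLES = 1.0e9   # assumed non-memory work, in core cycles
DRAM_ACCESSES  = 1.0e7   # assumed number of demand misses that go to DRAM
LATENCY_NS     = 70.0    # assumed fixed DRAM latency in nanoseconds

for ghz in (1.0, 1.4, 1.8, 2.0, 2.6, 2.7):
    stall_cycles = DRAM_ACCESSES * LATENCY_NS * ghz   # ns -> core cycles
    total_cycles = COMPUTE_CYCLES + stall_cycles
    print(f"{ghz} GHz: DRAM stalls ~ {100 * stall_cycles / total_cycles:.1f}% of clockticks")
```

Under this model the DRAM-stall fraction keeps rising with frequency, which is not what the measured numbers above show beyond 1.8 GHz.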
I am focusing on the metrics highlighted in red. Since the cache-bound clockticks were almost constant (a 0.2-0.4% increase in each of L1, L2, L3 and Store Bound) across all the frequencies mentioned above, could I say that a larger cache will not help here? That seems contrary to what is mentioned here.
Q3: I noted that, on varying the frequencies, the Vector Capacity Usage (FPU) stays constant at around 70%, which, from the explanation here, means that 70% of my floating-point computations executed on the VPU units (the rest were scalar).
Also, here I can see that there are different types of execution units that can process 256-bit data. Is it possible to see a breakdown of the floating-point operations, e.g. how many used the 256-FP MUL unit, how many used the 256-FP Add unit, etc.?
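For context, one way I could cross-check the scalar vs. packed mix outside of VTune would be the FP_ARITH_INST_RETIRED hardware events, e.g. via perf (rough sketch below; "./my_app" is a placeholder) - but as far as I can tell those events only split by vector width and precision, not by ADD vs. MUL, which is what I am after:

```python
# Hypothetical cross-check of the scalar vs. packed FP mix with perf.
# The FP_ARITH_INST_RETIRED events are documented for Skylake/Cascade Lake,
# but they count by width/precision only, not by operation type.
import subprocess

EVENTS = ",".join([
    "fp_arith_inst_retired.scalar_single",
    "fp_arith_inst_retired.scalar_double",
    "fp_arith_inst_retired.128b_packed_single",
    "fp_arith_inst_retired.128b_packed_double",
    "fp_arith_inst_retired.256b_packed_single",
    "fp_arith_inst_retired.256b_packed_double",
    "fp_arith_inst_retired.512b_packed_single",
    "fp_arith_inst_retired.512b_packed_double",
])

# "./my_app" is a placeholder for the actual workload.
subprocess.run(["perf", "stat", "-e", EVENTS, "--", "./my_app"], check=True)
```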
Q4: Are the 256-FP Add / 256-FP MUL units and FMA different? If yes, then on which port does the front end dispatch the uOps for FMA? I cannot see an FMA unit in the block diagram.
Please let me know if more information is required from my end, or if any of the questions mentioned above are vague/unclear.