Channel: Intel® VTune™ Profiler (Intel® VTune™ Amplifier)
Viewing all 1347 articles
Browse latest View live

Estimating elapsed time for a VTune analysis (knob sampling-interval)


Hi,

I ran an HPC Performance analysis (VTune 2020u0) on an Intel 8280 (RHEL 7.6) with default settings as:

time mpirun -np $SLURM_NPROCS -ppn $SLURM_NTASKS_PER_NODE  $OPTS  amplxe-cl -collect hpc-performance -data-limit 0 -result-dir result_hpcperf -- ${APP_INSTALL_ROOT}/appname.exe

the analysis part

vtune: Executing actions  0 %
........
vtune: Executing actions 100 % done

took around 45 minutes, and the "result_hpcperf.nodeXX" directory had around 20 GB of data.
 

Q1: If my Linux kernel version is 3.10.0-957.el7.x86_64, what will be the default sampling interval?

Q2: If I reduce the sampling interval for an analysis by half, roughly how much elapsed time and output data should I expect for the VTune analysis + report-generation part?

     - I was expecting that if the sampling interval is halved (default 1 ms -> 0.5 ms), the analysis and result generation would take around 90 minutes and produce around 40-50 GB of data. Please let me know if my assumptions are incorrect.
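For what it's worth, here is the back-of-envelope model behind my expectation (my own sketch, not an official VTune estimate; the 20 GB / 45 min baseline numbers are just the ones from my first run): halving the interval roughly doubles the sample count, so data volume, and finalization time if it scales linearly, would roughly double too.

```shell
# Rough scaling model: samples ~ runtime / interval, so data volume and
# (optimistically) finalization time scale with 1/interval. Baseline
# numbers are taken from my 1 ms run above.
awk 'BEGIN {
  base_interval = 1.0; new_interval = 0.5          # ms
  base_data_gb  = 20;  base_actions_min = 45       # observed at 1 ms
  scale = base_interval / new_interval             # 2x the samples
  printf "expected data: ~%d GB\n", base_data_gb * scale
  printf "expected \"Executing actions\": ~%d min (if linear)\n", base_actions_min * scale
}'
```

In practice finalization may scale worse than linearly with sample count, which is why I am asking rather than trusting this model.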

Q3: Also, if I reduce the sampling interval for an analysis by half, then (in general, based on your observations with this tool) how much of a change in the accuracy of the output metrics can I expect?

 

As per this article (the "CPU sampling interval, ms" field), I assumed the default sampling interval should be 1 ms, and I reran the HPC Performance analysis with sampling-interval set to 0.5 ms as:

time mpirun -np $SLURM_NPROCS -ppn $SLURM_NTASKS_PER_NODE  $OPTS  amplxe-cl -collect hpc-performance -data-limit 0 -result-dir result_hpcperf -knob sampling-interval=0.5  -- ${APP_INSTALL_ROOT}/appname.exe

the last statement to appear in the stdout was -

vtune: Executing actions  0 %

and around 11 hours have elapsed since then, and around 150 GB of data has been generated in the results directory.
 

Within the results directory (find . -printf "%T+\t%p\n" | sort) I saw that the last file was changed around 11 hours ago; that file has the following contents:

[user@headnode01 hpcperf_char_00003]$ cat result_hpcperf.node3/config/log.cfg
<?xml version='1.0' encoding='UTF-8'?>

<bag xmlns:int="http://www.w3.org/2001/XMLSchema#int" xmlns:long="http://www.w3.org/2001/XMLSchema#long">
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803953480"/>
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803953542"/>
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803953687"/>
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803953748"/>
 <message_entry_t int:status="2" cap="Data collection completed successfully" msg="" long:timeStamp="1586803954281"/>
 <message_entry_t int:status="1" cap="Data collection completed with warnings" msg="Please see warning messages for details. " long:timeStamp="1586809230671">
  <message msg="Analyzing data in the node-wide mode. The hostname (node61) will be added to the result path/name." int:severity="1"/>
  <message msg="Peak bandwidth measurement started." int:severity="1"/>
  <message msg="Peak bandwidth measurement finished." int:severity="1"/>
  <message msg="To enable hardware event-base sampling, VTune Profiler has disabled the NMI watchdog timer. The watchdog timer will be re-enabled after collection completes." int:severity="2"/>
  <message msg="Collection started." int:severity="1"/>
  <message msg="Collection stopped." int:severity="1"/>
 </message_entry_t>
</bag>

 

also, on the compute node (node3) i checked the running processes via top command - 

 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
127588 root      20   0 4128520  82480   3308 R 100.0  0.0 563:13.50 sep
    10 root      20   0       0      0      0 S   6.2  0.0   0:22.52 rcu_sched
     1 root      20   0   56068   8276   2620 S   0.0  0.0   0:26.51 systemd

 

Here also, it seems that the sep collector (driver process) has been running for ~9 hours with negligible memory utilization. I am not sure whether the application / sep driver is running fine. Is there a way to confirm (via system logs / sep driver logs) that the application is still running correctly?
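One crude liveness check I can think of (my own sketch, not an official VTune mechanism; the result-directory path you would pass in is the one from your run): sample the result directory's size twice and see whether the collector is still writing to it.

```shell
# Prints "still collecting" if the directory grew during the interval.
# In practice call it as: still_collecting result_hpcperf.node3 60
still_collecting() {
  dir="$1"; interval="${2:-60}"
  s1=$(du -s "$dir" | awk '{print $1}')
  sleep "$interval"
  s2=$(du -s "$dir" | awk '{print $1}')
  if [ "$s2" -gt "$s1" ]; then
    echo "still collecting"
  else
    echo "no growth in ${interval}s"
  fi
}

# Self-contained demo on a throwaway directory that grows in the background:
demo=$(mktemp -d)
( sleep 1; dd if=/dev/zero of="$demo/new.dat" bs=1k count=64 2>/dev/null ) &
still_collecting "$demo" 2
rm -rf "$demo"
```

Checking `dmesg | grep -i sep` on the compute node might also surface driver messages, though I'm not certain what the sep driver logs in this setup.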

It would be very helpful if I could get an estimate of the time this analysis will take to finish in my scenario.

       - Asking because I will adjust the "walltime" for my VTune jobs on my cluster accordingly.

Please let me know if I can provide more information from my end to help answer my queries.

 


Packed non-vectorized FP operations


I am using VTune 2020u0 on an Intel 8280 platform. I carried out an HPC characterization analysis and was looking at the heading of the Vectorization section, which shows:

Vectorization:	77.7% of Packed FP Operations
    Instruction Mix:	
    SP FLOPs:	15.4%
    Packed:	79.8%
    128-bit:	0.0%
    256-bit:	0.1%
    512-bit:	79.8%
    Scalar:	20.2%
    DP FLOPs:	0.4%
    x87 FLOPs:	0.0%
    Non-FP:	84.2%
    FP Arith/Mem Rd Instr. Ratio:	0.462
    FP Arith/Mem Wr Instr. Ratio:	1.369

I checked for a detailed explanation here, but was unable to gain clarity, so I am asking my queries here.
From the report it seems the code issued packed + non-packed instructions and, out of all the packed FP instructions issued during code execution, only 77.7% were vectorized, which (AFAIK) means those instructions made use of the AVX/AVX2/AVX-512 registers.

Could you please explain, or refer me to an article that explains, the (general) reasons for non-vectorization of packed instructions (in my case, 22.3% of packed instructions)? And how would these packed instructions execute (using scalar registers?)?
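One plausible reading of the summary numbers (my assumption, not something the report states explicitly): the 77.7% headline is the packed share of all FP operations, i.e. a weighted average of the SP and DP packed fractions, rather than a property of individual packed instructions. Taking the DP packed fraction as ~0% (a guess, since DP FLOPs are only 0.4%), the arithmetic reproduces the headline within rounding:

```shell
# Weighted packed fraction over SP + DP FLOPs, using the report's numbers.
awk 'BEGIN {
  sp = 0.154; dp = 0.004       # SP / DP FLOP shares of all instructions
  sp_packed = 0.798            # "Packed: 79.8%" under SP FLOPs
  dp_packed = 0.0              # assumed: ~0% packed for DP
  vec = (sp * sp_packed + dp * dp_packed) / (sp + dp)
  printf "packed share of FP ops: %.1f%%\n", vec * 100
}'
```

Getting 77.8% versus the reported 77.7% is consistent with the inputs being rounded to one decimal place.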

For example, _mm256_add_ps is a packed intrinsic, so could you help me understand how the add operation could be non-vectorized in the following context:

#include <immintrin.h>

alignas(32) float f[8] = {1.0f, 2.0f, 1.2f, 2.1f, 5.2f, 5.3f, 10.1f, 11.0f};
__m256 v = _mm256_load_ps(&f[0]);  // aligned load: the array must be 32-byte aligned
v = _mm256_add_ps(v, v);

The aforementioned code is not related to the code which I have profiled.

collecting gpu-hotspots crashes my application


Hi,

My application crashes when I try to run the gpu-hotspots collection, whereas if I just collect hotspots then it works just fine.

The VTune self-check script ran just fine. I am attaching the log file.

 

 

Attachment: log.txt (62.29 KB)

Can't view the source code for functions


Hello!

I am trying to profile a C++ application with OpenMP using the Intel VTune Profiler. Following the Intel tutorial (https://software.intel.com/en-us/download/tutorial-finding-hotspots-c-sa...), I ran a Hotspots analysis and found the most time-consuming functions. In the tutorial, the functions' sources are shown in the Call Stack pane (see the "guide" picture). However, in my case the Call Stack pane shows [unknown source file] for almost all of the functions, so I can't find their sources (see the "reality" picture).

Please tell me, what should I do to find the sources of these functions?
Many thanks! :)

P.S. I compiled this application (q.exe) using NetBeans with MinGW. I have tried running the Hotspots analysis in both Debug and Release modes, but still have no success.

Attachments: guide.png (135.12 KB), reality.png (171.42 KB)

Unable to load .pdb files after profiling using VS2019


Hey,

After I profile my C++ application, VTune just sits trying to load the .pdb for one of my libraries. It never moves past this .pdb, and the .pdb it gets stuck on is different each time I run. This application can be profiled without issue when compiled with VS2015 using the same VTune version. I've tried compiling with both /Zi and /Z7, but with the same results.

Behavior is identical whether I use the standalone VTune or the VS2019-integrated one. Is there anything I can do to help diagnose this?

Thanks

Operating system and version: Windows 10

Tool version: VTune 2020, Update 1 (build 607630)

Compiler version: MSVC 2019, 16.5.4

GNU Compiler Collection (GCC)* or Microsoft Visual Studio* version (if applicable): MSVC 2019, 16.5.4, toolset 14.25.28610

Problems importing data gathered by command line from AWS host


Hi,

I've set up the VTune profiler on an AWS Linux installation where we are running processes in Docker. I don't have direct root login access on the host, so I need to run VTune with sudo directly on the host (root is necessary to allow VTune to delve into the Java process running in Docker).

After the run, I copied the results directory to a local Linux VM which does have GUI access, and followed the guidance on this page:

https://software.intel.com/en-us/vtune-help-importing-results-to-gui

When I try to import without ticking 'Import via a link instead of result copy', I get the following failure message:

"cannot import the result because the current project already has a result with the name"

That is the complete message.

I then tried importing with 'Import via a link...'; this time I just get a spinning 'progress' image, and it never returns.

Can anyone advise on what is going wrong here?

Many thanks

 

Silent CLI only install still does GUI checks


Hi,

My current project involves profiling against AWS instances that I spin up specifically for profiling. The AWS instances get erased at the end of the session. I'm therefore installing the VTune profiler quite frequently (on Linux).

To save a little time each time I bring up an instance, I'd hoped to use the silent install; so I ran through my 'normal' install using:

install.sh -d vtune_install.conf

In this installation, I customised it to remove the GUI components. The components recorded in `vtune_install.conf` were:
 

COMPONENTS=;intel-vtune-profiler-2020-cli-common__noarch;intel-vtune-profiler-2020-common__noarch;intel-vtune-profiler-2020-cli__x86_64;intel-vtune-profiler-2020-cli-32bit__i486;intel-vtune-profiler-2020-collector-32linux__i486;intel-vtune-profiler-2020-collector-64linux__x86_64;intel-vtune-profiler-2020-doc__noarch;intel-vtune-profiler-2020-sep__noarch;intel-vtune-profiler-2020-target__noarch;intel-vtune-profiler-2020-vpp-server__x86_64;intel-vtune-profiler-2020-common-pset

When I run

install.sh --silent vtune_install.conf

I get an error:

 

Missing critical prerequisite
-- ALSA library is not found. 'Graphical user interface' compenent(s) cannot be installed
...

with a bit more info about ALSA, and then a complaint about X11 not being present, followed by a suggestion to deselect the 'Graphical user interface' component(s). As I mentioned, I had already deselected these in the manual install which I'd previously done to generate the config for the silent install, and the selected components that I listed above from the config also don't seem to include GUI components.

Is it actually possible to do a silent install without the GUI and if so, how can this be done?

Many thanks,

Dominic

No 32bit target for remote Linux


Hi, I'm trying to run VTune remotely on a 32-bit Linux target, but there's no vtune_profiler_target_x86.tgz in the target directory (only an x86_64 one).

 

Is this unsupported now?


System Profiling of an AWS host with app running in Docker - Outside any known module


Hi,

The system I'm currently profiling is a Linux AWS instance; it's a Scala app (so running on the JVM) running in a Docker container. I profile using a command line like:

 

vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -finalization-mode=full

The majority of the recorded time is listed as 'Outside any known module', and it covers the code that I'm mainly interested in profiling. I had initially suspected this was a result of the code being built without debug symbols; however, I built a new version with both the Java and Scala compilers set to record full debug symbols, and this didn't improve things.

Can you advise on what I'm missing that could enable me to get full stack info via VTune? I could remove the application from the Docker container if that would help, but this is slightly more complicated than it sounds in a prod-like environment, so I'd prefer to avoid it at this stage unless I know it has a good chance of success.

Many thanks!

Dominic

Profiler Drivers missing


Brand new to VTune, on a CentOS Linux machine. "uname -r" gives "3.10.0-327.el7.x86_64".

When I installed vtune, it said

The install program cannot detect the kernel source directory for OS kernel
version 3.10.0-327.el7.x86_64. If kernel sources are already installed to custom
directory, set up this parameter in Advanced Options -> Driver Build Options
dialog.
To install kernel headers, execute one of the following commands specific to
your operating system:
- CentOS / Red Hat Enterprise Linux
   - On a system with the default kernel, install the kernel-devel package:
     sudo yum install kernel-devel-3.10.0-327.el7.x86_64
   - On a system with the PAE kernel, install the kernel-PAE package:
     sudo yum install kernel-PAE-devel-3.10.0-327.el7.x86_64
 

When I try to run that final "yum" line, I get:

No package kernel-PAE-devel-3.10.0-327.el7.x86_64 available.

When I either run vtune-self-checker.sh or try to set up a Microarchitecture Exploration inside the VTune GUI,
they tell me: "vtune: Error: This analysis type requires either an access to system-wide monitoring in the Linux perf subsystem or installation of the VTune Profiler drivers (see the "Sampling Drivers" help topic for further details)."

How do I install the VTune Profiler drivers?

Intel VTune Profiler Comparison with other Profiler in Market


Dear Friends,

I have never used the VTune profiler for instrumenting C++.

Can someone tell me why Intel VTune is better than other profilers like Valgrind, perf, and gprof? In terms of slowdown and the time taken for profiling, how does Intel compare to the others?

I want to focus on measuring application cache misses and cache performance.

Also, I don't want to install the Intel VTune GUI on the development server, so can I load the output file of VTune created via the command line on my local machine using the VTune GUI?

 

Thanks

Himanshu

Export data from timeline


I need the data (memory bandwidth, CPU analysis, etc.) as a time series, exported into a readable format such as a .csv file.

How can I do that?

Intel VTune Profiler Installation


Hi,

 

I would like to install the Intel VTune profiler on my CentOS 7 machine. AFAIK, the latest version does not work on CentOS 6, but CentOS 7 is fine, isn't it?

Please help me with extracting the installation package to a writable directory with the following command:

tar -xzf vtune_profiler_<version>.tar.gz

I am struggling with the version part. Which version should I put in? I mean, what should the complete command be?
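In case it helps anyone with the same question: `<version>` is just the literal version string in the file name you downloaded, so a glob avoids typing it. Below is a self-contained demo (the 2020.1.0 name is made up, a stand-in for whatever package you actually have):

```shell
# Build a stand-in tarball so the extraction step can be demonstrated,
# then extract with a glob so the exact version string never matters.
workdir=$(mktemp -d) && cd "$workdir"
mkdir vtune_profiler_2020.1.0
touch vtune_profiler_2020.1.0/install.sh
tar -czf vtune_profiler_2020.1.0.tar.gz vtune_profiler_2020.1.0
rm -r vtune_profiler_2020.1.0

tar -xzf vtune_profiler_*.tar.gz     # same shape as: tar -xzf vtune_profiler_<version>.tar.gz
ls vtune_profiler_*/install.sh
```

With the real package in your download directory, `tar -xzf vtune_profiler_*.tar.gz` extracts it regardless of the version string, as long as only one such tarball is present.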

 

Thanks in advance :-)

 

BR
Bobby ! 

My scenario and Intel VTune Profiler.


I am working on Ceph (https://en.wikipedia.org/wiki/Ceph_(software)). It is an open-source software storage platform.

- Through git I have cloned Ceph in my home folder.
- Built its dependencies
- And compiled Ceph in debug mode.

To work as a developer on Ceph, a utility script, vstart.sh, allows you to deploy a fake local cluster for development purposes.

As it deploys a cluster, as a developer you can write READ and WRITE tests against the deployed cluster, hence the client programs (client test codes). These READ and WRITE tests are compiled using GCC in the build folder; once compiled, you get executables. Some of these test codes are in C and some in C++.

Here I would like to bring the Intel VTune Profiler into my workflow. I would like to profile Ceph through my READ and WRITE test codes: profiling of function calls, loops, etc.

And I am using a single virtual machine (Linux CentOS 7). Ceph is written mostly in C++.

My questions:

- Does the Intel VTune Profiler fit my scenario?

- If yes, given my scenario, where exactly should the Intel VTune Profiler be installed?

- The executables of the READ and WRITE test codes are in the build folder, i.e. /home/user/ceph/build. How can I launch the VTune profiler in this case?

- Does the Intel VTune Profiler support C executables?

 

Looking forward to your help.

Vtune Data microarchitecture analysis metrics vs varying CPU frequency


Hi,
I ran a code at different frequencies and collected VTune data (on an 8280 processor, RHEL 7) using the Microarchitecture Exploration analysis. I understand that VTune (v2020) can be used to identify the portions of code which underutilize the given hardware resources of a processor. I did this experiment to see how the application responds to variation of a particular hardware component, or which hardware component limits the scaling of this application (e.g. memory frequency, CPU frequency, etc.).

So, I gathered the data at various frequencies (acpi-cpufreq) and followed the metric-breakdown trail of the numbers shown in red in the VTune GUI:
  1: Back-End Bound --> 2: (Memory Bound, Core Bound) --> 3: DRAM Bound --> 4: (Memory Bandwidth, Memory Latency) --> 5: Local DRAM.

I noticed that:
a) Back-End Bound = Memory Bound + Core Bound, e.g. (62% of clockticks = 42% + 20%)
b) Memory Bound ~= L1 Bound + L2 Bound + L3 Bound + DRAM Bound + Store Bound (42 ~= 8% + 3% + 2% + 20% + 6%)
c) DRAM Bound < Memory Bandwidth + Memory Latency (20 < 28 + 10)
d) Memory Latency << Local DRAM + Remote DRAM + Remote Cache (10 << 97 + 2 + 1)
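To make the discrepancy concrete, the sums can be checked mechanically (numbers copied from my observations above):

```shell
# Reproduce the additive checks: (a) closes, (b) nearly closes, (c) does not.
awk 'BEGIN {
  printf "a: 42 + 20 = %d  (reported Back-End Bound: 62)\n", 42 + 20
  printf "b: 8 + 3 + 2 + 20 + 6 = %d  (reported Memory Bound: ~42)\n", 8 + 3 + 2 + 20 + 6
  printf "c: 28 + 10 = %d  (reported DRAM Bound: 20)\n", 28 + 10
}'
```

My current understanding (an assumption, worth confirming): the Memory Bandwidth and Memory Latency sub-metrics are each reported as the fraction of all clockticks during which that condition held, and the conditions can overlap, so they need not sum to the parent DRAM Bound value.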

Q1: What could be the reason behind the subcategory total exceeding the category value in (c) and (d)?
      For (c) and (d) I was expecting something like DRAM Bound = Memory Bandwidth + Memory Latency.

Q2: On increasing the CPU frequency, I got the following from VTune for DRAM Memory Bandwidth:
    1.0 GHz - 28   % of clockticks
    1.4 GHz - 37   %
    1.8 GHz - 42   %
    2.0 GHz - 42.5 %
    2.6 GHz - 42.8 %
    2.7 GHz - 42.9 %
    2.7 GHz + boost enabled - 41.7 %
- The number of CPU stalls (for DRAM) does not keep increasing once the frequency exceeds 1.8 GHz, and I am looking for the reason behind this behaviour.
I expected that at higher frequencies, stalls would grow as more CPU cycles / pipeline slots are wasted due to data unavailability.
I am focusing on the metrics highlighted in red. As the cache-bound clock cycles were almost constant (a 0.2-0.4% increase in each of L1, L2, L3, Store) for all the frequencies mentioned above, could I say that a larger cache will not help here? That would be contrary to what is mentioned here.

 

Q3: I noted that on varying the frequencies, the Vector Capacity Usage (FPU) stays constant at around 70%, which from the explanation here means that 70% of my floating-point computations executed on VPU units (the rest were scalar).
Also, here I can see that there are different types of execution units which can process 256-bit data. Is it possible to see a breakdown of the floating-point operations, e.g. how many used 256-bit FP MUL, how many used 256-bit FP ADD, etc.?

 

Q4: Are 256-bit FP ADD / 256-bit FP MUL and FMA different? If yes, on which port does the front-end unit dispatch the uOPs for FMA? I can't see an FMA unit in the block diagram.

Please let me know if more information is required from my end or if any of the questions mentioned above are vague or unclear.


2 problems with Threading analysis


Hello!

I am trying to profile a C++ application with OpenMP using Intel Vtune Profiler and I have 2 troubles with Threading analysis.
1) If I run the application for a short time (there are different modes and options in this application, so I can vary the execution time), I reach the 1000 MB collected-data limit in a few minutes.
2) Even if I run the application for a short time (so the data limit isn't reached), finalization of the results freezes after data collection (see the "freeze" picture).

Please tell me, what should I do to solve these problems?

Many thanks! :)

P.S. For example, the Hotspots analysis runs relatively correctly. I have Windows 10 and NetBeans with the MinGW compiler.

UPD: I've accidentally discovered options for the application with which the Threading analysis runs correctly (these options correspond to the most effective and shortest execution). But the problems I've mentioned above are still interesting to me, because I want to run the Threading analysis with different options.

Attachment: freeze.png (17.94 KB)

Running finalize on different host from profiling


Hi.  I need to run the code to be profiled on a special set of machines, but because they are a limited resource, I would like to do the finalization on a different machine. All of the machines have essentially the same setup (same OS, same file systems mounted, same libraries). However, when I try to finalize on a different host, VTune detects that the host is different and doesn't automatically find the libraries and debug information, even though all of the information is in exactly the same place. The application is big and complicated, with hundreds of libraries and code spread over a large directory tree; it doesn't seem like I can just specify a top-level directory and have the tool find everything. Is there a way I can tell VTune to behave as if I were running finalize on the same host?

Thanks,

Ben

Speeding up finalize


Hi.  I'm wondering if there are any tricks for speeding up finalization. My jobs usually run for 4-6 hours, of which maybe one or two hours have profiling enabled. However, finalization can take 2-4 days. I've tried limiting the sampling rate and the total data stored, but even then it usually takes 10x longer to finalize than to profile (I have a sense that this didn't use to be the case when I last used VTune a few years ago; perhaps I was using an older version?). I'm currently using VTune 2019. If it will make a big difference I could try to get it upgraded, but the tools are managed centrally, so that's not always easy. I'm hoping there are some things I can do to bring the finalization time down without losing too much profiling coverage.

Thanks,

Ben

 

Interpretation of profiling results


Hello!

I am trying to profile a C++ application with OpenMP using the Intel VTune Profiler, and I've run the Hotspots, Threading user-mode, and hardware-based analyses (see the "hotspots", "user" and "hardware" pictures, plus the "threads" picture from the hardware analysis).

I have several questions about the results of these analyses, and I ask for your help.

1) What do these results generally mean? If I'm not mistaken, the Hotspots analysis revealed that most of the time was spent usefully, and then the Threading analyses show the opposite.
2) What is the Semaphore object in the Threading user-mode analysis?
3) Why does one thread carry such a heavy load (see the "threads" picture)? Most of the work is done in the parallel region.

What should I do to increase parallelism of this application?

I've read the documentation: https://software.intel.com/en-us/vtune-help-windows-targets but still can't understand what's happening in my case.

The algorithm of the application is simple:

#pragma omp parallel num_threads(8)
{
    if (myID == 0) {
        // master thread job
    }
    #pragma omp for schedule(static)
    for (int i = 0; i < N; ++i) {
        // parallel cycle
    }
    if (myID == 0) {
        // master thread job
    }
}

Many thanks! :)

P.S. I have Windows 10 and NetBeans with MinGW compiler

cache hit/miss rate calculation - cascadelake platform


Hi,
I ran the Microarchitecture Exploration analysis on an 8280 processor, and I am looking for cache-utilization metrics such as the L1, L2 and L3 hit/miss rates (total L1 misses / total L1 requests, ..., total L3 misses / total L3 requests) for the overall application. I was unable to see these on the VTune GUI summary page, and from this article it seems I may have to compute them using a "custom profile".
From the explanation here (for Sandy Bridge), it seems we have the following for calculating cache hit/miss rates for demand requests:

Demand Data L1 Miss Rate => cannot calculate.

Demand Data L2 Miss Rate =>
(sum of all types of L2 demand data misses) / (sum of L2 demanded data requests) =>
(MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / (L2_RQSTS.ALL_DEMAND_DATA_RD)

Demand Data L3 Miss Rate =>
L3 demand data misses / (sum of all types of demand data L3 requests) =>
MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS / (MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS)
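To show what I mean, plugging hypothetical event counts into the two formulas above (my sketch; the counts are made up, and you would substitute the "Hardware Event Counts" from an actual result):

```shell
awk 'BEGIN {
  # Hypothetical event counts:
  llc_hit   = 4.0e8   # MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS
  xsnp_hit  = 1.0e7   # MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS
  xsnp_hitm = 5.0e6   # MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS
  llc_miss  = 2.5e7   # MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS
  l2_rd     = 9.0e8   # L2_RQSTS.ALL_DEMAND_DATA_RD
  l2_misses = llc_hit + xsnp_hit + xsnp_hitm + llc_miss
  printf "demand data L2 miss rate: %.3f\n", l2_misses / l2_rd
  printf "demand data L3 miss rate: %.3f\n", llc_miss / l2_misses
}'
```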

Q1: As that post was for Sandy Bridge and I am using Cascade Lake, I wanted to ask whether there is any change in the formulas above for the newer platform, and whether there are events which have changed or been added on the newer platform which could help calculate:
- the L1 demand data hit/miss rate
- the L1/L2/L3 prefetch and instruction hit/miss rates
Also, in this post here, the events mentioned for getting the cache hit rates do not include the ones mentioned above (e.g. MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS):

amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.REF_TSC,MEM_LOAD_UOPS_RETIRED.L1_HIT_PS,MEM_LOAD_UOPS_RETIRED.L1_MISS_PS,MEM_LOAD_UOPS_RETIRED.L3_HIT_PS,MEM_LOAD_UOPS_RETIRED.L3_MISS_PS,MEM_UOPS_RETIRED.ALL_LOADS_PS,MEM_UOPS_RETIRED.ALL_STORES_PS,MEM_LOAD_UOPS_RETIRED.L2_HIT_PS:sa=100003,MEM_LOAD_UOPS_RETIRED.L2_MISS_PS -knob collectMemBandwidth=true -knob dram-bandwidth-limits=true -knob collectMemObjects=true

 

Q2: What would be the formula to calculate cache hit/miss rates with the aforementioned events?

Q3: Is it possible to get a few of these metrics (like MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS, ...) from the raw data of the uarch-exploration analysis which I already ran via:

mpirun -np 56 -ppn 56 amplxe-cl -collect uarch-exploration -data-limit 0 -result-dir result_uarchexpl -- $PWD/app.exe

So, would the following be the correct way to run the custom analysis via the command line?

mpirun -np 56 -ppn 56 amplxe-cl -collect-with runsa -data-limit 0 -result-dir result_cacheexpl -knob event-config=MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS,MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS,MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS,L2_RQSTS.ALL_DEMAND_DATA_RD,MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS,CPU_CLK_UNHALTED.REF_TSC,MEM_LOAD_UOPS_RETIRED.L1_HIT_PS,MEM_LOAD_UOPS_RETIRED.L1_MISS_PS,MEM_LOAD_UOPS_RETIRED.L3_HIT_PS,MEM_LOAD_UOPS_RETIRED.L3_MISS_PS,MEM_UOPS_RETIRED.ALL_LOADS_PS,MEM_UOPS_RETIRED.ALL_STORES_PS,MEM_LOAD_UOPS_RETIRED.L2_HIT_PS:sa=100003,MEM_LOAD_UOPS_RETIRED.L2_MISS_PS  -- $PWD/app.exe

(Please let me know if I need to use more or different events for the cache hit calculations.)

Q4: I noted that to calculate the cache miss rates, I need to get/view the data as "Hardware Event Counts", not as "Hardware Event Sample Counts" (https://software.intel.com/en-us/forums/vtune/topic/280087). How do I ensure this via the VTune command line, given that I generate the summary via:

vtune -report summary -report-knob show-issues=false -r <my_result_dir>.

Let me know if I need to use a different command line to generate results/event values for the custom analysis type.

 


