I am using VTune 2020u0 on an Intel 8280 platform. I ran an HPC Performance Characterization analysis and was looking at the Vectorization section, which shows:
Vectorization: 77.7% of Packed FP Operations
    Instruction Mix:
        SP FLOPs: 15.4%
            Packed: 79.8%
                128-bit: 0.0%
                256-bit: 0.1%
                512-bit: 79.8%
            Scalar: 20.2%
        DP FLOPs: 0.4%
        x87 FLOPs: 0.0%
        Non-FP: 84.2%
    FP Arith/Mem Rd Instr. Ratio: 0.462
    FP Arith/Mem Wr Instr. Ratio: 1.369
I checked for a detailed explanation here, but was unable to gain clarity, so I am asking my questions in this post.
From the report, it seems the code issued both packed and non-packed instructions and that, of all the packed FP instructions issued during execution, only 77.7% were vectorized, which (AFAIK) means those instructions used the AVX/AVX2/AVX-512 registers.
Could you please explain, or point me to an article that explains, the general reasons why packed instructions (in my case, 22.3% of them) would not be vectorized? And how would such packed instructions execute (using scalar registers)?
For example, _mm256_add_ps is a packed instruction, so could you help me understand how the add operation could be non-vectorized in the following context:
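#include <immintrin.h>

float f[8] = {1.0, 2.0, 1.2, 2.1, 5.2, 5.3, 10.1, 11.0};
__m256 v = _mm256_loadu_ps(&f[0]);  /* unaligned load: f is not guaranteed to be 32-byte aligned */
v = _mm256_add_ps(v, v);            /* packed single-precision add across all 8 lanes */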
The code above is unrelated to the code I profiled.