Hi everybody,
I'm having trouble understanding the VTune HPC Characterization analysis results for my CFD code. Because the code is large, I isolated one of its parallel regions and ran the same analysis on this small portion, looping over it to get enough wall time.
The small toy code follows (it does nothing useful, but its analysis closely resembles that of the full code):
#include <omp.h>
#include <iostream>
#include <stdio.h>

int main() {
    // Four arrays of 1e9 doubles each (about 8 GB per array); contents are left uninitialized.
    size_t size = 1000000000;
    double * __restrict__ fluxesDataPtr = new double[size];
    double * __restrict__ advFluxesDataPtr = new double[size];
    double * __restrict__ difFluxesDataPtr = new double[size];
    double * __restrict__ coefficientsDataPtr = new double[size];

    // Repeat the parallel region 10 times to get enough wall time for the analysis.
    for (int k = 0; k < 10; ++k) {
        #pragma omp parallel //num_threads(4)
        {
            //double wtime = omp_get_wtime();
            #pragma omp for
            for (size_t i = 0; i < size; ++i) {
                //std::cout << "thread "<< omp_get_thread_num() << ""<< i << std::endl;
                fluxesDataPtr[i] = coefficientsDataPtr[i] * (advFluxesDataPtr[i] - difFluxesDataPtr[i]);
            }
            //wtime = omp_get_wtime() - wtime;
            //printf( "Time taken by thread %d is %f\n", omp_get_thread_num(), wtime );
        }
    }

    // Print one result so the compiler cannot discard the loop.
    printf( "Useless print %f\n", fluxesDataPtr[0]);
}
Note also that when I measure wall time "by hand", the code scales almost perfectly from 1 to 4 threads.
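A hand measurement of this kind could look like the sketch below. It is only an illustration, not the exact timing code behind the numbers above. The helper names timeKernel and speedup are hypothetical, the arrays are much smaller and initialized with constant values (unlike the toy code), and std::chrono::steady_clock stands in for omp_get_wtime so that the sketch also builds without an OpenMP flag (in which case the pragma is ignored and the loop runs serially).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs the toy kernel `repeats` times on `threads`
// OpenMP threads and returns the elapsed wall time in seconds.
// `checksum` receives a value derived from the results so that the
// compiler cannot optimize the loop away.
double timeKernel(std::size_t size, int repeats, int threads, double& checksum) {
    assert(size > 0 && repeats > 0 && threads > 0);

    // Assumption: constant, initialized inputs (the toy code leaves them uninitialized).
    std::vector<double> fluxes(size, 0.0);
    std::vector<double> advFluxes(size, 2.0);
    std::vector<double> difFluxes(size, 1.0);
    std::vector<double> coefficients(size, 3.0);

    const auto start = std::chrono::steady_clock::now();
    for (int k = 0; k < repeats; ++k) {
        #pragma omp parallel for num_threads(threads)
        for (std::size_t i = 0; i < size; ++i) {
            fluxes[i] = coefficients[i] * (advFluxes[i] - difFluxes[i]);
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    checksum = fluxes[0] + fluxes[size - 1];
    return std::chrono::duration<double>(stop - start).count();
}

// Speedup of a parallel run relative to the single-thread run.
double speedup(double serialSeconds, double parallelSeconds) {
    return serialSeconds / parallelSeconds;
}
```

Calling timeKernel once with threads = 1 and once with threads = 4 on the same size, then passing both times to speedup, gives the scaling figure; a value close to 4 corresponds to the almost perfect scaling mentioned above.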
However, the VTune HPC Characterization analysis gives the results shown in the attached image.
Just to be clear, the analysis ran on 4 threads of a Haswell i7-4700HQ running Ubuntu 14.04 with kernel 4.4.0-137, and the code was compiled with icpc (ICC) 18.0.3 20180410 using the -qopenmp flag.
Several things look weird to me: Given that the code scales, why does the fourth thread do nothing? Why does no thread do anything after 1.5 s? What is the relationship between the elapsed time and the times in the CPU Time column? Why is there no Spin Time even though all threads are idle for most of the elapsed time?
Finally, please consider this small code as part of a much bigger one, and feel free to ask for any information you may need to better understand my problem.
Any help is greatly appreciated.
Thanks,
Marco