Quantcast
Channel: Intel® VTune™ Profiler (Intel® VTune™ Amplifier)
Viewing all articles
Browse latest Browse all 1347

How relevant are the line-by-line timings of a hotspot analysis

$
0
0

Not 100% sure if this is the correct sub-forum, please be gentle...

I am currently trying to optimise a code for execution time. Hence the use of VTune Amplifier. I have some doubts/questions about the accuracy/relevance of the "cpu time" column. Here is a result from my code:

-See Attachments-

For example, I can not quite understand why line 34 "f_temp(11) = f_1(11,links(14,n))" takes so much longer than all of the similar lines.

So I tried swapping the order of lines, putting this one first. Surprisingly, now the line that took it's place as the 12th line of this block took equally long to execute, according to VTune amplifier.

Is this a real effect or am I falling prey to some kind of measurement error/aliasing effect?

For this analysis I used a single (pinned) thread on a system with 2x Xeon 2687W v3. The clock speed was locked to 3GHz. Compiler options were -O3 -ip -ipo -xHost -fopenmp -g. Ifort version is 18.0.03 on a Linux operating system. Advanced hotspot analysis with 1ms sampling interval.

 

And a follow-up: provided the line-by-line results are anything to go by, how would I optimise the code further for execution time?

I could imagine that the indirect memory access here causes a lot of LLC misses and makes the code memory (latency) bound. Additional analyses using VTune Amplifier seem to confirm this. But as far as I can tell there is not much I could do about it. The nodes are already reordered using a space-filling curve to increase data locality. And then why would only some of those lines with indirect memory access take up most of the time. Anything I am missing here? The problem size used in this example does not fit into cache, which will be the use-case later.

Another possibility -judging by publications in this area- could be streaming stores. This might be able to reduce the time taken by the last line of the code shown "f_2(i,n) = f_temp(i) - omega * (f_temp(i)-feq)". However, I have no idea how I would implement this. All the samples I could find used different languages.


Viewing all articles
Browse latest Browse all 1347

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>