Greetings,
TL;DR: Across two generations of processors, supposedly identical systems differ drastically in performance, and I am trying to use VTune to figure out why.
Full story.
A user recently complained that the new cluster was slower than the old cluster. This didn't surprise me too much, as we got a really good deal on the procs for the new cluster and chose quantity for the parallel applications over a small number of "faster" processors. The old cluster was a hodge-podge of nodes built from whatever the fastest proc we could afford at the time was. That caused a lot of problems for our parallel users, so we wanted to stay close to a uniform cluster this time. I did a lot of testing with a good sample of our apps, and the difference between the fastest nodes on the old cluster and the nodes on the new one was trivial.
The "fast" nodes on the old cluster are the Xeon Westmere X5687
http://ark.intel.com/products/52578/Intel-Xeon-Processor-X5687-12M-Cache...
The nodes on the new cluster are the Xeon Sandy Bridge E5-2670
http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cac...
When he complained that it was four times slower, that bothered me. I ended up running a series of tests across multiple systems using the same binary and verified that:
* The user just had some crazy bad luck when he initially did his benchmarking. However, he was right in finding that there is a problem.
* One X5687 processor averages 3 min 30 sec over many different runs, including one after a fresh reboot. It consistently finishes between 3:20 and 3:45.
* A second X5687 processor that *should* be identical consistently runs just over 9 minutes.
* One E5-2670 processor consistently averages 7 minutes.
* A second E5-2670 processor consistently averages 15 minutes (hence the "4x slower" report from the user).
* Other E5-2670 processors show a wide range of run times, with the average sitting closer to 9 minutes.
* I have a Xeon X5675 that is my "If I break it no one cares" test system which I can beat up with VTune and testing. It consistently runs just under 4 minutes.
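To put the numbers above on a common scale, here is a minimal sketch that divides each run time by the fastest one. The values are the approximate averages from the list, with "just over 9 minutes" rounded to 9:00 and "just under 4 minutes" rounded to 4:00; the labels such as "fast" and "slow" are only names chosen for this illustration.

```python
# Approximate average run times from the list above, in seconds.
# Rounded values are assumptions: "just over 9 min" -> 9:00, "just under 4 min" -> 4:00.
run_times = {
    "X5687 (fast)": 3 * 60 + 30,
    "X5687 (slow)": 9 * 60,
    "E5-2670 (fast)": 7 * 60,
    "E5-2670 (slow)": 15 * 60,
    "E5-2670 (typical)": 9 * 60,
    "X5675 (test box)": 4 * 60,
}


def slowdown_table(times):
    """Return each system's run time divided by the fastest run time."""
    fastest = min(times.values())
    return {name: seconds / fastest for name, seconds in times.items()}


ratios = slowdown_table(run_times)

# Print the systems from fastest to slowest with their relative slowdown.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    minutes = run_times[name] / 60
    print(f"{name:20s} {minutes:5.1f} min  {ratio:4.2f}x")
```

With these rounded inputs, the slowest E5-2670 comes out at roughly 4.3 times the fastest X5687, which matches the user's "four times slower" report.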
My theories for the discrepancies between the processors are:
* Possibly some sort of cache/memory alignment problem?
* There is only one random number generated at the very beginning. The code should be pretty uniform after that point and the tests were all with the same binary. Maybe I need to compile with ifort to target specific architectures? Compiling a binary for each processor family did not seem to make much of a difference, but maybe there are other flags I should try.
However, those points would explain the difference between processor types, not the differences between processors of the same type.
Maybe it is something as simple as a CPU feature (like a virtualization flag) that is enabled on one host and not the other. However, I can't seem to find such a difference.
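One way to check for a feature-flag difference systematically is to compare the "flags" line of /proc/cpuinfo from two hosts. The sketch below assumes Linux hosts; the helper names `parse_flags` and `diff_flags` and the sample text are hypothetical and made up for illustration. Note that this only shows flags the kernel reports, so it would not catch BIOS settings that are not exposed as CPU flags.

```python
def parse_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found


def diff_flags(text_a, text_b):
    """Return (flags only on host A, flags only on host B) as sorted lists."""
    flags_a = parse_flags(text_a)
    flags_b = parse_flags(text_b)
    return sorted(flags_a - flags_b), sorted(flags_b - flags_a)


# Made-up sample input; real text would come from /proc/cpuinfo on each host.
host_a = "model name\t: sample CPU\nflags\t\t: fpu sse4_2 vmx\n"
host_b = "model name\t: sample CPU\nflags\t\t: fpu sse4_2\n"

only_a, only_b = diff_flags(host_a, host_b)
print("only on host A:", only_a)
print("only on host B:", only_b)
```

In practice, the two strings would be read from copies of /proc/cpuinfo saved on a fast node and a slow node. Two empty lists would suggest the difference lies elsewhere.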
So I have turned to VTune in an effort to figure this out. VTune has pointed out several issues with the code (which we are working on) and there are improvements to be made, but so far I don't see anything that would tell me why it runs slow on one and faster on an "identical" system.
If it were just that one processor type was faster than the other, this wouldn't be an issue. But I have been running tests, poring over VTune output, and hitting up forums for the past few days, and I feel like I am not getting anywhere in explaining this mystery.
I would greatly appreciate advice or suggestions on how to figure out why there is such a large difference between "identical" systems. What should I be looking for in VTune? Is there a specific test I should run?
Thanks!