We have KNL and SKX systems running CentOS kernel 3.10.0-693.17.1.
The KNL systems are currently running the Intel sep4_1 driver that came with VTune Amplifier 2018.0.2 build 525261, while the SKX systems are running with the "perf events" driver.
In both cases, attempting to use the "-collect memory-access" option to amplxe-cl results in repeated kernel emergency messages along the lines of:
Uhhuh. NMI received for unknown reason xx on CPU yy.
Do you have a strange power-saving mode enabled?
Dazed and confused, but trying to continue
On the KNL systems the "unknown reason" alternates between 29 and 39, and the message typically shows up for all cores. On SKX systems the "unknown reason" typically alternates between 20 and 30, and the message also typically shows up for all cores.
The nodes don't crash -- indeed, the amplxe-cl job finishes and prints out its summary report. BUT, these messages printed by "pr_emerg()" are echoed to all root windows on the master node, where they make the system operators cranky. Cranky operators often kill the offending jobs.
On the SKX nodes, about the time the unexpected NMIs start, we see a handful of messages like:
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 585758.001 msec
and sometimes:
hrtimer: interrupt took 25076958 ns
The perf_event_nmi_handler message seems weird -- 585758 msec is almost 10 minutes, and this message appeared within 3 minutes of the start of the job. The hrtimer number (about 25 milliseconds) is more plausible, but no less concerning.
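As a quick sanity check on the units in those two messages, here is a minimal Python sketch. The variable names are mine; the values are copied from the log lines quoted above.

```python
# Unit conversions for the two timings quoted in the kernel messages.
nmi_handler_msec = 585758.001   # from the perf_event_nmi_handler message
hrtimer_ns = 25076958           # from the hrtimer message

nmi_handler_minutes = nmi_handler_msec / 1000.0 / 60.0   # msec -> sec -> min
hrtimer_msec = hrtimer_ns / 1.0e6                        # ns -> msec

print(f"NMI handler: {nmi_handler_minutes:.2f} minutes")  # 9.76 minutes
print(f"hrtimer:     {hrtimer_msec:.2f} msec")            # 25.08 msec
```

So the reported NMI handler time really is longer than the job had been running, while the hrtimer interrupt time is about 25 msec.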
On the KNL nodes (running sep), there are no other interesting messages in the log -- just repetitions of the trio of "Dazed and confused" messages for the duration of the job. The log that I am staring at now repeats this trio of messages 1722 times during the 18 minutes that VTune was running, then everything appears to have returned to normal.
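For anyone who wants to tally these messages in their own logs, here is a rough sketch of how the counting could be done in Python. The function name and the sample lines are made up for illustration. The pattern assumes the message format quoted above and accepts hex digits for the reason code.

```python
import re
from collections import Counter

# Matches the "unknown reason xx on CPU yy" line; hex digits are allowed
# for the reason code (an assumption about how the kernel prints it).
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-fA-F]+) on CPU (\d+)")

def summarize_nmi(lines):
    """Count unknown-NMI messages by reason code and by CPU number."""
    by_reason = Counter()
    by_cpu = Counter()
    for line in lines:
        m = NMI_RE.search(line)
        if m:
            by_reason[m.group(1)] += 1
            by_cpu[int(m.group(2))] += 1
    return by_reason, by_cpu

# Illustrative sample lines (not copied from a real log).
sample = [
    "Uhhuh. NMI received for unknown reason 29 on CPU 0.",
    "Do you have a strange power-saving mode enabled?",
    "Dazed and confused, but trying to continue",
    "Uhhuh. NMI received for unknown reason 39 on CPU 0.",
    "Uhhuh. NMI received for unknown reason 29 on CPU 1.",
]

by_reason, by_cpu = summarize_nmi(sample)
print("by reason:", dict(by_reason))
print("by CPU:   ", dict(by_cpu))
```

To run it against a real log, pass an open file object (for example the system log on the affected node) to summarize_nmi() instead of the sample list.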
As a short-term workaround, I have found that collecting uncore counters "manually" using "-collect-with runsa -knob event-config=..." completes data collection without generating the irritating kernel messages, but I have not yet looked in detail at the collected data.
In the slightly longer term, we plan to install and test Intel Parallel Studio 2018 update 2 along with the corresponding SEP kernel module. Does anyone know if this is likely to provide any benefit with regard to this class of problems?