Channel: Intel® VTune™ Profiler (Intel® VTune™ Amplifier)

Performance impact of DTLB misses


I apologise in advance if I'm posting in the wrong forum.

I'm having an issue (or simply a misunderstanding) with TLB miss measurement.

In order to test our measurement, I'm simply measuring a poorly optimized matrix multiplication of two 1000x1000 matrices, using the code below:

/**
 * Performs a parallelized matrix multiplication in I, J, K order.
 */
void Parallel_IJK(int n, int** a, int** b, int** c) {
   parallel_for(0, n, [&](int i)
   {
      for (int j = 0; j < n; j++) {
         int sum = 0;
         // b[k][j] walks down a column: each iteration touches a different
         // row of b, so consecutive accesses can land on different pages.
         for (int k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
         c[i][j] = sum;
      }
   });
}

I'm measuring it on the following Intel microprocessor: http://ark.intel.com/products/37150/Intel-Core-i7-950-Processor-8M-Cache...

In order to read the counters I'm using the Intel Performance Counter Monitor utility, modified to measure TLB counters. The performance impact itself is computed according to the formula on page 24 of this Intel presentation ( https://software.intel.com/sites/default/files/88/fe/core-i7-processor-f... ), as: TLB MISSES = (DTLB_LOAD_MISSES.WALK_COMPLETED * 30) / CPU_CLK_UNHALTED.THREAD.
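In case it clarifies my setup, here is a minimal C++ sketch of how I compute that ratio from the raw counter deltas (the 30-cycle walk cost is the estimate from the presentation; the function name, variable names, and the sample inputs are my own):

#include <cstdint>
#include <cstdio>

// Estimated fraction of core cycles spent in completed DTLB load walks,
// assuming ~30 cycles per completed page walk (per the presentation).
double TlbMissImpact(uint64_t dtlb_load_misses_walk_completed,
                     uint64_t cpu_clk_unhalted_thread) {
    if (cpu_clk_unhalted_thread == 0) return 0.0;
    return (dtlb_load_misses_walk_completed * 30.0) /
           static_cast<double>(cpu_clk_unhalted_thread);
}

int main() {
    // Hypothetical counter deltas over one sampling interval.
    std::printf("TLB impact: %.3f\n", TlbMissImpact(1000000, 50000000));
    return 0;
}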

Here's the issue: the TLB impact on a single core seems to be very closely correlated with L2 misses. My questions are:

  • Is it normal that TLB misses are correlated with L2 misses (there seem to be roughly as many L2 misses as TLB misses) while there is a very low number of L3 misses? I'm not an expert in the domain, but to my understanding TLB misses should be correlated with L3 misses rather than L2 misses, yet the data tells me otherwise.
  • When I sample at a high rate (every millisecond), the TLB impact formula often gives me a ratio higher than 1, sometimes even 5 (= 500%). The same goes for the L2 performance impact. In the documentation it's written that this is a "ratio that is usually between 0 and 1; in some cases could be >1.0 due to a lower memory latency estimation". Could you elaborate?

Below are some graphs illustrating my issue, measured by sampling my matrix multiplication over a 100 ms interval. This is just a single core of the system.

 


Significant analysis time increase after switching from Qt4 to Qt5


Hi,

we regularly use VTune to optimize our software. Recently we switched from linking against Qt4 to linking against Qt5. Since exactly this modification, advanced hotspots analyses of some of our software workflows take significantly longer. Example runtimes for exactly the same workflow:

  • Qt4: 14 s runtime without VTune attached, 48 s runtime with VTune attached
  • Qt5: 18 s runtime without VTune attached, 4 minutes with VTune attached

So this is an analysis runtime increase of a factor of 5. (By "analysis runtime" I mean the time that elapses between pressing the start and the stop button.)

I am currently analysing why our software runs slower with Qt5. The first hint, which I found with VTune, is that the QMutex implementation in Qt5 is completely different from Qt4 (e.g. there is no spinning anymore before locking a mutex).

Has anyone else experienced similar slow-downs, especially concerning Qt4/Qt5? Can anyone suggest useful analysis steps? (What I am currently trying: patching the Qt4 mutex into Qt5 to see what happens. First result: the VTune runtime is still slower with the patched Qt5.)

Best regards, Hagen

Cannot link ittnotify with generated object files


Hello,

I am trying to use the VTune API (ittnotify.h) in C++ code. When I compile and link the files with a single command, it works correctly. But when performing separate compilation and linking, I get the following error:

test.cpp:6: undefined reference to `__itt_pause_ptr__3_0'
test.cpp:6: undefined reference to `__itt_pause_ptr__3_0'
test.cpp:7: undefined reference to `__itt_resume_ptr__3_0'
test.cpp:7: undefined reference to `__itt_resume_ptr__3_0'

Solution 1:

icpc -Wall -c test.cpp -g -I/data/development/intel/vtune_amplifier_xe_2015/include -o test.o
icpc -g -I/data/development/intel/vtune_amplifier_xe_2015/include /data/development/intel/vtune_amplifier_xe_2015/lib64/libittnotify.a -lpthread test.o -o test

Solution 2: 

icpc -g test.cpp -I/data/development/intel/vtune_amplifier_xe_2015/include /data/development/intel/vtune_amplifier_xe_2015/lib64/libittnotify.a -lpthread -o test 

test.cpp

#include "stdio.h"
#include "ittnotify.h"

int main(int argc, char* argv[])
{
    __itt_pause();
    __itt_resume();
    return 1;
}

For my own reasons I cannot use the single-command version shown in Solution 2. Can you please help me fix the steps in Solution 1?
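If it is relevant: my understanding is that GNU-style linkers only pull symbols out of a static archive that appears after the object files referencing it on the command line, so I suspect the ordering in Solution 1 is the problem. A sketch of what I believe the corrected link step would look like (not yet verified on my side):

icpc -g -I/data/development/intel/vtune_amplifier_xe_2015/include test.o /data/development/intel/vtune_amplifier_xe_2015/lib64/libittnotify.a -lpthread -o test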

System: SLES 11 SP3, Intel Parallel Studio 2015 Pro Update 1

 

amplxe: Warning: Cannot start collection of GPU events


Hi,

My system is Windows 8.1 Pro. I use VTune 2015 and I want to do some GPU analysis and collect some GPU metrics. I try to run the "CPU/GPU Concurrency Analysis" in VTune Amplifier, but it cannot run.

When I use the VTune GUI and run the "CPU/GPU Concurrency Analysis", this appears in the Collection log:

!Cannot start collection of GPU events

and a window pops up saying:

amplxe-runss.exe has stopped working

I cannot even close the command-line window.

 

Then I tried the command line but got the same error. What should I do?

C:\Windows\system32>"C:\Program Files (x86)\Intel\VTune Amplifier XE 2015\bin64\amplxe-cl" -collect cpugpu-concurrency -app-working-dir E:\Projects\test\Debug -- E:\Projects\test\Debug\test.exe
amplxe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: amplxe-cl -r C:\Windows\system32\r001cgc -command stop.
amplxe: Warning: Cannot start collection of GPU events

 

I recall that I also installed Intel INDE and PCM yesterday. Does that matter? Do I need to reinstall my operating system?

Hope for your reply. Thanks!

CentOS 6.5 + composer_xe_2013_sp1.2.144: libiomp5.so panic


Dear all,

We are running this in a Docker container.

Kernel:

CentOS release 6.5 (Final)  2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
stepping        : 4
cpu MHz         : 2599.825
cache size      : 20480 KB
physical id     : 0
siblings        : 16
core id         : 6
cpu cores       : 8
apicid          : 13
initial apicid  : 13
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 5199.65
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
 

Panic messages:

The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /export/servers/nginx/sbin/nginx
[Thread debugging using libthread_db enabled]
init sucess
[New Thread 0x7ffff7fa5700 (LWP 7301)]
[New Thread 0x7ffff1699700 (LWP 7302)]
[New Thread 0x7ffff1298700 (LWP 7303)]
[New Thread 0x7ffff0e97700 (LWP 7304)]
[New Thread 0x7ffff0a96700 (LWP 7305)]
[New Thread 0x7ffff0695700 (LWP 7306)]
*** glibc detected *** /export/servers/nginx/sbin/nginx: double free or corruption (!prev): 0x0000000000923f80 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75e66)[0x7ffff5623e66]
/lib64/libc.so.6(+0x789b3)[0x7ffff56269b3]
/usr/lib/libGraphicsMagick.so.3(+0x133a7c)[0x7ffff70f5a7c]
/usr/lib/libGraphicsMagick.so.3(+0x13bbeb)[0x7ffff70fdbeb]
/opt/intel/composer_xe_2013_sp1.2.144/ipp/../compiler/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93)[0x7ffff511c233]
======= Memory map: ========
00400000-00509000 r-xp 00000000 fd:04 918887                             /export/servers/nginx/sbin/nginx
00708000-0071a000 rw-p 00108000 fd:04 918887                             /export/servers/nginx/sbin/nginx
0071a000-00968000 rw-p 00000000 00:00 0                                  [heap]
7fffd4000000-7fffd4021000 rw-p 00000000 00:00 0
7fffd4021000-7fffd8000000 ---p 00000000 00:00 0
7fffdc000000-7fffdc021000 rw-p 00000000 00:00 0
7fffdc021000-7fffe0000000 ---p 00000000 00:00 0
7fffe0000000-7fffe0021000 rw-p 00000000 00:00 0
7fffe0021000-7fffe4000000 ---p 00000000 00:00 0
7fffe4000000-7fffe4021000 rw-p 00000000 00:00 0
7fffe4021000-7fffe8000000 ---p 00000000 00:00 0
7fffe8000000-7fffe8021000 rw-p 00000000 00:00 0
7fffe8021000-7fffec000000 ---p 00000000 00:00 0
7fffec000000-7fffec021000 rw-p 00000000 00:00 0
7fffec021000-7ffff0000000 ---p 00000000 00:00 0
7ffff0295000-7ffff0296000 ---p 00000000 00:00 0
7ffff0296000-7ffff0696000 rw-p 00000000 00:00 0
7ffff0696000-7ffff0697000 ---p 00000000 00:00 0
7ffff0697000-7ffff0a97000 rw-p 00000000 00:00 0
7ffff0a97000-7ffff0a98000 ---p 00000000 00:00 0
7ffff0a98000-7ffff0e98000 rw-p 00000000 00:00 0
7ffff0e98000-7ffff0e99000 ---p 00000000 00:00 0
7ffff0e99000-7ffff1299000 rw-p 00000000 00:00 0
7ffff1299000-7ffff129a000 ---p 00000000 00:00 0
7ffff129a000-7ffff329a000 rw-p 00000000 00:00 0
7ffff329a000-7ffff32a6000 r-xp 00000000 fd:04 524365                     /lib64/libnss_files-2.12.so
7ffff32a6000-7ffff34a6000 ---p 0000c000 fd:04 524365                     /lib64/libnss_files-2.12.so
7ffff34a6000-7ffff34a7000 r--p 0000c000 fd:04 524365                     /lib64/libnss_files-2.12.so
7ffff34a7000-7ffff34a8000 rw-p 0000d000 fd:04 524365                     /lib64/libnss_files-2.12.so
7ffff34a8000-7ffff34af000 r-xp 00000000 fd:04 524528                     /lib64/librt-2.12.so
7ffff34af000-7ffff36ae000 ---p 00007000 fd:04 524528                     /lib64/librt-2.12.so
7ffff36ae000-7ffff36af000 r--p 00006000 fd:04 524528                     /lib64/librt-2.12.so
7ffff36af000-7ffff36b0000 rw-p 00007000 fd:04 524528                     /lib64/librt-2.12.so
7ffff36b0000-7ffff36ef000 r-xp 00000000 fd:04 657779                     /usr/lib64/libjpeg.so.62.0.0
7ffff36ef000-7ffff38ef000 ---p 0003f000 fd:04 657779                     /usr/lib64/libjpeg.so.62.0.0
7ffff38ef000-7ffff38f0000 rw-p 0003f000 fd:04 657779                     /usr/lib64/libjpeg.so.62.0.0
7ffff38f0000-7ffff3900000 rw-p 00000000 00:00 0
7ffff3900000-7ffff3952000 r-xp 00000000 fd:04 1323448                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libintlc.so.5
7ffff3952000-7ffff3b52000 ---p 00052000 fd:04 1323448                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libintlc.so.5
7ffff3b52000-7ffff3b55000 rw-p 00052000 fd:04 1323448                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libintlc.so.5
7ffff3b55000-7ffff3b56000 rw-p 00000000 00:00 0
7ffff3b56000-7ffff3b5b000 r-xp 00000000 fd:04 1323378                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libirng.so
7ffff3b5b000-7ffff3d5b000 ---p 00005000 fd:04 1323378                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libirng.so
7ffff3d5b000-7ffff3d5d000 rw-p 00005000 fd:04 1323378                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libirng.so
7ffff3d5d000-7ffff4721000 r-xp 00000000 fd:04 1323415                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libsvml.so
7ffff4721000-7ffff4921000 ---p 009c4000 fd:04 1323415                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libsvml.so
7ffff4921000-7ffff4958000 rw-p 009c4000 fd:04 1323415                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libsvml.so
7ffff4958000-7ffff4bd9000 r-xp 00000000 fd:04 1323461                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libimf.so
7ffff4bd9000-7ffff4dd8000 ---p 00281000 fd:04 1323461                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libimf.so
7ffff4dd8000-7ffff4e1b000 rw-p 00280000 fd:04 1323461                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libimf.so
7ffff4e1b000-7ffff4e8c000 r-xp 00000000 fd:04 524525                     /lib64/libfreebl3.so
7ffff4e8c000-7ffff508b000 ---p 00071000 fd:04 524525                     /lib64/libfreebl3.so
7ffff508b000-7ffff508d000 r--p 00070000 fd:04 524525                     /lib64/libfreebl3.so
7ffff508d000-7ffff508e000 rw-p 00072000 fd:04 524525                     /lib64/libfreebl3.so
7ffff508e000-7ffff5092000 rw-p 00000000 00:00 0
7ffff5092000-7ffff5177000 r-xp 00000000 fd:04 1323455                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libiomp5.so
7ffff5177000-7ffff5377000 ---p 000e5000 fd:04 1323455                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libiomp5.so
7ffff5377000-7ffff5382000 rw-p 000e5000 fd:04 1323455                    /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64/libiomp5.so
7ffff5382000-7ffff53aa000 rw-p 00000000 00:00 0
7ffff53aa000-7ffff53ac000 r-xp 00000000 fd:04 524519                     /lib64/libdl-2.12.so
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff0695700 (LWP 7306)]
0x00007ffff55e0625 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff55e0625 in raise () from /lib64/libc.so.6
#1  0x00007ffff55e1e05 in abort () from /lib64/libc.so.6
#2  0x00007ffff561e537 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff5623e66 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff56269b3 in _int_free () from /lib64/libc.so.6
#5  0x00007ffff70f5a7c in DestroyEdge (polygon_info=0x1c32, mid=3.2252605360516574e-319, fill=7306, fill_rule=6, x=-1, y=0,
    stroke_opacity=0x1) at magick/render.c:790
#6  GetPixelOpacity (polygon_info=0x1c32, mid=3.2252605360516574e-319, fill=7306, fill_rule=6, x=-1, y=0, stroke_opacity=0x1)
    at magick/render.c:3180
#7  0x00007ffff70fdbeb in L_DrawPolygonPrimitive_3546__par_loop1_2_1200 () at magick/render.c:3594
#8  0x00007ffff511c233 in L_kmp_invoke_pass_parms ()
   from /opt/intel/composer_xe_2013_sp1.2.144/ipp/../compiler/lib/intel64/libiomp5.so
#9  0x00007fffffff5098 in ?? ()
#10 0x00007fffffff5038 in ?? ()
#11 0x00007fffffff5090 in ?? ()
#12 0x00007fffffff5048 in ?? ()
#13 0x00007fffffff5078 in ?? ()
#14 0x00007fffffff5074 in ?? ()
#15 0x00007fffffff5028 in ?? ()
#16 0x00007fffffff5030 in ?? ()
#17 0x00007fffffff5010 in ?? ()
#18 0x00007fffffff5018 in ?? ()
#19 0x00007fffffffb688 in ?? ()
#20 0x00007ffff752403c in .2.23_2__kmpc_chunk_pack_.27 () from /usr/lib/libGraphicsMagick.so.3

VTune supported debug formats


Hi!

We have our own compiler that can produce debug information for executables. We currently generate the HLL and CodeView debug formats. Previous versions of VTune recognized at least the CodeView format in our executables, but the current VTune version recognizes neither format. I was not able to find information about which debug formats are supported by VTune, only which compilers. Since GCC is among the supported compilers, I can deduce that the DWARF format is supported. We can try to support DWARF, but can you confirm that CodeView and HLL are no longer supported?

 

"Cannot start collection because the Intel VTune Amplifier XE 2013 failed to create a result directory" when running with MPI


Hi all,

I get this error:

Cannot start collection because the Intel VTune Amplifier XE 2013 failed to create a result directory. Unknown error.

when I run the following amplxe-cl command:

mpiexec -np 3 ~/local/vtune_amplifier_xe_2013_update7/vtune_amplifier_xe_2013/bin64/amplxe-cl -r mpi003 --collect hotspots -- ~/local/MyBuilds/HDGProject/release/install/bin/MyFESolverDP

The errors only come from the non-master processes. If I run the above command with -np 1 (only one process), everything works fine. So, it seems that the non-master processes cannot create a results directory.

Does anyone know what might be going on here, or have any suggestions?

Thank you for your time!

John

How to do a basic hotspot profile with Intel 2015 + VS 2012 + MPI


Hi,

I just bought the Intel Studio 2015 Cluster Edition for Windows. I need to do a basic hotspot profile with Intel 2015 + VS 2012 + MPI for my CFD code. I am using a workstation with 2 CPUs (2x12 cores). Using 1 CPU, I managed to run the analysis successfully.

I can run my code in parallel using Intel's wmpiexec.exe. I can also run my code through VS 2012 by:

setting the launch command (Configuration Properties - Debugging - Command) to the full path of mpiexec.smpd.exe (e.g. C:\Program Files (x86)\MPICH2\bin\mpiexec.exe)

setting the arguments (Configuration Properties - Debugging - Command Arguments) to -n xxx $(TargetPath), where xxx is the number of CPUs

However, I can't get it to work with VTune when I need to use more than one core.

It reports the wrong functions as the top hotspots, and my CPU usage is 0.

Any solution?

Thanks!


Events for measuring branch mispredictions?


I would like to measure the number of branch mispredictions (not specific to indirect/conditional/return branches, etc.). Do I understand the following two event counters correctly?

BACLEARS.ANY

This will tell me exactly how many branch mispredictions occurred? (This is probably the event counter I am really looking for.)

 

BR_MISP_RETIRED.ALL_BRANCHES_PS

This will tell me how many retired branch instructions were later determined to be mispredicted?

 

Do I have those correct? Is a BACLEAR signal sent by the front-end to clear the pipeline when an execution unit detects that a branch was mispredicted? And, therefore, will BACLEARS.ANY tell me how many branch mispredictions occurred?
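For anyone reproducing this: one way to collect both events in a single run is VTune's custom event-based sampling collector. A sketch (assuming both event names are valid on the target CPU; exact names vary by microarchitecture):

amplxe-cl -collect-with runsa -knob event-config=BR_MISP_RETIRED.ALL_BRANCHES_PS,BACLEARS.ANY -- ./myapp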

Collecting call counts - possible with MPI?


Hello. I am attempting to calculate estimated call counts using the command-line instructions on this page: https://software.intel.com/en-us/articles/calculating-estimated-call-cou.... The OS is RHEL 6.4 with kernel version 2.6.32-358.6.1.el6.x86_64.

My command line input is:  mpirun -n 2 amplxe-cl -collect lightweight-hotspots -r <results directory> -knob enable-stack-collection=true -knob enable-call-counts=true -- /opt/parallel/dev/v20150/cmx/impi_3.4.4/cmx QBV_TILT_Test2

Before the application starts to execute, this produces:

amplxe: Error: PMU resource(s) currently being used by another profiling tool or process.
amplxe: Internal Error

This error occurs even immediately after a system reboot.  The error does not occur when mpirun is not used.  Is there a workaround for this, or is this type of collection not supported when using MPI?

Thanks for any input.

 

Confusion about the Summary Elapsed Time


When I profile my code, the summary shows an elapsed time of 1283.389, while the CPU time listed under it is 3685.21. I thought the CPU time should be smaller than the elapsed time. If that is not the case, what does the CPU time include?
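(For scale: 3685.21 / 1283.389 ≈ 2.87, i.e. close to three logical CPUs' worth of time per second of wall-clock time, which would be consistent with the CPU time being summed across all threads.)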

Intel VTune cannot collect information & causes a core dump


Hello,
We are using Intel VTune 2015 to profile our application, which runs in the following environment:
Operating system:  2.6.32-504.1.3.el6.x86_64, Red Hat Enterprise Linux Server release 6.6 (Santiago)
CPU:               Intel(R) Xeon(R) E5/E7 v2 processor
Frequency:         2800004679
Logical CPU Count: 4
----------------------------------------------------------------
I started four instances of ngss.elf, which is our product.
# ps -ef|grep ngss
root       400 31483  0 07:34 pts/0    00:00:00 ./ngss.elf --iomn 294921
root     32508 31483  0 07:34 pts/0    00:00:00 ./ngss.elf --iomn 360459
root     32509 31483  0 07:34 pts/0    00:00:00 ./ngss.elf --iomn 393228
root     32510 31483 36 07:34 pts/0    00:00:47 ./ngss.elf --iomn 425997
----------------------------------------------------------------------
Then I used the following command:
# ./amplxe-cl -collect hotspots -run-pass-thru=-no-altstack -target-pid=32510

----------------------------------------------------------------------
I used the following command to stop VTune, but it didn't work:
#./amplxe-cl -r /opt/intel/vtune_amplifier_xe_2015.1.0.367959/bin64/r007hs -command stop
-----------------------------------------------------------------------
So I typed CTRL+C here.
The process just became defunct:
root     32510 31483  1 07:34 pts/0    00:00:51 [ngss.elf] <defunct>
Then, boom! A core dump happened after a while:
[3] - Memory fault(coredump)   ./ngss.elf --iomn 425997 &

# ps -ef|grep ngss
root       400 31483  0 07:34 pts/0    00:00:02 ./ngss.elf --iomn 294921
root      1468 31483  0 07:48 pts/0    00:00:00 grep ngss
root     32508 31483  0 07:34 pts/0    00:00:02 ./ngss.elf --iomn 360459
root     32509 31483  0 07:34 pts/0    00:00:02 ./ngss.elf --iomn 393228

 

#######################################################
kill -9 32508
#####################################################
# ./amplxe-cl -collect hotspots -duration 5 -target-pid=32508
amplxe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: amplxe-cl -r /opt/intel/vtune_amplifier_xe_2015.1.0.367959/bin64/r009hs -command stop.

 

amplxe: Collection detached.
amplxe: Collection stopped.
amplxe: Using result path `/opt/intel/vtune_amplifier_xe_2015.1.0.367959/bin64/r009hs'
amplxe: Executing actions 34 % Precomputing frequently used data
amplxe: Warning: Cannot find data to precompute. Skipping the precomputation step.
amplxe: Executing actions 50 % Generating a report

Collection and Platform Info
----------------------------
Parameter                 r009hs
------------------------  --------------------------------------------------------------------------------
Application Command Line
Operating System          2.6.32-504.1.3.el6.x86_64 Red Hat Enterprise Linux Server release 6.6 (Santiago)
Computer Name             isc01-s00c02h0
Result Size               1380818
Collection start time     08:45:20 04/02/2015 UTC
Collection stop time      08:45:20 04/02/2015 UTC

CPU
---
Parameter          r009hs
-----------------  -----------------------------------
Name               Intel(R) Xeon(R) E5/E7 v2 processor
Logical CPU Count  4

Summary
-------
Elapsed Time:  0.000
amplxe: Executing actions 100 % done
drwx------  6 root root    4096 Feb  4 08:45 r009hs
# ./amplxe-cl -report hotspots
amplxe: Using result path `/opt/intel/vtune_amplifier_xe_2015.1.0.367959/bin64/r009hs'
amplxe: Executing actions 50 % Generating a report

Empty request output.
amplxe: Executing actions 100 % done
amplxe: Error: 0x40000027 (Reporter error)

So, I have three questions:

1. What happens when I press CTRL+C? Does VTune send a signal or some other message to the process?

2. What happens after I enter "kill -9 pid"?

3. Why did "Empty request output" happen?

Thanks

vtune_amplifier_xe_2013 + how to compile


Dear all,

I have vtune_amplifier_xe_2013; I used it one year ago to analyze the CPU time in my program.

I remember that it produced .dump and .xml files.

I no longer remember how to compile the program to get those files.

Nor do I remember the flags that I have to use with ifort (see my guess below).
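My guess, which I have not been able to verify, is that the only compiler-side requirement is debug symbols, i.e. something like (myprogram.f90 is a placeholder for my source file):

ifort -g -O2 myprogram.f90 -o myprogram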

Could someone help me, please?

I am not able to find the guide anymore. For now I am looking inside the vtune_amplifier_xe_2013 folder.

Thanks

 

Duration parameter of "collect"


Hello,

I am trying to gather some system-wide hardware counters for my application, X seconds after it has started, over a period of Y seconds. I am using the following command line:

amplxe-cl --collect my_custom_conf -target-duration-type=veryshort -duration 30 -no-auto-finalize -no-summary -data-limit=0 -resume-after=20000

and I expect the collection to start after 20s and last for 30s.

I have two questions:

a) I receive the following messages:
amplxe: Warning: Pause command is not supported for managed code profiling. Runtime overhead is still possible. Data size limit may be exceeded.
amplxe: Collection paused.
amplxe: Warning: To enable hardware event-based sampling, VTune Amplifier has disabled the NMI watchdog timer. The watchdog timer will be re-enabled after collection completes.
amplxe: Collection resumed.

The first warning seems to suggest that I cannot use the "-resume-after" option. However, the following messages indicate that VTune did pause the collection and resumed it at a later time. Is the warning a false alarm?

b) My second and most important question is related to the actual collection period. Although I specify "-duration 30", the collection seems to last significantly longer. When I time the command above, it takes 2m16s, while I would expect something much closer to 50s (20s delay + 30s measurement).
Also, in the rXXX.amplxe file in my result directory, I see the following entries:
<collectionTimeBegin type="u64_t">1423513112</collectionTimeBegin>
<collectionTimeEnd type="u64_t">1423513248</collectionTimeEnd>
which confirm the duration of 2m16s.
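(The arithmetic on those Unix timestamps: 1423513248 - 1423513112 = 136 seconds, i.e. exactly 2m16s.)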
How can I know how long my profiling really lasted? I need to match the profiling results to the exact execution period of the application.

OS: RHEL 7.0
Vtune version: Intel(R) VTune(TM) Amplifier XE 2015 Update 1 (build 380310) Command Line Tool

Thank you in advance.

Profiling Haswell GPU command queues with VTune

$
0
0

We are looking for OpenCL timelines showing Haswell GPU command queues.

In more detail: we are transcoding some CUDA across to OpenCL under Windows, mostly targeting Haswell's GPU. The Nvidia profilers gave us timelines containing kernels and data transfers, but we're struggling to find something comparable among the Intel tools. Code-Builder (as a VS plugin) has some simple application analysis tools, but nothing like proper timelines. We have tried Amplifier XE (GPU/CPU concurrency), but once again can't find a way to see the relationship between the various command queues. Do we have to pay extra for the "Platform Analyzer" tool before this is possible?

Second (related) question: we originally installed the "Intel SDK for OpenCL Applications" to get Code Builder under Visual Studio. If I understand the marketing right, this has now gone away, with Code Builder now bundled inside the "Intel Integrated Native Developer Experience (Intel® INDE)". The free version of this gets us the basic Code Builder, whilst the $800 version also gives us "Platform Analyzer". True?


Loop Iteration Time using VTune CLI


Hi, I am running an OpenMP code on the Intel Xeon Phi. I want to profile the code using VTune Amplifier on Stampede to find out the number of loop iterations and the number of distinct array accesses for each loop. I couldn't find the related events anywhere. I want to use the command-line interface of VTune so that I can view the results in the VTune GUI installed on my local system. Can you kindly help me with the appropriate command?
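For reference, the kind of command I have been trying on the host is along these lines (a sketch; knc-hotspots is, as far as I know, the analysis type for the first-generation Xeon Phi, and I am not sure it exposes loop-level iteration counts at all):

amplxe-cl -collect knc-hotspots -r my_result -- ./my_omp_app

The my_result directory could then be copied to my local system and opened in the VTune GUI there.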

Thanks in advance!

Regards
Jagannath

No Data Shown After Compilation of C Source Code with OpenMP


Hi,

I'm now using Intel VTune Amplifier XE 2015, and I have no problem running Tachyon (the sample code).

I was trying to analyze an executable file generated by compiling a C source file that uses the OpenMP API.

When I run the Advanced Hotspots analysis, I can't view the analysis result for the OpenMP region. It states "No Data To Show".

Below are the steps that I took to run the analysis:

1) source /opt/intel/vtune_amplifier_xe_2015/amplxe-vars.sh

2) gcc -fopenmp -g Matrix.c -O2 -o Matrix.exe

3) export EDITOR=gedit

4) ./amplxe-gui
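One thing I am unsure about and would like to have confirmed: as far as I know, VTune's OpenMP region analysis relies on the Intel OpenMP runtime, so compiling with GCC's libgomp (as in step 2) might be the reason no region data is shown. The hypothetical Intel compiler equivalent of step 2 would be something like:

icc -qopenmp -g Matrix.c -O2 -o Matrix.exe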

Thank you.

How basic hotspots analysis differs from advanced hotspots analysis


I ran the application under Intel VTune to extract information about the functions that consume most of the time. However, the same application run with the basic hotspots analysis and with the advanced hotspots analysis indicates different functions. Can you please explain the difference between these analysis types and what could be the reason for this behavior?

Thank you

Collecting CPU Time for FLOPS calculation


Hi everyone,

I am trying to estimate the number of FLOP and the FLOPS rate of my application using hardware event-based sampling (EBS) from the command line. I have placed __itt_pause() and __itt_resume() around my algorithm of interest. I run this command:

C:/Program Files (x86)/Intel/VTune Amplifier XE 2015/bin64/amplxe-cl.exe -collect-with runsa -knob event-config=FP_COMP_OPS_EXE.X87:sa=2000000 -start-paused --result-dir foo application.exe

According to this article, https://software.intel.com/en-us/articles/estimating-flops-using-event-based-sampling-ebs, the suggested way to calculate the elapsed time is to use CPU_CLK_UNHALTED.THREAD divided by the processor frequency and the number of cores. However, when I run the hotspots analysis I get the CPU time (I assume for the entire application), which should be a sufficient approximation of the elapsed time around my algorithm. Is it possible to include a "CPU time" measurement by simply adding an extra argument to my command above?
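For context, this is the calculation I am doing; a minimal C++ sketch of the article's formula (the function name and all inputs are my own; the sample-after value of 2000000 comes from my command above):

#include <cstdint>
#include <cstdio>

// FLOPS estimate per the EBS article:
//   FLOP ~= samples * sample-after value (FP_COMP_OPS_EXE.X87, SAV 2000000)
//   time ~= CPU_CLK_UNHALTED.THREAD / (frequency * cores)
double EstimateFlops(uint64_t x87_event_count,        // samples * SAV
                     uint64_t unhalted_thread_clocks, // CPU_CLK_UNHALTED.THREAD
                     double frequency_hz, int cores) {
    double seconds = unhalted_thread_clocks / (frequency_hz * cores);
    return x87_event_count / seconds;
}

int main() {
    // Hypothetical counts: 500 samples * 2000000 SAV, 3e9 clocks, 3.0 GHz, 4 cores.
    std::printf("~%.3g FLOPS\n",
                EstimateFlops(1000000000ULL, 3000000000ULL, 3.0e9, 4));
    return 0;
}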

Thanks for any help,

Linus

 

Problem with hardware counter analysis


I am trying to understand the behaviour of programs by running the hardware counter analysis. I ran a program with datasets of multiple sizes and a fixed set of counters to be measured. To my great surprise, VTune measures most of the counters and dumps them in the log for most runs, but for a few runs it measures only a very few counters from the same set. The reported runtime of the application also drops to a great extent. Is this behaviour normal, or is it some kind of bug?

 

I have attached the VTune dumps for two dataset sizes, 250000000 and 249000000 integer (4-byte) entries. There is a sudden drop in the runtime and also in the counters measured.

 

Please help!

 


