Profiling application on Xeon using VTune

April 24, 2017, 12:47 am

Latest and popular articles on Intel Technologies

Hi, I want to profile the OpenFOAM solver simpleFoam on an Intel Xeon system, but I am running into an error.

The host has two Intel(R) Xeon(R) E5-2670 CPUs. When I run the application in parallel without profiling, it executes without any error. When I run it under VTune, it fails with an error that I am unable to resolve. I am using Intel(R) VTune(TM) Amplifier XE 2015 Update 2.

The command I executed is:

mpirun -np 16 amplxe-cl -collect advanced-hotspots -r profile -- simpleFoam -parallel

I have attached the error output.

Can you help me with this?

Thank You.

Attachment: output.txt (7.59 KB)

↧

Does Parallel Studio XE 2017 support Visual Studio 2017?

April 25, 2017, 7:31 am

The subject says it all really. I am upgrading my Microsoft C++ projects to Visual Studio 2017 (VS2017) but I get a lot of errors, even though the projects don't use the Intel compiler! I was wondering if adding a Parallel Studio VS2017 integration would help, but when I modify the current Parallel Studio installation it does not offer VS2017 as an option.

Error message when building an MS C++ project in VS2017:
1>C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Platforms\Win32\PlatformToolsets\v140\ImportBefore\Intel.Libs.MKL.v140.targets(55,5): error MSB6003: The specified task executable "cmd.exe" could not be run. The working directory "\mkl\tools" does not exist.

OS: Windows Server 2016
Installed MS products: VS2012, VS2013, VS2015, VS2017
Installed Intel product: Parallel Studio XE 2017 Update 2 Cluster Edition for Windows
Hardware: Intel Xeon Platinum 8180

Thread Topic: Question

↧

VTSS not installed for Vtune2017

April 26, 2017, 5:22 pm

Hi there,

I was wondering if VTSS (required for Advanced Profiling with callstacks) should work on my system.

Intel Xeon E5-2687W v3, Haswell-E/EP Socket 2011LGA

I installed VTune as administrator, but running the commands below (also as administrator) does not work: the VTSS++ driver fails to install.

Any help/advice would be greatly appreciated.
Thanks,

Ewen

C:\Program Files (x86)\IntelSWTools\VTune Amplifier XE\bin32>amplxe-sepreg.exe -s
Checking sepdrv4_0 driver path...OK
Checking sepdrv4_0 service...
Driver status: the sepdrv4_0 service is running
Checking sepdal driver path...OK
Checking sepdal service...
Driver status: the sepdal service is running
Checking socperf2_0 driver path...OK
Checking socperf2_0 service...
Driver status: the socperf2_0 service is running
Checking VTSS++ driver path...FAILED
Checking VTSS++ service...
Driver status: the VTSS++ service is not running

C:\Program Files (x86)\IntelSWTools\VTune Amplifier XE\bin32>amplxe-sepreg.exe -i -v
Stopping service sepdrv4_0...OK
Stopping service socperf2_0...OK
Copying file C:\Program Files (x86)\IntelSWTools\VTune Amplifier XE\bin64\sepdrv\win7\socperf2_0.sys to C:\WINDOWS\System32\Drivers\socperf2_0.sys...OK
Installing service socperf2_0...OK
Warning: service socperf2_0 already exists
Starting service socperf2_0...OK
Writing startup key to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\socperf2_0...OK
Stopping service sepdrv4_0...Copying file C:\Program Files (x86)\IntelSWTools\VTune Amplifier XE\bin64\sepdrv\win7\sepdrv4_0.sys to C:\WINDOWS\System32\Drivers\sepdrv4_0.sys...OK
Installing service sepdrv4_0...OK
Warning: service sepdrv4_0 already exists
Starting service sepdrv4_0...OK
Writing startup key to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\sepdrv4_0...OK
Stopping service sepdal...OK
Copying file C:\Program Files (x86)\IntelSWTools\VTune Amplifier XE\bin64\sepdrv\win7\sepdal.sys to C:\WINDOWS\System32\Drivers\sepdal.sys...OK
Installing service sepdal...OK
Warning: service sepdal already exists
Starting service sepdal...OK
Writing startup key to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\sepdal...OK
Deleting system32/drivers/vtss.sys file...OK
Forming source path for vtss.sys...OK
Forming destination path for vtss.sys...OK
Copying file C:\Program Files (x86)\IntelSWTools\VTune Amplifier XE\bin32\.\..\bin64\sepdrv\vtss.sys to C:\WINDOWS\system32\drivers\vtss.sys...OK
Installing and starting VTSS++ driver...
FAILED

Thread Topic: Question

↧

What is _kmp_fork_barrier and how to see if there is load imbalance?

April 28, 2017, 4:09 am

I'm using Intel VTune Amplifier to see how my parallel application scales.

It scales pretty well on my 4-core laptop (considering that there are portions of the algorithm that can't be parallelized):

However, when I test it on a Knights Landing (KNL) machine, it scales horribly:

Note that I'm using only 64 cores on purpose.

Why is there so much idle time? And what is _kmp_fork_barrier? From what I read under "Imbalance or Serial Spinning (OpenMP)", this seems to be about load imbalance, but I'm already using schedule(dynamic,1) in all omp regions.

How can I see if this is actually load imbalance? Otherwise, what could be a possible cause?
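One way to check for imbalance independently of the profiler is to time how long each thread is actually busy inside the loop and compare the busiest thread with the average. The following is only a minimal sketch under stated assumptions: `unevenWork` and `imbalanceRatio` are hypothetical names, the workload is a stand-in for the real loop body, and the code falls back to a single thread when it is compiled without OpenMP.

```cpp
#include <algorithm>
#include <chrono>
#include <numeric>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical workload whose cost grows with the iteration index.
static double unevenWork(int i) {
    double acc = 0.0;
    for (int k = 0; k < (i + 1) * 2000; ++k) acc += k * 1e-9;
    return acc;
}

// Returns max(busy) / mean(busy) over the threads of the team.
// 1.0 means every thread was busy for the same time; larger means imbalance.
double imbalanceRatio(int iterations) {
    int maxThreads = 1;
#ifdef _OPENMP
    maxThreads = omp_get_max_threads();
#endif
    std::vector<double> busy(maxThreads, 0.0);
    int teamSize = 1;
    double sink = 0.0;  // keeps the compiler from discarding the work

    #pragma omp parallel reduction(+ : sink)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        if (tid == 0) teamSize = omp_get_num_threads();
#endif
        double myBusy = 0.0;  // time this thread spent in loop bodies
        #pragma omp for schedule(dynamic, 1) nowait
        for (int i = 0; i < iterations; ++i) {
            const auto t0 = std::chrono::steady_clock::now();
            sink += unevenWork(i);
            const auto t1 = std::chrono::steady_clock::now();
            myBusy += std::chrono::duration<double>(t1 - t0).count();
        }
        busy[tid] = myBusy;
    }

    const double total = std::accumulate(busy.begin(), busy.begin() + teamSize, 0.0);
    const double longest = *std::max_element(busy.begin(), busy.begin() + teamSize);
    const double mean = total / teamSize;
    return (sink >= 0.0 && mean > 0.0) ? longest / mean : 1.0;
}
```

Calling `imbalanceRatio(64)` and getting a value close to 1.0 would suggest the loop itself is balanced, in which case the idle time more likely comes from serial sections or from threads waiting at the start of a region than from uneven iterations.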

Note that I have three OpenMP parallel regions:

#pragma omp parallel for collapse(2) schedule(dynamic,1)

#pragma omp declare reduction(mergeFindAffineShapeArgs : std::vector<FindAffineShapeArgs> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for collapse(2) schedule(dynamic,1) reduction(mergeFindAffineShapeArgs : findAffineShapeArgs)

#pragma omp declare reduction(mergeFindAffineShapeArgs : std::vector<FindAffineShapeArgs> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for collapse(2) schedule(dynamic,1) reduction(mergeFindAffineShapeArgs : findAffineShapeArgs)

Is it possible that this is because of the reduction? My understanding was that it is pretty efficient (using a divide-and-conquer merge approach).

This is the bottom-up section:

See here how the most expensive functions are well parallelized (most of them):

↧

Trigger collection with program symbol.

April 28, 2017, 12:52 pm

I am evaluating VTune on Linux:

vtune_amplifier_xe_2017.2.0.499904

There is an option to delay the collection of samples by a fixed time. For more accurate testing, it would help to have an option that triggers collection on a symbol in the program, e.g. collection starts as soon as the function 'foo' is called.

This would enable precise starts of profile runs that skip the initialization phases of your code.
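Something close to this can be approximated from inside the program with the ITT API that ships with VTune, by resuming collection on entry to the function of interest. The sketch below makes several assumptions: the collection is started in a paused state (for example with the `-start-paused` option of amplxe-cl), `ittnotify.h` is on the include path and the ittnotify library is linked, and `resumeCollection`, `pauseCollection`, `initialize` and `foo` are hypothetical names. When the header is not available, the wrappers compile to no-ops.

```cpp
#if defined(__has_include)
#  if __has_include(<ittnotify.h>)
#    include <ittnotify.h>
#    define HAVE_ITT 1
#  endif
#endif

// Thin wrappers so the sketch still compiles where ittnotify.h is absent.
inline void resumeCollection() {
#ifdef HAVE_ITT
    __itt_resume();  // ask the collector to start recording
#endif
}

inline void pauseCollection() {
#ifdef HAVE_ITT
    __itt_pause();   // ask the collector to stop recording
#endif
}

// Hypothetical initialization phase that should stay out of the profile.
long initialize() {
    long sum = 0;
    for (int i = 0; i < 1000; ++i) sum += i;
    return sum;
}

// Hypothetical function of interest: collection is active only while it runs.
long foo(long seed) {
    resumeCollection();
    const long result = seed * 2;
    pauseCollection();
    return result;
}
```

This requires a source change and a rebuild, so it does not replace a true symbol-based trigger, which would also work on binaries that cannot be modified.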

Zone: Game Development

↧

Vtune support for Wordpress in docker/container

April 28, 2017, 1:25 pm

Does VTune support PHP workloads running in a Docker container on Linux?

I tried to profile a PHP workload (HHVM WordPress) running in a container with VTune, but a large share of the samples (52%) fell outside any known module. I followed the steps at https://software.intel.com/en-us/vtune-amplifier-help-2018-beta-profiling-docker-container-targets but those instructions say "Use Intel® VTune™ Amplifier's Advanced Hotspots analysis to profile native or Java* applications running in a Docker container on a Linux system". Does VTune also support PHP applications?

Command line, executed after the application has started and completed its warm-up:

/opt/intel/vtune_amplifier_xe_2017.1.0.486011/bin64/amplxe-cl -collect advanced-hotspots -knob collection-detail=stack-sampling -app-working-dir /home/username/vtune --duration 220

Thank you!

↧

_kmp huge overhead and spin time for unknown calls in OpenMP?

April 30, 2017, 1:08 am

I'm using Intel VTune to analyze my parallel application.

As you can see, there is a huge Spin Time at the beginning of the application (shown as the orange section on the left side):

It's more than 28% of the application duration (which is roughly 0.14 seconds)!

The functions involved are _clone, start_thread, _kmp_launch_thread and _kmp_fork_barrier. They look like OpenMP internals or system calls, but it's not specified where these functions are called from.

In addition, if we zoom in on the beginning of this section, we can see a region instantiation, represented by the selected region:

However, to my knowledge I never call initInterTab2d, and I have no idea whether it is called by one of the libraries that I'm using (especially OpenCV).

Digging deeper and running an Advanced Hotspots analysis, I found out a little more about the first unknown functions:

And expanding the Function/Call Stack tab:

But again, I can't really understand why these functions are called, why they take so long and why only the master thread works during them, while the others are in a "barrier"/waiting state.

I attached part of the code, in case it is useful.

Note that I have only one #pragma omp parallel region, which is the selected section of this image (on the right side):

The code structure is the following:

1. Compute some serial, non-parallelizable work. In particular, compute a chain of blurs, performed by gaussianBlur (included at the end of the code). cv::GaussianBlur is an OpenCV function that uses IPP.
2. Start the parallel region, in which three parallel for loops are used:
   - The first one calls hessianResponse.
   - A single thread adds the results to a shared vector.
   - The second one fills localfindAffineShapeArgs, which holds the data used by the third loop. The two loops can't be merged because of load imbalance.
   - The third one generates the final result in a balanced way.

Note: according to VTune's lock analysis, the critical and barrier sections are not the reason for the spinning.

Attachment: pyramid.cpp (11.8 KB)

↧

vtune collect error: valid _pthread_cleanup_push symbol is not found in the binary of the analysis target

May 2, 2017, 3:20 am

Hello all,

I get the error below when I run amplxe-cl to collect hotspots. Note: process 1183 is my application.

What is causing this? Thanks.

# amplxe-cl -collect hotspots -r temp -target-pid 1183
amplxe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: amplxe-cl -r /home/chasond/vtune/t6/session10/tt3/temp -command stop.
amplxe: Error: Valid _pthread_cleanup_push symbol is not found in the binary of the analysis target.
amplxe: Error: Binary file of the analysis target does not contain symbols required for profiling. See the 'Analyzing Statically Linked Binaries' help topic for more details.
amplxe: Error: Valid _rt0_amd64_linux symbol is not found in the binary of the analysis target.
amplxe: Collection failed.
amplxe: Internal Error

Thanks,

Chason

Thread Topic: Help Me

↧

offcore events in Xeon E5-2620 v3 is not supported

May 2, 2017, 9:48 pm

Hello

I'm trying to monitor offcore events on a Xeon E5-4620 with kernel 4.3.0. This used to work normally. After a system outage I reinstalled the system, and now when I run pmu-tools, ocperf.py no longer supports offcore events, as shown below. How can I use offcore events again, or what problem have I run into?

failed to read counter offcore_response_demand_code_rd_llc_miss_remote_dram
# time        core cpus    counts                   unit events
0.100098644  S0-C0   1   <not supported> offcore_response_demand_code_rd_llc_miss_remote_dram

Thanks,

wang,

Thread Topic: How-To

↧

Looking for the right VTune Linux version for RHEL 5.4

May 3, 2017, 11:12 am

Some of our servers are currently running Red Hat Enterprise Linux Server release 5.4 (Tikanga).

Can someone please let me know which VTune version supports RHEL 5.4, and how to download it?

Zone: Intel® RealSense™ Technology

↧

How should I interpret these VTune results?

May 4, 2017, 11:45 am

I'm trying to parallelize my application using OpenMP. OpenCV (built with IPP for best efficiency) is used as an external library.

On my Intel i7-4700MQ, the wall-clock time of the application, averaged over 10 runs, is around 0.73 seconds. I compile the code with icpc 2017 Update 3 and the following compiler flags:

INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl

In addition, I set KMP_BLOCKTIME=0 because the default value (200) was generating a huge overhead.

The code can be divided into 3 parallel regions (wrapped in a single #pragma omp parallel for efficiency), preceded by a serial part that accounts for around 25% of the algorithm (and can't be parallelized).

I'll try to describe them (or you can skip to the code structure directly):

1. We create a single parallel region to avoid the overhead of creating a new parallel region each time. The final goal is to populate the rows of a matrix object, cv::Mat descriptors. The shared objects are: (a) blurs, a chain of blurs (not parallelizable) computed with OpenCV's GaussianBlur (which uses the IPP implementation of Gaussian blur); (b) hessResps (size known, say 32); (c) findAffineShapeArgs (unknown size, but on the order of thousands of elements, say 2.3k); (d) cv::Mat descriptors (unknown size, the final result). In the serial part, we populate blurs, which is read-only afterwards.
2. In the first parallel region, hessResps is populated using blurs without any synchronization mechanism.
3. In the second parallel region, findLevelKeypoints reads hessResps (read-only) and produces the elements of findAffineShapeArgs. Since the size of findAffineShapeArgs is unknown, each thread fills a local vector, localfindAffineShapeArgs, which is appended to findAffineShapeArgs in the next step.
4. Since findAffineShapeArgs is shared and its size is unknown, we need a critical section in which each localfindAffineShapeArgs is appended to it.
5. In the third parallel region, each element of findAffineShapeArgs is used to generate rows of the final cv::Mat descriptors. Again, since descriptors is shared, each thread uses a local version, cv::Mat localDescriptors.
6. A final critical section push_backs each localDescriptors to descriptors. This is extremely fast because cv::Mat behaves like a smart pointer, so we only push_back pointers.

This is the code structure:

cv::Mat descriptors;
std::vector<Mat> blurs(blursSize);
std::vector<Mat> hessResps(32);
std::vector<FindAffineShapeArgs> findAffineShapeArgs; // size not known in advance

#pragma omp parallel
{
    // First region: compute all the Hessian responses
    #pragma omp for collapse(2) schedule(dynamic)
    for (int i = 0; i < levels; i++)
        for (int j = 1; j <= scaleCycles; j++)
        {
            hessResps[/*...*/] = hessianResponse(/*...*/);
        }

    // Second region: each thread fills its own local vector with push_back
    std::vector<FindAffineShapeArgs> localfindAffineShapeArgs;
    #pragma omp for collapse(2) schedule(dynamic) nowait
    for (int i = 0; i < levels; i++)
        for (int j = 2; j < scaleCycles; j++)
        {
            findLevelKeypoints(localfindAffineShapeArgs, hessResps[/*...*/], /*...*/);
        }

    // Append the thread-local results to the shared vector
    #pragma omp critical
    {
        findAffineShapeArgs.insert(findAffineShapeArgs.end(), localfindAffineShapeArgs.begin(), localfindAffineShapeArgs.end());
    }

    // Every thread must finish appending before the shared vector is read
    #pragma omp barrier

    // Third region: process every element of the shared vector
    #pragma omp for schedule(dynamic) nowait
    for (int i = 0; i < findAffineShapeArgs.size(); i++)
    {
        findAffineShape(findAffineShapeArgs[i]);
    }

    // localRes holds the thread-local results (its declaration is not shown in this excerpt)
    #pragma omp critical
    {
        for (size_t i = 0; i < localRes.size(); i++)
            descriptors.push_back(localRes[i].descriptor);
    }
}

At the end of the question, you can find FindAffineShapeArgs.

I'm using Intel VTune Amplifier to find hotspots and evaluate my application.

The OpenMP Potential Gain analysis says that the potential gain with perfect load balancing would be 5.8%, so we can say that the workload is balanced across the CPUs.

This is the CPU usage histogram for the OpenMP region (remember that this is the result of 10 consecutive runs):

So as you can see, the Average CPU Usage is 7 cores, which is good.

This OpenMP Region Duration Histogram shows that in these 10 runs the parallel region always takes about the same time (with a spread of around 4 milliseconds):

This is the Caller/Callee tab:

For your information:

interpolate is called in the last parallel region
l9_ownFilter* functions are all called in the last parallel region
samplePatch is called in the last parallel region.
hessianResponse is called in the second parallel region

Now, my first question is: how should I interpret the data above? As you can see, for many of the functions the "Effective Time by Utilization" is "ok" for half of the time, which would probably become "Poor" with more cores (for example on a KNL machine, where I'll test the application next).

Finally, this is the Wait and Lock analysis result:

Now, this is the first weird thing: the Join Barrier at line 276 (which corresponds to the most expensive wait object) is #pragma omp parallel, i.e. the beginning of the parallel region. So it seems that someone spawned threads before that point. Am I wrong? In addition, the wait time is longer than the program itself (0.827s for the program vs 1.253s for the Join Barrier that I'm talking about)! But maybe that figure is the wait time summed over all threads rather than wall-clock time (it cannot be wall-clock time, since it's longer than the program itself).

Then, the Explicit Barrier at line 312 is the #pragma omp barrier in the code above, and its duration is 0.183s.

Looking at the Caller/Callee tab:

As you can see, most of the wait time is rated poor, so it refers to one thread. But I'm not sure that I'm understanding this correctly. My second question is: can we interpret this as "all the threads are waiting for just one thread that is lagging behind"?

FindAffineShapeArgs definition:

struct FindAffineShapeArgs
{
    FindAffineShapeArgs(float x, float y, float s, float pixelDistance, float type, float response, const Wrapper &wrapper) :
        x(x), y(y), s(s), pixelDistance(pixelDistance), type(type), response(response), wrapper(std::cref(wrapper)) {}

    float x, y, s;
    float pixelDistance, type, response;
    std::reference_wrapper<Wrapper const> wrapper;
};

↧

The 'CPU Time' is far less than the 'Elapsed Time'.

May 7, 2017, 6:31 pm

Elapsed Time:   13.654s
CPU Time:   0.430s
Total Thread Count:   24
Paused Time:   0s

Top Hotspots
Function                               Module            CPU Time
pj_sock_recvfrom                       libpj.so.2        0.070s
pj_ioqueue_poll                        libpj.so.2        0.040s
OS_BARESYSCALL_DoCallAsmIntel64Linux   libc-dynamic.so   0.040s
sendto                                 libpthread.so.0   0.030s
func@0x3706c58730                      libasound.so.2    0.020s
[Others]                                                 0.230s
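A gap like this is expected when the threads spend most of their time blocked rather than computing: 0.430s of CPU time over 13.654s of elapsed time means the whole process kept, on average, about 3% of one core busy. The top hotspots are socket and polling calls, so one plausible reading (an assumption, since the post gives no further detail) is that the application mostly waits for network or audio I/O. The sketch below reproduces the effect in miniature. `Timing` and `timeBlockedPhase` are hypothetical names, a sleep stands in for a blocking receive, and `std::clock` is assumed to report process CPU time, as it does on Linux.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double elapsedSeconds;  // wall-clock time
    double cpuSeconds;      // CPU time consumed by the process
};

// Measures a phase that only waits, as a blocked socket read would.
Timing timeBlockedPhase(std::chrono::milliseconds wait) {
    const auto wall0 = std::chrono::steady_clock::now();
    const std::clock_t cpu0 = std::clock();

    std::this_thread::sleep_for(wait);  // stand-in for waiting on I/O

    const std::clock_t cpu1 = std::clock();
    const auto wall1 = std::chrono::steady_clock::now();

    Timing t;
    t.elapsedSeconds = std::chrono::duration<double>(wall1 - wall0).count();
    t.cpuSeconds = static_cast<double>(cpu1 - cpu0) / CLOCKS_PER_SEC;
    return t;
}
```

Calling `timeBlockedPhase(std::chrono::milliseconds(100))` should report an elapsed time of at least the requested wait and a CPU time close to zero, which is the same shape as the summary above.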

↧

I want to know how long a function is taking in one frame

May 9, 2017, 2:19 am

Hi,

Pretty simple question.

I have put frame markers in using the API, so I can see the frames in the timeline view.

All I want to do is pick a frame (any frame) in the timeline, zoom in on selection, and then search for a function to see how long that function took in that frame. The function in question is not particularly slow, but I do want to know how long it's taking.

I don't seem to be able to do this easily.

Thanks

↧

Target Android ADB not appearing in Intel VTune Amplifier XE 2017

May 12, 2017, 7:11 am

I'm trying to evaluate Intel VTune Amplifier XE to profile and find problems in our Android apps. I've been following the tutorials, but when I try to configure the project, the Android ADB target is not there.

[Screenshot: target not appearing]

What am I missing or doing wrong?

Thread Topic: How-To

↧

blue screen Parallel Studio XE 2017 Update 4

May 16, 2017, 10:18 pm

I had been profiling an application that has C#/.NET code, a C++ library, and a managed layer between the two, using Parallel Studio XE 2017 Update 2. I installed Update 4 and now cannot profile. A short time after launching VTune, my machine locks up and then reboots automatically. The Event Viewer has a WER entry that says this was a blue-screen event, and a minidump file and a full memory dump were generated. This occurs on a Windows machine with OS build 15063.296 (Creators Update). I have another machine, a Xeon workstation with the Windows Anniversary Update, but at the moment I do not have its exact specifications at hand. On that machine, VTune runs and starts the data collection, but before data collection is complete I get a crash and a window that asks whether I would like to send the report to Intel (which I have not done yet, as I wanted to investigate first). I have successfully profiled an application that is only C++, which suggests that the managed code might be the issue.

Has anyone else encountered similar problems with update 4?

Zone: Windows*

Thread Topic: Bug Report

↧

Basic Hotspots analysis not working in VTune XE2017 Update 3 (build 510739)

May 17, 2017, 3:42 am

Has anyone else encountered this?

When I start a Basic Hotspots analysis this "program has stopped working" message box appears:

This is with VTune XE2017 Update 3 (build 510739) running an executable built by Intel Fortran XE2015U1 for both Win32 and X64 platforms.

I'm running VTune as Administrator on Windows 10 Enterprise 1511 x64. The processors are 2x Xeon E5-2687W v3.

Advanced Hotspots analysis works as expected.

↧

LLC HIT RATE on i7 6700

May 18, 2017, 11:56 pm

Hello,

I am a student using Intel VTune Amplifier XE on Ubuntu 14.04, but I am not familiar with VTune.

I would like to measure the LLC hit rate on an i7 6700 processor.

I got the formula

LLC hit rate = (MEM_LOAD_RETIRED.LLC_UNSHARED_HIT * 35) / (CPU_CLK_UNHALTED.THREAD * 100) from

https://software.intel.com/en-us/forums/intel-vtune-amplifier-xe/topic/2...

But I have some questions.

1) I found out from the document "Events for Intel® Microarchitecture Code Name Skylake" that the i7 6700 (Skylake) doesn't have "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT". Which events should I use instead?

2) Could you recommend some documents on measuring the cache hit rate on i7 6700 processors?

3) With that formula, is the resulting LLC hit rate the one caused by the CPU?

4) I am running an OpenCL application on the i7 6700. Is there any way to calculate the LLC hit rate caused by the Intel integrated GPU?

Thanks.

Thread Topic: How-To

↧

Virtual Machine performance

May 19, 2017, 12:13 am

Hello

I am a student using Intel VTune Amplifier XE, and I have one question:

How can I measure virtual machine (VM) performance on Xen using Intel VTune Amplifier XE?

Thanks.

↧

seeing source code in 2018 beta

May 19, 2017, 12:21 pm

Hi,

I'm using the new interface for Amplifier and cannot seem to get it to show the source code. I'm running a debug build and have indicated the directory for the source code. I also set the working directory to the directory with the source code. The program runs and finds its data files. I give up... what am I missing?

Even better (although I'd settle for debug mode), how do I see source code when running in release mode?

thanks,

Zone: Windows*

↧

How to get LLC Miss Ratio due GPU Lookups

May 23, 2017, 6:44 am

Hello,

I am a student using Intel VTune Amplifier XE on Ubuntu 14.04, but I am not familiar with VTune.

I would like to get the GPU metric "LLC Miss Ratio due GPU Lookups".

I read the document "GPU Metrics Reference" at https://software.intel.com/en-us/node/544500

I want to measure the GPU performance of an OpenCL application on an Intel i7 6700 (6th generation Intel Core™ processor family).

Could you tell me how I can get the GPU metric "LLC Miss Ratio due GPU Lookups" from Intel VTune Amplifier XE 2017 (or for Systems)?

Thread Topic: How-To

↧