Intel® VTune™ Profiler

User Guide

ID 766319
Date 12/20/2024
Public

Java* Code Analysis from the Command Line

Intel® VTune™ Profiler provides low-overhead user-mode sampling and tracing analysis and hardware event-based sampling analysis of JIT-compiled code executed with Oracle* JDK or OpenJDK*. Analysis of interpreted Java methods is limited.

You can use hardware event-based sampling data collection, which monitors hardware events in the CPU pipeline and can identify coding pitfalls that limit efficient execution of instructions in the CPU. Hardware performance metrics can be displayed per application module, function, and Java source line. You can also run the hardware event-based sampling collection with stacks when you need to find the call path for a function called in a driver or middleware layer in your system.

Configure Java Collection

Use the following syntax to configure Java analysis from the command line:

vtune -collect <analysis_type> [-[no-]follow-child] [-mrte-mode=<mrte_mode_value>] [<-knob> <knob_name=knob_option>] [--] <target>

where

  • <analysis_type> is the type of analysis to run
  • -[no-]follow-child is an action option to collect data on the processes spawned by the target process. It is recommended to enable the option for applications launched by a script. The option is enabled by default.
  • <mrte_mode_value> is a profiling mode for the managed code. The auto mode is enabled by default.
  • <-knob> is an option that configures the analysis
  • <knob_name=knob_option> is the name of the specified knob and its value
  • <target> is the path and name of the application to analyze
NOTE:

To see all knobs available for a predefined analysis type, enter:

vtune -help collect <analysis_type>

To see knobs for a custom analysis type, enter:

vtune -help collect-with <analysis_type>
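
As a hedged illustration of the syntax above, the following sketch composes a Hotspots collection in hardware event-based sampling mode with stacks. It assumes that the sampling-mode and enable-stack-collection knobs are available for the hotspots analysis type in your VTune Profiler version (confirm with vtune -help collect hotspots). The Java class path and arguments are placeholders borrowed from the examples in this section.

```shell
# Placeholder Java target; replace with your own class path and main class.
TARGET="java -cp /home/Design/Java/mixed_call MixedCall 3 2"

# Assumed knobs for the hotspots analysis type:
#   sampling-mode=hw              -> hardware event-based sampling
#   enable-stack-collection=true  -> collect call stacks with the samples
COLLECT_CMD="vtune -collect hotspots -knob sampling-mode=hw -knob enable-stack-collection=true -mrte-mode=auto -- $TARGET"

# Print the command, then run it only if vtune is on the PATH.
echo "$COLLECT_CMD"
if command -v vtune >/dev/null 2>&1; then
    $COLLECT_CMD
fi
```

Each knob is passed with its own -knob option, and everything after -- is treated as the target command line.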

Examples

Example 1: Running Java Analysis

The following command line runs the Hotspots analysis on a java command on Linux*:

vtune -collect hotspots -- java -Xcomp -Djava.library.path=native_lib/ia32 -cp /home/Design/Java/mixed_call MixedCall 3 2

Example 2: Running Analysis for Embedded Java Command

You may embed your java command in a batch file or executable script before running the analysis. For example, on Windows* create a run.bat file with the following command:

java.exe -Xcomp -Djava.library.path=native_lib\ia32 -cp C:\Design\Java\mixed_call MixedCall 3 1

The following command line runs the Hotspots analysis on a specified batch file with embedded java command:

vtune -collect hotspots -- run.bat

Example 3: Attaching Analysis to Java Process

If your Java application needs to run for some time before profiling, or cannot be launched at the start of the analysis, you can attach VTune Profiler to the Java process. To do this, specify the following analysis target: --target-process java.

NOTE:

The dynamic attach mechanism is supported only with the Java Development Kit (JDK).

The following example attaches the Hotspots analysis to a running Java process on Linux:

vtune -collect hotspots --target-process java
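
When several Java processes are running, attaching by name may be ambiguous. A minimal sketch, assuming the -target-pid option is available in your VTune Profiler version and the pgrep utility is installed, is to resolve the process ID first. Picking the most recently started java process is an illustrative choice only.

```shell
# Find the most recently started process named exactly "java" (empty if none).
JAVA_PID=$(pgrep -n -x java || true)

if [ -n "$JAVA_PID" ] && command -v vtune >/dev/null 2>&1; then
    # Attach the Hotspots analysis to that specific process.
    vtune -collect hotspots -target-pid "$JAVA_PID"
else
    echo "No running java process found, or vtune is not on the PATH."
fi
```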

View Summary Report

VTune Profiler automatically generates the summary report when data collection completes. Similar to the Summary window available in the GUI, the command-line report provides overall performance data for your Java target.

NOTE:

For more information on analyzing the summary report data, refer to the Summary Report section.

Examples

The following example generates the summary report for the Hotspots analysis result. For user-mode sampling and tracing analysis results, the summary report includes Collection and Platform information, CPU information, and a summary of the basic metrics.
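
The command line for this example is not shown here. A plausible form, assuming the summary report type and a result directory named r001hs (the name that appears in the sample output; substitute your own), is sketched below.

```shell
# Assumed result directory name; replace with your own result.
RESULT_DIR="r001hs"
SUMMARY_CMD="vtune -report summary -r $RESULT_DIR"

# Print the command, then run it only if vtune and the result are present.
echo "$SUMMARY_CMD"
if command -v vtune >/dev/null 2>&1 && [ -d "$RESULT_DIR" ]; then
    $SUMMARY_CMD
fi
```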

On Windows:


Collection and Platform Info
----------------------------
Parameter                 r001hs
------------------------  ------------------------------------------
Operating System          Microsoft Windows 10
Result Size               21258782
Collection start time     11:58:36 15/04/2019 UTC
Collection stop time      11:58:50 15/04/2019 UTC

CPU
---
Parameter          r001hs
-----------------  -------------------------------------------------
Name               4th generation Intel(R) Core(TM) Processor family
Frequency          2494227391
Logical CPU Count  4

Summary
-------
Elapsed Time:       12.939
CPU Time:           14.813
Average CPU Usage:  1.012

On Linux:

Collection and Platform Info
----------------------------
Parameter                 r002hs
------------------------  ------------------------------------------
Application Command Line  /tmp/java_mixed_call/src/run.sh
Operating System          3.16.0-30-generic NAME="Ubuntu"
VERSION="14.04.2 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.2 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
Computer Name             10.125.21.55
Result Size               11560723
Collection start time     13:55:00 05/02/2019 UTC
Collection stop time      13:55:10 05/02/2019 UTC

CPU
---
Parameter          r001hs
-----------------  -------------------------------------------------
Name               3rd generation Intel(R) Core(TM) Processor family
Frequency          3492067692
Logical CPU Count  8

Summary
-------
Elapsed Time:       10.183
CPU Time:           19.200
Average CPU Usage:  1.885

This example generates the summary report for the Hotspots analysis (hardware event-based sampling mode) result. For hardware event-based sampling analysis results, the summary report includes Collection and Platform information, CPU information, a summary of the basic metrics, and an event summary.

Collection and Platform Info
----------------------------
Parameter                 r002hs
------------------------  ------------------------------------------
Operating System          3.16.0-30-generic NAME="Ubuntu"
VERSION="14.04.2 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.2 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
Result Size               171662827
Collection start time     10:44:34 15/04/2019 UTC
Collection stop time      10:44:50 15/04/2019 UTC

CPU
---
Parameter          r002hs
-----------------  -------------------------------------------------
Name               4th generation Intel(R) Core(TM) Processor family
Frequency          2494227445
Logical CPU Count  4

Summary
-------
Elapsed Time:       15.463
CPU Time:           6.392
Average CPU Usage:  0.379
CPI Rate:           1.318

Event summary
-------------
Hardware Event Type         Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
--------------------------  -------------------------  --------------------------------  -----------------
INST_RETIRED.ANY                          13014608235                              8276  1900000
CPU_CLK_UNHALTED.THREAD                   17158609921                              8207  1900000
CPU_CLK_UNHALTED.REF_TSC                  15942400300                              5163  1900000
BR_INST_RETIRED.NEAR_TAKEN                 1228364727                              4648  200003
CALL_COUNT                                  213650621                             75413  1
ITERATION_COUNT                             370567815                             84737  1
LOOP_ENTRY_COUNT                            162943310                             70069  1

Identify Hottest Methods

Use the hotspots command line report as a starting point for identifying program units (for example: functions, modules, or objects) that take the most processor time (Hotspots analysis), underutilize available CPUs or have long waits (Threading analysis), and so on.

The report lists the hottest program units in descending order by default, starting with the most performance-critical unit. Command-line reports provide the same data that is displayed in the default GUI analysis viewpoints.

NOTE:
  • To display a list of available groupings for a hotspots report, enter: vtune -report hotspots -r <result_dir> -group-by=?
  • To set the number of top items to include in a report, use the limit action option: vtune -report <report_type> -limit <value> -r <result_dir>

Examples

This example generates the hotspots report for the Hotspots analysis result with the default grouping by function. The result directory is not specified, so VTune Profiler uses the latest analysis result.

vtune -report hotspots

On Windows:


Function               CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over  CPU Time:Spin Time  CPU Time:Overhead Time  Module          Function (Full)        Source File   Start Address
---------------------  --------  -----------------------  ----------------------------  ----------------------------  --------------------------  -----------------------------  ----------------------------  
consume_time            10.371s                  10.371s                            0s                       10.341s                      0.020s                         0.010s                            0s                  0s                      0s  mixed_call.dll  consume_time           mixed_call.c  0x180001000  
NtWaitForSingleObject    1.609s                       0s                            0s                            0s                          0s                             0s                            0s              1.609s                      0s  ntdll.dll       NtWaitForSingleObject  [Unknown]     0x1800906f0  
WriteFile                0.245s                   0.245s                        0.009s                        0.190s                      0.030s                         0.016s                            0s                  0s                      0s  KERNELBASE.dll  WriteFile              [Unknown]     0x180001c50  
func@0x707d5440          0.114s                   0.010s                            0s                        0.010s                          0s                             0s                            0s              0.104s                      0s  jvm.dll         func@0x707d5440        [Unknown]     0x707d5440   
func@0x705be5c0          0.072s                   0.025s                            0s                        0.025s                          0s                             0s                            0s              0.047s                      0s  jvm.dll         func@0x705be5c0        [Unknown]     0x705be5c0   
...

On Linux:


Function            CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over  CPU Time:Spin Time  CPU Time:Overhead Time  Module            Function (Full)     Source File  Start Address
------------------  --------  -----------------------  ----------------------------  ----------------------------  --------------------------  -----------------------------  ----------------------------  ------------------  ----------------------  ----------------  ------------------  -----------  -------------
[libmixed_call.so]   17.180s                  17.180s                            0s                       17.180s                          0s                             0s                            0s                  0s                      0s  libmixed_call.so  [libmixed_call.so]  [Unknown]    0
[libjvm.so]           1.698s                   1.698s                        0.020s                        1.678s                          0s                             0s                            0s                  0s                      0s  libjvm.so         [libjvm.so]         [Unknown]    0
[libpthread.so.0]     0.136s                   0.136s                            0s                        0.136s                          0s                             0s                            0s                  0s                      0s  libpthread.so.0   [libpthread.so.0]   [Unknown]    0
[libtpsstool.so]      0.052s                   0.052s                            0s                        0.052s                          0s                             0s                            0s                  0s                      0s  libtpsstool.so    [libtpsstool.so]    [Unknown]    0
...

The following example generates the hotspots report for the specified Hotspots analysis result (hardware event-based sampling mode), sets the number of items to include in the report to 3, and groups the report data by application module.

vtune -report hotspots -limit 3 -r r002hs -group-by module

On Windows:


Module          CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over  CPU Time:Spin Time  CPU Time:Overhead Time  Instructions Retired  CPI Rate  Wait Rate  CPU Frequency Ratio  Context Switch Time  Context Switch Time:Wait Time  Context Switch Time:Inactive Time  Context Switch Count  Context Switch Count:Preemption  Context Switch Count:Synchronization  Module Path                                                                                 
--------------  --------  -----------------------  ----------------------------  ----------------------------  --------------------------  -----------------------------  ----------------------------  -------
mixed_call.dll   15.294s                  15.294s                        0.419s                       14.871s                      0.004s                             0s                            0s                  0s                      0s        21,148,958,284     1.907      0.000                1.149               1.401s                             0s                             1.401s                26,769                           26,769                                     0  C:\work\module Java\module Java\java_mixed_call\vc9\bin32\mixed_call.dll
jvm.dll           0.582s                   0.582s                        0.033s                        0.547s                      0.002s                             0s                            0s                  0s                      0s           792,807,896     1.513      0.437                0.899               0.047s                         0.005s                             0.042s                   462                              451                                    11  C:\Program Files (x86)\Java\jre8\bin\client\jvm.dll                                         
ntoskrnl.exe      0.404s                   0.404s                        0.034s                        0.370s                      0.001s                             0s                            0s                  0s                      0s           660,557,183     1.096      0.000                0.780                                                                                                                                                                                      C:\WINDOWS\system32\ntoskrnl.exe                                                            
...

On Linux:


Module            CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over  CPU Time:Spin Time  CPU Time:Overhead Time  Instructions Retired  CPI Rate  Wait Rate  CPU Frequency Ratio  Context Switch Time  Context Switch Time:Wait Time  Context Switch Time:Inactive Time  Context Switch Count  Context Switch Count:Preemption  Context Switch Count:Synchronization  Module Path                                                                                 
----------------  --------  -----------------------  ----------------------------  ----------------------------  --------------------------  -----------------------------  ----------------------------  ------
libmixed_call.so   15.294s                  15.294s                        0.419s                       14.871s                      0.004s                             0s                            0s                  0s                      0s        21,148,958,284     1.907      0.000                1.149               1.401s                             0s                             1.401s                26,769                           26,769                                     0  /tmp/java_mixed_call/src/libmixed_call.so
libjvm.so           0.582s                   0.582s                        0.033s                        0.547s                      0.002s                             0s                            0s                  0s                      0s           792,807,896     1.513      0.437                0.899               0.047s                         0.005s                             0.042s                   462                              451                                    11  /tmp/java_mixed_call/src/libjvm.so
...

Analyze Stacks

To get the maximum performance out of your Java application, consider writing and compiling performance-critical modules of your Java project in native languages, such as C or even assembly. This approach helps your application make full use of powerful CPU resources such as vector computing (implemented via SIMD units and instruction sets). In this case, compute-intensive functions become hotspots in the profiling results, which is expected as they do most of the work. However, you might be interested not only in hotspot functions but also in the locations in Java code from which these functions are called via the JNI interface. Tracing such cross-runtime calls in mixed-language algorithm implementations can be a challenge.

Use the callstacks report to display full stack data for each hotspot function and identify the impact of each stack on the function CPU or Wait time.

NOTE:

To display a list of available groupings for a callstacks report, enter: vtune -report callstacks -r <result_dir> -group-by=?

Example

The following command line generates the callstacks report for the specified Hotspots analysis result.
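
The command line is not reproduced here. Based on the callstacks report syntax used in the note above, a minimal sketch might look as follows; the result directory name r001hs is an assumption, so substitute your own.

```shell
# Assumed result directory name; replace with your own Hotspots result.
RESULT_DIR="r001hs"
CALLSTACKS_CMD="vtune -report callstacks -r $RESULT_DIR"

# Print the command, then run it only if vtune and the result are present.
echo "$CALLSTACKS_CMD"
if command -v vtune >/dev/null 2>&1 && [ -d "$RESULT_DIR" ]; then
    $CALLSTACKS_CMD
fi
```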

On Windows:


Function      Function Stack             CPU Time  Module                Function (Full)                 Source File     Start Address
------------  -------------------------  --------  --------------------  ------------------------------  --------------  -------------
consume_time                              10.371s  mixed_call.dll        consume_time                    mixed_call.c    0x180001000
              MixedCall::CallNativeFunc   10.371s  [Compiled Java code]  MixedCall::CallNativeFunc(int)  MixedCall.java  0x186debc0
              MixedCall::foo4                  0s  [Compiled Java code]  MixedCall::foo4(int)            MixedCall.java  0x186c1ae3
              MixedCall::foo3                  0s  [Compiled Java code]  MixedCall::foo3(int)            MixedCall.java  0x186bb583
              MixedCall::foo2                  0s  [Compiled Java code]  MixedCall::foo2(int)            MixedCall.java  0x186bb583
              MixedCall::foo1                  0s  [Compiled Java code]  MixedCall::foo1(int)            MixedCall.java  0x186bb583
              MixedCall::run                   0s  [Compiled Java code]  MixedCall::run()                MixedCall.java  0x186bb19d
              call_stub                        0s  [Dynamic code]        call_stub                       [Unknown]       0x18010827
...

On Linux:


Function            Function Stack             CPU Time  Module                Function (Full)                 Source File     Start Address
------------------  -------------------------  --------  --------------------  ------------------------------  --------------  --------------
[libmixed_call.so]                              17.180s  libmixed_call.so      [libmixed_call.so]              [Unknown]       0
                    [libmixed_call.so]           8.600s  libmixed_call.so      [libmixed_call.so]              [Unknown]       0
                    MixedCall::CallNativeFunc        0s  [Compiled Java code]  MixedCall::CallNativeFunc(int)  MixedCall.java  0x7fb63937eec0
                    MixedCall::foo4                  0s  [Compiled Java code]  MixedCall::foo4(int)            MixedCall.java  0x7fb6393831e3
                    MixedCall::foo3                  0s  [Compiled Java code]  MixedCall::foo3(int)            MixedCall.java  0x7fb63938046c
                    MixedCall::foo2                  0s  [Compiled Java code]  MixedCall::foo2(int)            MixedCall.java  0x7fb63938046c
                    MixedCall::foo1                  0s  [Compiled Java code]  MixedCall::foo1(int)            MixedCall.java  0x7fb63938046c
                    MixedCall::run                   0s  [Compiled Java code]  MixedCall::run()                MixedCall.java  0x7fb63938009b                    
...

Analyze Hardware Metrics

VTune Profiler provides advanced profiling options for optimizing Java applications for the CPU microarchitecture of your platform. Although Java and JVM technology is intended to free a developer from hardware architecture specific coding, once Java code is optimized for the current Intel microarchitecture, it will most probably keep this advantage for future generations of CPUs.

VTune Profiler counts the number of hardware events during the hardware event-based sampling collection to help you understand how your Java application utilizes available hardware resources. Use the hw-events report type to display hardware event counts per application function, in descending order by default.

NOTE:

To display a list of available groupings for a hw-events report, enter: vtune -report hw-events -r <result_dir> -group-by=?

Example

This example generates the hw-events report for the specified Hotspots analysis (hardware event-based sampling mode) result.
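
The command line is not reproduced here. Based on the hw-events report syntax used in the note above, a minimal sketch might look as follows; the result directory name r002hs is an assumption, so substitute the name of your hardware event-based sampling result.

```shell
# Assumed result directory name; replace with your own result.
RESULT_DIR="r002hs"
HW_EVENTS_CMD="vtune -report hw-events -r $RESULT_DIR"

# Print the command, then run it only if vtune and the result are present.
echo "$HW_EVENTS_CMD"
if command -v vtune >/dev/null 2>&1 && [ -d "$RESULT_DIR" ]; then
    $HW_EVENTS_CMD
fi
```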

On Windows:


Function               Hardware Event Count:INST_RETIRED.ANY  Hardware Event Count:CPU_CLK_UNHALTED.THREAD  Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC  Hardware Event Count:BR_INST_RETIRED.NEAR_TAKEN  Hardware Event Count:ITERATION_COUNT  Hardware Event Count:LOOP_ENTRY_COUNT  Hardware Event Count:CALL_COUNT  Context Switch Time  Context Switch Time:Wait Time  Context Switch Time:Inactive Time  Context Switch Count  Context Switch Count:Preemption  Context Switch Count:Synchronization  Module          Function (Full)        Source File   Start Address
---------------------  -------------------------------------  --------------------------------------------  ---------------------------------------------  -----------------------------------------------    
consume_time                                   8,649,248,560                                28,577,118,234                                 25,656,728,125                                      126,927,912                           126,914,825                                      0                                0               0.217s                             0s                             0.217s                 4,147                            4,147                                     0  mixed_call.dll  consume_time           mixed_call.c  0x180001000 
NtWaitForSingleObject                          1,683,967,360                                 3,955,057,542                                    716,832,500                                          200,003                                     0                                      0                           66,678             223.825s                        62.467s                           161.358s                 9,030                            5,158                                 3,873  ntdll.dll       NtWaitForSingleObject  [Unknown]     0x1800906f0
WriteFile                                      1,207,593,104                                 1,022,685,972                                  1,713,743,550                                                0                                     0                                      0                           61,803               0.340s                         0.003s                             0.337s                   962                              954                                     8  KernelBase.dll  WriteFile              [Unknown]     0x180001c50

On Linux:


Function            Hardware Event Count:INST_RETIRED.ANY  Hardware Event Count:CPU_CLK_UNHALTED.THREAD  Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC  Context Switch Time  Context Switch Time:Wait Time  Context Switch Time:Inactive Time  Context Switch Count  Context Switch Count:Preemption  Context Switch Count:Synchronization  Module              Function (Full)     Source File  Start Address
------------------  -------------------------------------  --------------------------------------------  ---------------------------------------------  -------------------  -----------------------------  ---------------------------------  --------------------  -------------------------------  ------------------------------------  ------------------  ------------------  -----------  -------------
[libmixed_call.so]                         21,148,958,284                                40,338,264,445                                 35,096,009,324              1.401s                             0s                             1.401s                26,769                           26,769                                     0  [libmixed_call.so]  [libmixed_call.so]  [Unknown]    0
[libjvm.so]                                   792,807,896                                 1,199,773,286                                  1,335,034,092              0.047s                         0.005s                             0.042s                   462                              451                                    11  [libjvm.so]         [libjvm.so]         [Unknown]    0
...

Limitations

VTune Profiler supports analysis of Java applications with some limitations:

  • System-wide profiling is not supported for managed code.

  • The JVM interprets some rarely called methods instead of compiling them for the sake of performance. VTune Profiler does not recognize interpreted Java methods and marks such calls as !Interpreter in the restored call stack.

    If you want such functions to be displayed in stacks with their names, force the JVM to compile them by using the -Xcomp option. Compiled methods then show up as [Compiled Java code] in the results. However, the timing characteristics may change noticeably if many small or rarely used functions are called during execution.

  • When you open source code for a hotspot, VTune Profiler may attribute events or time statistics to an incorrect piece of the code. This happens due to JDK Java VM specifics. For a loop, the performance metric may slip upward. Often the information is attributed to the first line of the hot method's source code.

  • Consider events and time mapping to the source code lines as approximate.

  • For the user-mode sampling based Hotspots analysis type, VTune Profiler may display only a part of the call stack. To view the complete stack on Windows, use the additional JDK Java VM command-line option -Xcomp, which enables JIT compilation for better stack-walking quality. On Linux, use additional JDK Java VM command-line options that change the behavior of the Java VM:

    • Use the -Xcomp option, which enables JIT compilation for better stack-walking quality.

    • On Linux* x86, use the client JDK Java VM instead of the server Java VM: either explicitly specify -client, or do not specify the -server JDK Java VM command-line option.

    • On Linux x64, specify the -XX:-UseLoopCounter command-line option, which switches off on-the-fly substitution of the interpreted method with the compiled version.

  • Java application profiling is supported for the Hotspots and Microarchitecture analysis types. Support for the Threading analysis is limited as some embedded Java synchronization primitives (which do not call operating system synchronization objects) cannot be recognized by the VTune Profiler. As a result, some of the timing metrics may be distorted.

  • There are no dedicated libraries supplying a user API for collection control in the Java source code. However, you may want to try applying the native API by wrapping the __itt calls with JNI calls.
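
The stack-walking workarounds listed in the limitations above can be combined on one command line. The sketch below targets Linux x64 and reuses the MixedCall example from this section; the class path and program arguments are placeholders, and the JVM options are the ones named in the list.

```shell
# JVM options from the limitations list (Linux x64):
#   -Xcomp                 -> JIT-compile methods so they appear by name in stacks
#   -XX:-UseLoopCounter    -> no on-the-fly swap of interpreted to compiled code
JVM_OPTS="-Xcomp -XX:-UseLoopCounter"

# Placeholder class path and main class; replace with your own.
COLLECT_CMD="vtune -collect hotspots -- java $JVM_OPTS -cp /home/Design/Java/mixed_call MixedCall 3 2"

# Print the command, then run it only if vtune is on the PATH.
echo "$COLLECT_CMD"
if command -v vtune >/dev/null 2>&1; then
    $COLLECT_CMD
fi
```

Because -Xcomp can noticeably change timing characteristics, results collected this way are best used for reading call stacks rather than for comparing absolute timings.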