Intel® VTune™ Profiler User Guide

ID 766319
Date 3/31/2023
Public

Java* Code Analysis from the Command Line

Intel® VTune™ Profiler provides low-overhead user-mode sampling and tracing analysis and hardware event-based sampling analysis of JIT-compiled code executed with Oracle* JDK or OpenJDK*. Analysis of interpreted Java methods is limited.

You may use the hardware event-based sampling data collection, which monitors hardware events in the CPU pipeline and can identify coding pitfalls that prevent instructions from executing efficiently. The hardware performance metrics can be displayed against the application modules, functions, and Java source code lines. You may also run the hardware event-based sampling collection with stacks when you need to find the call path for a function called in a driver or middleware layer of your system.

Configure Java Collection

Use the following syntax to configure Java analysis from the command line:

vtune -collect <analysis_type> [-[no-]follow-child] [-mrte-mode=<mrte_mode_value>] [<-knob> <knob_name=knob_option>] [--] <target>

where

  • <analysis_type> is the type of analysis to run
  • -[no-]follow-child is an action option to collect data on the processes spawned by the target process. It is recommended to enable the option for applications launched by a script. The option is enabled by default.
  • <mrte_mode_value> is a profiling mode for the managed code. The auto mode is enabled by default.
  • <-knob> is an option that configures the analysis
  • <knob_name=knob_option> is the name of the specified knob and its value
  • <target> is the path and name of the application to analyze
NOTE:

To see all knobs available for a predefined analysis type, enter:

vtune -help collect <analysis_type>

To see knobs for a custom analysis type, enter:

vtune -help collect-with <analysis_type>
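
The syntax above can be assembled programmatically when the analysis is launched from a script. The following is a minimal sketch in Python, not part of VTune Profiler: the helper name build_vtune_collect and its parameters are hypothetical, and it only emits the options named in the syntax line. The knob name sampling-mode used in the usage lines is an assumption; check the knobs available for your analysis type with vtune -help collect <analysis_type>.

```python
import shlex


def build_vtune_collect(analysis_type, target, follow_child=True,
                        mrte_mode=None, knobs=None):
    """Assemble a `vtune -collect` argument list (hypothetical helper).

    analysis_type: for example "hotspots"
    target:        list with the target command and its arguments
    follow_child:  False adds -no-follow-child (the option is on by default)
    mrte_mode:     optional value for -mrte-mode
    knobs:         optional dict of knob_name -> knob_option
    """
    cmd = ["vtune", "-collect", analysis_type]
    if not follow_child:
        cmd.append("-no-follow-child")
    if mrte_mode is not None:
        cmd.append(f"-mrte-mode={mrte_mode}")
    for name, value in (knobs or {}).items():
        cmd += ["-knob", f"{name}={value}"]
    cmd.append("--")          # separates VTune options from the target
    cmd += list(target)
    return cmd


# Target taken from Example 1; the knob is an assumed, illustrative value.
java_target = ["java", "-Xcomp", "-cp", "/home/Design/Java/mixed_call",
               "MixedCall", "3", "2"]
print(shlex.join(build_vtune_collect("hotspots", java_target)))
print(shlex.join(build_vtune_collect("hotspots", java_target,
                                     knobs={"sampling-mode": "hw"})))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with paths that contain spaces.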

Examples

Example 1: Running Java Analysis

The following command line runs the Hotspots analysis on a java command on Linux*:

vtune -collect hotspots -- java -Xcomp -Djava.library.path=native_lib/ia32 -cp /home/Design/Java/mixed_call MixedCall 3 2

Example 2: Running Analysis for Embedded Java Command

You may embed your java command in a batch file or executable script before running the analysis. For example, on Windows* create a run.bat file with the following command:

java.exe -Xcomp -Djava.library.path=native_lib\ia32 -cp C:\Design\Java\mixed_call MixedCall 3 1

The following command line runs the Hotspots analysis on a specified batch file with embedded java command:

vtune -collect hotspots -- run.bat

Example 3: Attaching Analysis to Java Process

If your Java application needs to run for some time before profiling, or cannot be launched at the start of the analysis, you may attach VTune Profiler to the running Java process. To do this, specify the following analysis target: --target-process java.

NOTE:

The dynamic attach mechanism is supported only with the Java Development Kit (JDK).

The following example attaches the Hotspots analysis to a running Java process on Linux:

vtune -collect hotspots --target-process java

View Summary Report

VTune Profiler automatically generates the summary report when data collection completes. Similar to the Summary window available in the GUI, the command line report provides overall performance data for your Java target.

NOTE:

For more information on analyzing the summary report data, refer to the Summary Report section.

Examples

The following example shows the summary report for a Hotspots analysis result. For user-mode sampling and tracing analysis results, the summary report includes Collection and Platform information, CPU information, and a summary of the basic metrics.

On Windows:

Collection and Platform Info
----------------------------
Parameter                 r001hs
------------------------  ------------------------------------------
Operating System          Microsoft Windows 10
Result Size               21258782
Collection start time     11:58:36 15/04/2019 UTC
Collection stop time      11:58:50 15/04/2019 UTC

CPU
---
Parameter          r001hs
-----------------  -------------------------------------------------
Name               4th generation Intel(R) Core(TM) Processor family
Frequency          2494227391
Logical CPU Count  4

Summary
-------
Elapsed Time: 12.939
CPU Time: 14.813
Average CPU Usage: 1.012

On Linux:

Collection and Platform Info
----------------------------
Parameter                 r002hs
------------------------  -------------------------------------------------------------------
Application Command Line  /tmp/java_mixed_call/src/run.sh
Operating System          3.16.0-30-generic NAME="Ubuntu" VERSION="14.04.2 LTS, Trusty Tahr" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 14.04.2 LTS" VERSION_ID="14.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
Computer Name             10.125.21.55
Result Size               11560723
Collection start time     13:55:00 05/02/2019 UTC
Collection stop time      13:55:10 05/02/2019 UTC

CPU
---
Parameter          r001hs
-----------------  -------------------------------------------------
Name               3rd generation Intel(R) Core(TM) Processor family
Frequency          3492067692
Logical CPU Count  8

Summary
-------
Elapsed Time: 10.183
CPU Time: 19.200
Average CPU Usage: 1.885

This example shows the summary report for a Hotspots analysis (hardware event-based sampling mode) result. For hardware event-based sampling analysis results, the summary report includes Collection and Platform information, CPU information, a summary of the basic metrics, and an event summary.

Collection and Platform Info
----------------------------
Parameter                 r002hs
------------------------  ------------------------------------------
Operating System          3.16.0-30-generic NAME="Ubuntu" VERSION="14.04.2 LTS, Trusty Tahr" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 14.04.2 LTS" VERSION_ID="14.04" HOME_URL="http://www.ubuntu.com/" SUPPORT_URL="http://help.ubuntu.com/" BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
Result Size               171662827
Collection start time     10:44:34 15/04/2019 UTC
Collection stop time      10:44:50 15/04/2019 UTC

CPU
---
Parameter          r002hs
-----------------  -------------------------------------------------
Name               4th generation Intel(R) Core(TM) Processor family
Frequency          2494227445
Logical CPU Count  4

Summary
-------
Elapsed Time: 15.463
CPU Time: 6.392
Average CPU Usage: 0.379
CPI Rate: 1.318

Event summary
-------------
Hardware Event Type         Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
--------------------------  -------------------------  --------------------------------  -----------------
INST_RETIRED.ANY            13014608235                8276                              1900000
CPU_CLK_UNHALTED.THREAD     17158609921                8207                              1900000
CPU_CLK_UNHALTED.REF_TSC    15942400300                5163                              1900000
BR_INST_RETIRED.NEAR_TAKEN  1228364727                 4648                              200003
CALL_COUNT                  213650621                  75413                             1
ITERATION_COUNT             370567815                  84737                             1
LOOP_ENTRY_COUNT            162943310                  70069                             1
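
When the summary report is captured by a script, the "Name: value" lines of the Summary section can be extracted with a few lines of Python. This is a minimal sketch under the assumption that the metrics appear one per line in the form shown in the examples above; the function name parse_summary_metrics is hypothetical and the sample text repeats the Windows example values.

```python
import re

# One metric per line, for example "Elapsed Time: 12.939".
METRIC_LINE = re.compile(
    r"^[ \t]*([A-Za-z][A-Za-z ]*?):[ \t]*([0-9]+(?:\.[0-9]+)?)[ \t]*$",
    re.MULTILINE,
)


def parse_summary_metrics(report_text):
    """Return a dict of metric name -> float for every 'Name: number' line."""
    return {m.group(1): float(m.group(2))
            for m in METRIC_LINE.finditer(report_text)}


sample = """Summary
-------
Elapsed Time: 12.939
CPU Time: 14.813
Average CPU Usage: 1.012
"""

metrics = parse_summary_metrics(sample)
for name, value in metrics.items():
    print(f"{name} = {value}")
```

Lines without a numeric value after the colon, such as section titles and separator rows, are skipped by the pattern.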

Identify Hottest Methods

Use the hotspots command line report as a starting point for identifying program units (for example: functions, modules, or objects) that take the most processor time (Hotspots analysis), underutilize available CPUs or have long waits (Threading analysis), and so on.

By default, the report lists program units in descending order, starting with the most performance-critical unit. The command-line reports provide the same data that is displayed in the default GUI analysis viewpoints.

NOTE:
  • To display a list of available groupings for a hotspots report, enter: vtune -report hotspots -r <result_dir> group-by=?.
  • To set the number of top items to include in a report, use the limit action option: vtune -report <report_type> -limit <value> -r <result_dir>

Examples

This example generates the hotspots report for the Hotspots analysis result with the default grouping by function. The result directory is not specified, so VTune Profiler uses the latest analysis result.

vtune -report hotspots

On Windows:

Function               CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over  CPU Time:Spin Time  CPU Time:Overhead Time  Module          Function (Full)        Source File   Start Address
---------------------  --------  -----------------------  ----------------------------  ----------------------------  --------------------------  -----------------------------  ----------------------------  ------------------  ----------------------  --------------  ---------------------  ------------  -------------
consume_time           10.371s   10.371s                  0s                            10.341s                       0.020s                      0.010s                         0s                            0s                  0s                      mixed_call.dll  consume_time           mixed_call.c  0x180001000
NtWaitForSingleObject  1.609s    0s                       0s                            0s                            0s                          0s                             0s                            1.609s              0s                      ntdll.dll       NtWaitForSingleObject  [Unknown]     0x1800906f0
WriteFile              0.245s    0.245s                   0.009s                        0.190s                        0.030s                      0.016s                         0s                            0s                  0s                      KERNELBASE.dll  WriteFile              [Unknown]     0x180001c50
func@0x707d5440        0.114s    0.010s                   0s                            0.010s                        0s                          0s                             0s                            0.104s              0s                      jvm.dll         func@0x707d5440        [Unknown]     0x707d5440
func@0x705be5c0        0.072s    0.025s                   0s                            0.025s                        0s                          0s                             0s                            0.047s              0s                      jvm.dll         func@0x705be5c0        [Unknown]     0x705be5c0
...

On Linux:

Function            CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over  CPU Time:Spin Time  CPU Time:Overhead Time  Module            Function (Full)     Source File  Start Address
------------------  --------  -----------------------  ----------------------------  ----------------------------  --------------------------  -----------------------------  ----------------------------  ------------------  ----------------------  ----------------  ------------------  -----------  -------------
[libmixed_call.so]  17.180s   17.180s                  0s                            17.180s                       0s                          0s                             0s                            0s                  0s                      libmixed_call.so  [libmixed_call.so]  [Unknown]    0
[libjvm.so]         1.698s    1.698s                   0.020s                        1.678s                        0s                          0s                             0s                            0s                  0s                      libjvm.so         [libjvm.so]         [Unknown]    0
[libpthread.so.0]   0.136s    0.136s                   0s                            0.136s                        0s                          0s                             0s                            0s                  0s                      libpthread.so.0   [libpthread.so.0]   [Unknown]    0
[libtpsstool.so]    0.052s    0.052s                   0s                            0.052s                        0s                          0s                             0s                            0s                  0s                      libtpsstool.so    [libtpsstool.so]    [Unknown]    0
...

The following example generates the hotspots report for the specified Hotspots analysis result (hardware event-based sampling mode), sets the number of items to include in the report to 3, and groups the report data by application module.

vtune -report hotspots -limit 3 -r r002hs -group-by module

On Windows:

Module          CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over  CPU Time:Spin Time  CPU Time:Overhead Time  Instructions Retired  CPI Rate  Wait Rate  CPU Frequency Ratio  Context Switch Time  Context Switch Time:Wait Time  Context Switch Time:Inactive Time  Context Switch Count  Context Switch Count:Preemption  Context Switch Count:Synchronization  Module Path
--------------  --------  -----------------------  ----------------------------  ----------------------------  --------------------------  -----------------------------  ----------------------------  ------------------  ----------------------  --------------------  --------  ---------  -------------------  -------------------  -----------------------------  ---------------------------------  --------------------  -------------------------------  ------------------------------------  -----------
mixed_call.dll  15.294s   15.294s                  0.419s                        14.871s                       0.004s                      0s                             0s                            0s                  0s                      21,148,958,284        1.907     0.000      1.149                1.401s               0s                             1.401s                             26,769                26,769                           0                                     C:\work\module Java\module Java\java_mixed_call\vc9\bin32\mixed_call.dll
jvm.dll         0.582s    0.582s                   0.033s                        0.547s                        0.002s                      0s                             0s                            0s                  0s                      792,807,896           1.513     0.437      0.899                0.047s               0.005s                         0.042s                             462                   451                              11                                    C:\Program Files (x86)\Java\jre8\bin\client\jvm.dll
ntoskrnl.exe    0.404s    0.404s                   0.034s                        0.370s                        0.001s                      0s                             0s                            0s                  0s                      660,557,183           1.096     0.000      0.780                                                                                                                                                                                                    C:\WINDOWS\system32\ntoskrnl.exe
...

On Linux:

Module            CPU Time  CPU Time:Effective Time  CPU Time:Effective Time:Idle  CPU Time:Effective Time:Poor  CPU Time:Effective Time:Ok  CPU Time:Effective Time:Ideal  CPU Time:Effective Time:Over  CPU Time:Spin Time  CPU Time:Overhead Time  Instructions Retired  CPI Rate  Wait Rate  CPU Frequency Ratio  Context Switch Time  Context Switch Time:Wait Time  Context Switch Time:Inactive Time  Context Switch Count  Context Switch Count:Preemption  Context Switch Count:Synchronization  Module Path
----------------  --------  -----------------------  ----------------------------  ----------------------------  --------------------------  -----------------------------  ----------------------------  ------------------  ----------------------  --------------------  --------  ---------  -------------------  -------------------  -----------------------------  ---------------------------------  --------------------  -------------------------------  ------------------------------------  -----------
libmixed_call.so  15.294s   15.294s                  0.419s                        14.871s                       0.004s                      0s                             0s                            0s                  0s                      21,148,958,284        1.907     0.000      1.149                1.401s               0s                             1.401s                             26,769                26,769                           0                                     /tmp/java_mixed_call/src/libmixed_call.so
libjvm.so         0.582s    0.582s                   0.033s                        0.547s                        0.002s                      0s                             0s                            0s                  0s                      792,807,896           1.513     0.437      0.899                0.047s               0.005s                         0.042s                             462                   451                              11                                    /tmp/java_mixed_call/src/libmjvm.so
...
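
Wide reports such as the ones above are easier to post-process in a delimited form. The sketch below assumes the report was produced in CSV form (for example with the -format csv and -csv-delimiter comma report options; verify their availability with vtune -help report) and that the column names match the headers shown above. The function name top_hotspots is hypothetical, and the sample rows are a shortened, illustrative excerpt of the Windows example.

```python
import csv
import io


def top_hotspots(csv_text, limit=3, key_column="CPU Time",
                 name_column="Function"):
    """Return the `limit` rows with the largest key_column value.

    Values may carry a trailing "s" (seconds), as in the text report.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        seconds = float(row[key_column].rstrip("s"))
        rows.append((row[name_column], seconds))
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:limit]


# Illustrative excerpt with three of the columns from the example report.
sample_csv = """Function,CPU Time,Module
WriteFile,0.245s,KERNELBASE.dll
consume_time,10.371s,mixed_call.dll
NtWaitForSingleObject,1.609s,ntdll.dll
"""

for name, seconds in top_hotspots(sample_csv, limit=2):
    print(f"{name}: {seconds:.3f}s")
```

The same helper works for module-level reports by passing name_column="Module".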

Analyze Stacks

To get the maximum performance out of your Java application, consider writing and compiling the performance-critical modules of your Java project in native languages, such as C or even assembly. This helps your application take advantage of powerful CPU resources such as vector computing (implemented via SIMD units and instruction sets). In this case, compute-intensive functions become hotspots in the profiling results, which is expected because they do most of the work. However, you might be interested not only in the hotspot functions, but also in the locations in the Java code from which these functions were called through the JNI interface. Tracing such cross-runtime calls in mixed-language algorithm implementations can be a challenge.

Use the callstacks report to display full stack data for each hotspot function and identify the impact of each stack on the function CPU or Wait time.

NOTE:

To display a list of available groupings for a callstacks report, enter vtune -report callstacks -r <result_dir> group-by=?.

Example

The following output shows the callstacks report for the specified Hotspots analysis result.

On Windows:

Function      Function Stack             CPU Time  Module                Function (Full)                 Source File     Start Address
------------  -------------------------  --------  --------------------  ------------------------------  --------------  -------------
consume_time                             10.371s   mixed_call.dll        consume_time                    mixed_call.c    0x180001000
              MixedCall::CallNativeFunc  10.371s   [Compiled Java code]  MixedCall::CallNativeFunc(int)  MixedCall.java  0x186debc0
              MixedCall::foo4            0s        [Compiled Java code]  MixedCall::foo4(int)            MixedCall.java  0x186c1ae3
              MixedCall::foo3            0s        [Compiled Java code]  MixedCall::foo3(int)            MixedCall.java  0x186bb583
              MixedCall::foo2            0s        [Compiled Java code]  MixedCall::foo2(int)            MixedCall.java  0x186bb583
              MixedCall::foo1            0s        [Compiled Java code]  MixedCall::foo1(int)            MixedCall.java  0x186bb583
              MixedCall::run             0s        [Compiled Java code]  MixedCall::run()                MixedCall.java  0x186bb19d
              call_stub                  0s        [Dynamic code]        call_stub                       [Unknown]       0x18010827
...

On Linux:

Function            Function Stack             CPU Time  Module                Function (Full)                 Source File     Start Address
------------------  -------------------------  --------  --------------------  ------------------------------  --------------  --------------
[libmixed_call.so]                             17.180s   libmixed_call.so      [libmixed_call.so]              [Unknown]       0
                    [libmixed_call.so]         8.600s    libmixed_call.so      [libmixed_call.so]              [Unknown]       0
                    MixedCall::CallNativeFunc  0s        [Compiled Java code]  MixedCall::CallNativeFunc(int)  MixedCall.java  0x7fb63937eec0
                    MixedCall::foo4            0s        [Compiled Java code]  MixedCall::foo4(int)            MixedCall.java  0x7fb6393831e3
                    MixedCall::foo3            0s        [Compiled Java code]  MixedCall::foo3(int)            MixedCall.java  0x7fb63938046c
                    MixedCall::foo2            0s        [Compiled Java code]  MixedCall::foo2(int)            MixedCall.java  0x7fb63938046c
                    MixedCall::foo1            0s        [Compiled Java code]  MixedCall::foo1(int)            MixedCall.java  0x7fb63938046c
                    MixedCall::run             0s        [Compiled Java code]  MixedCall::run()                MixedCall.java  0x7fb63938009b
...

Analyze Hardware Metrics

VTune Profiler provides an advanced profiling option for optimizing Java applications for the CPU microarchitecture of your platform. Although Java and JVM technology is intended to free a developer from hardware-specific coding, Java code that has been optimized for the current Intel microarchitecture will most probably keep this advantage on future generations of CPUs.

VTune Profiler counts hardware events during the hardware event-based sampling collection to help you understand how your Java application utilizes the available hardware resources. Use the hw-events report type to display hardware event counts per application function, in descending order by default.

NOTE:

To display a list of available groupings for a hw-events report, enter vtune -report hw-events -r <result_dir> group-by=?.

Example

This example generates the hw-events report for the specified Hotspots analysis (hardware event-based sampling mode) result.

On Windows:

Function               Hardware Event Count:INST_RETIRED.ANY  Hardware Event Count:CPU_CLK_UNHALTED.THREAD  Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC  Hardware Event Count:BR_INST_RETIRED.NEAR_TAKEN  Hardware Event Count:ITERATION_COUNT  Hardware Event Count:LOOP_ENTRY_COUNT  Hardware Event Count:CALL_COUNT  Context Switch Time  Context Switch Time:Wait Time  Context Switch Time:Inactive Time  Context Switch Count  Context Switch Count:Preemption  Context Switch Count:Synchronization  Module          Function (Full)        Source File   Start Address
---------------------  -------------------------------------  --------------------------------------------  ---------------------------------------------  -----------------------------------------------  ------------------------------------  -------------------------------------  -------------------------------  -------------------  -----------------------------  ---------------------------------  --------------------  -------------------------------  ------------------------------------  --------------  ---------------------  ------------  -------------
consume_time           8,649,248,560                          28,577,118,234                                25,656,728,125                                 126,927,912                                      126,914,825                           0                                      0                                0.217s               0s                             0.217s                             4,147                 4,147                            0                                     mixed_call.dll  consume_time           mixed_call.c  0x180001000
NtWaitForSingleObject  1,683,967,360                          3,955,057,542                                 716,832,500                                    200,003                                          0                                     0                                      66,678                           223.825s             62.467s                        161.358s                           9,030                 5,158                            3,873                                 ntdll.dll       NtWaitForSingleObject  [Unknown]     0x1800906f0
WriteFile              1,207,593,104                          1,022,685,972                                 1,713,743,550                                  0                                                0                                     0                                      61,803                           0.340s               0.003s                         0.337s                             962                   954                              8                                     KernelBase.dll  WriteFile              [Unknown]     0x180001c50

On Linux:

Function            Hardware Event Count:INST_RETIRED.ANY  Hardware Event Count:CPU_CLK_UNHALTED.THREAD  Hardware Event Count:CPU_CLK_UNHALTED.REF_TSC  Context Switch Time  Context Switch Time:Wait Time  Context Switch Time:Inactive Time  Context Switch Count  Context Switch Count:Preemption  Context Switch Count:Synchronization  Module              Function (Full)     Source File  Start Address
------------------  -------------------------------------  --------------------------------------------  ---------------------------------------------  -------------------  -----------------------------  ---------------------------------  --------------------  -------------------------------  ------------------------------------  ------------------  ------------------  -----------  -------------
[libmixed_call.so]  21,148,958,284                         40,338,264,445                                35,096,009,324                                 1.401s               0s                             1.401s                             26,769                26,769                           0                                     [libmixed_call.so]  [libmixed_call.so]  [Unknown]    0
[libjvm.so]         792,807,896                            1,199,773,286                                 1,335,034,092                                  0.047s               0.005s                         0.042s                             462                   451                              11                                    [libjvm.so]         [libjvm.so]         [Unknown]    0
...
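
The raw event counts in the hw-events report can be turned into derived metrics. As a worked example, the sketch below computes the CPI rate (cycles per instruction) as CPU_CLK_UNHALTED.THREAD divided by INST_RETIRED.ANY, using the counts from the Linux example above. The function name cpi_rate is hypothetical, and the formula is the usual definition of CPI assumed here rather than quoted from this topic; the results agree with the CPI Rate values shown in the module-level hotspots report (1.907 and 1.513).

```python
def cpi_rate(clockticks, instructions_retired):
    """Cycles per instruction: unhalted core cycles / retired instructions."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clockticks / instructions_retired


# Event counts copied from the Linux hw-events example above.
counts = {
    "[libmixed_call.so]": {
        "INST_RETIRED.ANY": 21_148_958_284,
        "CPU_CLK_UNHALTED.THREAD": 40_338_264_445,
    },
    "[libjvm.so]": {
        "INST_RETIRED.ANY": 792_807_896,
        "CPU_CLK_UNHALTED.THREAD": 1_199_773_286,
    },
}

for name, events in counts.items():
    cpi = cpi_rate(events["CPU_CLK_UNHALTED.THREAD"],
                   events["INST_RETIRED.ANY"])
    print(f"{name}: CPI {cpi:.3f}")
```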

Limitations

VTune Profiler supports analysis of Java applications with some limitations:

  • System-wide profiling is not supported for managed code.

  • The JVM interprets some rarely called methods instead of compiling them for the sake of performance. VTune Profiler does not recognize interpreted Java methods and marks such calls as !Interpreter in the restored call stack.

    If you want such functions to be displayed in stacks with their names, force the JVM to compile them by using the -Xcomp option. They then show up as [Compiled Java code] methods in the results. However, the timing characteristics may change noticeably if many small or rarely used functions are called during execution.

  • When opening source code for a hotspot, VTune Profiler may attribute events or time statistics to an incorrect piece of the code. This happens due to JDK Java VM specifics. For a loop, the performance metric may slip upward. Often the information is attributed to the first line of the hot method's source code.

  • Consider events and time mapping to the source code lines as approximate.

  • For the user-mode sampling based Hotspots analysis type, VTune Profiler may display only a part of the call stack. To view the complete stack on Windows, add the -Xcomp JDK Java VM command line option, which enables JIT compilation for better quality of stack walking. On Linux, use additional JDK Java VM command line options that change the behavior of the Java VM:

    • Use the -Xcomp additional command line JDK Java VM option that enables the JIT compilation for better quality of stack walking.

    • On Linux* x86, use client JDK Java VM instead of the server Java VM: either explicitly specify -client, or simply do not specify -server JDK Java VM command line option.

    • On Linux x64, specify -XX:-UseLoopCounter command line option that switches off on-the-fly substitution of the interpreted method with the compiled version.

  • Java application profiling is supported for the Hotspots and Microarchitecture analysis types. Support for the Threading analysis is limited as some embedded Java synchronization primitives (which do not call operating system synchronization objects) cannot be recognized by the VTune Profiler. As a result, some of the timing metrics may be distorted.

  • There are no dedicated libraries supplying a user API for collection control in the Java source code. However, you may want to try applying the native API by wrapping the __itt calls with JNI calls.
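
The JVM options recommended in the stack-walking limitation above depend on the operating system and architecture. A launch script could select them as in the following minimal sketch; the function name stack_walking_jvm_options and the "windows", "linux", "x86", and "x64" labels are hypothetical, and the options themselves are the ones listed in this section.

```python
def stack_walking_jvm_options(os_name, arch):
    """Return JVM options that may improve stack walking for user-mode sampling.

    os_name: "windows" or "linux" (illustrative labels)
    arch:    "x86" or "x64" (illustrative labels)
    """
    options = ["-Xcomp"]  # force JIT compilation on both Windows and Linux
    if os_name == "linux":
        if arch == "x86":
            # Use the client VM instead of the server VM.
            options.append("-client")
        elif arch == "x64":
            # Switch off on-the-fly substitution of interpreted methods.
            options.append("-XX:-UseLoopCounter")
    return options


for platform in (("windows", "x64"), ("linux", "x86"), ("linux", "x64")):
    print(platform, stack_walking_jvm_options(*platform))
```

Because -Xcomp may noticeably change timing characteristics, as noted above, it can be worth comparing results collected with and without these options.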