Intel VTune Profiler Performance Analysis Cookbook

ID 766316
Date 5/19/2023
Public

A newer version of this document is available. Customers should click here to go to the newest version.

Document Table of Contents

Inefficient TCP/IP Synchronization

This recipe shows how to locate inefficient TCP/IP synchronization in your code by running the Locks and Waits analysis of the Intel® VTune™ Amplifier with the task collection enabled.

Content expert: Kirill Uhanov

NOTE:

Locks and Waits analysis was renamed to Threading analysis starting with Intel VTune Amplifier 2019.

Ingredients

  • Application: client and server applications with TCP socket communications

  • Performance analysis tools: Intel VTune Amplifier 2018 > Locks and Waits analysis

    NOTE:
    • Starting with the 2020 release, Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler.

    • Most recipes in the Intel® VTune™ Profiler Performance Analysis Cookbook are flexible. You can apply them to different versions of Intel® VTune™ Profiler. In some cases, minor adjustments may be required.

    • Get the latest version of Intel® VTune™ Profiler:

  • Client operating system: Microsoft* Windows* Server 2016

  • Server operating system: Linux*

Run Locks and Waits Analysis

If you see your client application take much time to warm up, consider running the Locks and Waits analysis to explore wait statistics per synchronisation object:

  1. Click the New Project toolbar button to create a new project, for example: tcpip_delays.

  2. In the Analysis Target window, select the local host target system type for the host-based analysis.

  3. Select the Launch Application target type and specify an application for analysis on the right.

  4. Click the Choose Analysis button on the right, select Algorithm Analysis > Locks and Waits.

  5. Click Start.

VTune Amplifier launches the application, collects data and finalizes the data collection result resolving symbol information, which is required for successful source analysis.

Locate Synchronization Delays on the Timeline

Open the collected result and click the Bottom-up tab to view performance details per synchronization object. On the Timeline pane, you see multiple synchronization delays when the test application starts executing. To identify synchronization objects causing these startup delays, drag-and-drop to select the first 9 seconds and use the Filter In by Selection option from the context menu:

Use the Process menu on the filter bar to filter in by the host communication process:

Now, for this selected time frame, select the Socket synchronization object that generated the highest number of waits and use the Filter In by Selection menu option to filter in the data:

To investigate Wait time for the Socket synchronization object, focus on the timeline:

If you click the Zoom In button, you can see two types of socket waits: fast and slow. Most of the slow sockets waits are about 200 ms, while fast waits last about 937 usec:

To understand the cause of fast and slow waits, wrap all send/receive calls by ITT counters to calculate send/receive bytes.

Detect send/receive Buffer Size with ITT API Counters

To use Instrumentation and Tracing Technology (ITT) APIs for tracing send/receive calls:

  1. Configure your system to be able to reach the API headers and libraries.
  2. Include the ITT API header to your source file and link the <vtune-install-dir>\[lib64 or lib32]\libittnotify.lib static library to your application.

  3. Wrap send/receive calls with ITT counters:


#include <ittnotify.h>

__itt_domain* g_domain =__itt_domain_createA("com.intel.vtune.tests.userapi_counters");
__itt_counter g_sendCounter = __itt_counter_create_typedA("send_header", g_domain->nameA, __itt_metadata_s32);
__itt_counter g_sendCounterArgs = __itt_counter_create_typedA("send_args", g_domain->nameA, __itt_metadata_s32);
__itt_counter g_recieveCounter = __itt_counter_create_typedA("recieve_header", g_domain->nameA, __itt_metadata_s32);
__itt_counter g_recieveCounterCtrl = __itt_counter_create_typedA("recieve_ctrl", g_domain->nameA, __itt_metadata_s32);
__itt_counter g_incDecCounter = __itt_counter_createA("inc_dec_counter", g_domain->nameA);

.....

    sent_bytes = send(...);
    __itt_counter_set_value(g_sendCounter, &sent_bytes);
.....
    sent_bytes = send(...);
    __itt_counter_set_value(g_sendCounterArgs, &sent_bytes);
.....
    while(data_transferred < header_size)) {
        if ((data_size = recv(...) < 0) {
                               .....
    }
    __itt_counter_set_value(g_recieveCounter, &data_transferred);
.....

        while(data_transferred < data_size) {
            if ((data_size = recv(...) < 0) {
                                               ....
                                               }
        }
    }
    __itt_counter_set_value(g_recieveCounterCtrl, &data_transferred);

Recompile your application and re-run the Locks and Waits analysis with the Analyze user tasks, events, and counters option enabled:

Identify the Cause of Inefficient TCP/IP Synchronization

You see that for the new result, the VTune Amplifier added the Global Counters section to the Timeline pane that shows the distribution of the send/receive calls collected via ITT API. When you mouse over waits on the threads and counter values, you see that small instant values of the counters correspond to the long (slow) waits:

And fairly high counter values correspond to the short (fast) waits:

Profiling of communication waits on the remote target provides a symmetric picture: when you receive a small-size buffer, you have a long wait; and when you receive a sufficient buffer, you have a fast wait.

In this recipe, you see a communication command channel. Most of the commands have a small size, which results in a significant count of long waits.

The cause of the issue is the tcp ack delay mechanism that adds waits for small buffers.

If you decrease an input (setsockopt (…, SO_RCVBUF, ..,)) buffer on the server side, you can get more than 5x startup time speed-up (from dozens to a couple seconds):