Intel® Trace Analyzer and Collector User and Reference Guide

ID 767272
Date 10/31/2024
Public

Tracing Distributed Non-MPI Applications

This section describes the design, implementation and usage of Intel® Trace Collector for distributed applications.

Processes in non-MPI applications or systems are created and communicate using non-standard and varying methods. The communication may be slow or unsuitable for Intel Trace Collector's communication patterns. Therefore, a special version of the Intel Trace Collector library, libVTcs, was developed that relies neither on MPI nor on the application's own communication, but instead implements its own communication layer using TCP/IP. This is why it is called the client-server version.

The libVTcs library allows the generation of executables that work without MPI. Linking is accomplished by adding libVTcs.a (VTcs.lib on Microsoft* Windows* OS) and the libraries it needs to the link line: -lVTcs $VT_ADD_LIBS. The application has to call VT_initialize() and VT_finalize() to generate a trace file. Function tracing can be used with and without further Intel Trace Collector API calls to actually generate trace events.
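As a minimal sketch (assuming only the VT_initialize()/VT_finalize() calls named above; the bootstrapping described under Design below is omitted), a standalone libVTcs program could look like this:

#include <stdio.h>
#include <VT.h>

int main(int argc, char **argv)
{
    /* establishes the libVTcs communication; may block until the
       server side (see Design below) has contacted this process */
    if (VT_initialize(&argc, &argv) != VT_OK) {
        fprintf(stderr, "VT_initialize failed\n");
        return 1;
    }

    /* ... application work; further VT API calls can log events here ... */

    /* moves the locally collected trace data into the trace file */
    VT_finalize();
    return 0;
}

It would be built with something like cc client.c -o client -lVTcs $VT_ADD_LIBS.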

Design

The application has to meet the following requirements:

  • The application handles startup and termination of all processes itself. Both startup with a fixed number of processes and dynamic spawning of processes are supported, but spawning processes is an expensive operation and should not be done too frequently.

  • For a reliable startup, the application has to gather a short string from every process in one place to bootstrap the TCP/IP communication in Intel Trace Collector. Alternatively, one process is started first and its string is passed to the others. In that case one could assume that the string is the same for each program run, but this is less reliable because the string encodes a dynamically chosen port which may change.

  • The hostname of each process must map to an IP address that all other processes can connect to.

NOTE:
This is not the case if /etc/hosts lists the hostname as an alias for 127.0.0.1 and processes are started on different hosts. As a workaround for that case, the hostname is sent to the other processes, which then requires a working name lookup on their host systems.

Intel® Trace Collector for distributed applications consists of a special library (libVTcs) that is linked into the application's processes and the VTserver executable, which connects to all processes and coordinates the trace file writing. Linking with libVTcs is required to keep the overhead of logging events as small as possible, while VTserver can be run easily in a different process.

Alternatively, the functionality of the VTserver can be accomplished with another API call by one of the processes.

Using VTserver

This is how the application starts, collects trace data and terminates:

  1. The application initializes itself and its communication.

  2. The application initializes communication between VTserver and processes.

  3. Trace data is collected locally by each process.

  4. VT data collection is finalized, which moves the data from the processes to the VTserver, where it is written into a file.

  5. The application terminates.

The application may iterate several times over steps 2 through 4. Looping over step 3 and the trace data collection part of step 4 is not supported at the moment, because:

  • it would require more complex communication between the application and the VTserver

  • the startup time for step 2 is expected to be sufficiently small

  • reusing the existing communication would only work well if the selection of active processes does not change

If the startup time turns out to be unacceptably high, then the protocol between application and Intel Trace Collector could be revised to support reusing the established communication channels.

Initializing and Finalizing

The application has to bootstrap the communication between the VTserver and its clients. This is done as follows:

  1. The application starts its processes.

  2. Each process calls VT_clientinit().

  3. VT_clientinit() allocates a port for TCP/IP communication with the VTserver or other clients and generates a string which identifies the machine and this port.

  4. Each process gets its own string as result of VT_clientinit().

  5. The application collects these strings in one place and calls VTserver with all of the strings as soon as all clients are ready. The VT configuration is given to the VTserver in a file or through command line options.

  6. Each process calls VT_initialize() to actually establish communication.

  7. The VTserver establishes communication with the processes, then waits for them to finalize the trace data collection.

  8. Trace data collection is finalized when all processes have called VT_finalize().

  9. Once the VTserver has written the trace file, it quits with a return code indicating success or failure.

Some of the VT API calls may block, especially VT_initialize(). Execute them in a separate thread if the process needs to continue. These pending calls can be aborted with VT_abort(), for example if another process failed to initialize trace data collection. The application has to communicate this failure itself, and it also has to terminate the VTserver by sending it a kill signal, because it cannot be guaranteed that all processes and the VTserver will detect every failure that might prevent establishing the communication.
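A client-side sketch of steps 2 through 8 follows. It assumes the VT_clientinit() signature from the VT API reference (a numeric process ID, a client name, and an in/out contact string); send_contact_to_launcher() is a hypothetical stand-in for whatever mechanism the application uses to deliver the string to the place where VTserver is started:

#include <stdio.h>
#include <stdlib.h>
#include <VT.h>

/* hypothetical application helper: delivers the contact string to the launcher */
extern void send_contact_to_launcher(int rank, const char *contact);

int main(int argc, char **argv)
{
    int myrank = (argc > 1) ? atoi(argv[1]) : 0;
    const char *contact = NULL;

    /* steps 2-4: allocate the TCP/IP port and obtain the contact string */
    if (VT_clientinit(myrank, "client", &contact) != VT_OK)
        return 1;

    /* step 5: the application itself transports the string to the launcher,
       which starts VTserver once all clients are ready */
    send_contact_to_launcher(myrank, contact);

    /* steps 6-7: blocks until the VTserver has established communication */
    if (VT_initialize(&argc, &argv) != VT_OK)
        return 1;

    /* ... traced application work ... */

    /* step 8: after all processes have called VT_finalize(), the VTserver
       writes the trace file and exits (step 9) */
    VT_finalize();
    return 0;
}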

Running without VTserver

Instead of starting VTserver as rank 0 with the contact strings of all application processes, one application process can take over that role. It becomes rank 0 and calls VT_serverinit() with the information normally given to VTserver. This changes the application startup only slightly.

A more fundamental change is supported by first starting one process with rank 0 as server, then taking its contact string and passing it to the other processes. These processes then give this string as the initial value of the contact parameter in VT_clientinit(). To distinguish this kind of startup from the dynamic spawning of processes described in the next section, the application needs to add the prefix S to the string before calling VT_clientinit(). An example where this kind of startup is useful is a process which preforks several child processes to do some work.

In both cases, the command line arguments previously passed to VTserver can be given in the argc/argv arrays, as described in the documentation of VT_initialize().
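The following sketch illustrates both variants under stated assumptions: the VT_serverinit() and VT_clientinit() signatures are taken as given in the VT API reference, collect_contact_strings() is a hypothetical helper for the first variant, and how a preforked child learns the server's contact string is left to the application:

#include <stdio.h>
#include <VT.h>

/* hypothetical helper: gathers the strings returned by VT_clientinit() in the clients */
extern int collect_contact_strings(const char **contacts, int max);

/* Variant 1: one application process takes over the VTserver role as rank 0. */
int run_as_rank0(int argc, char **argv)
{
    const char *contacts[16];
    int numclients = collect_contact_strings(contacts, 16);

    /* connects to all clients and, once everybody has called
       VT_finalize(), writes the trace file */
    if (VT_serverinit("server", numclients, contacts, &argc, &argv) != VT_OK)
        return 1;

    /* ... this process can also do traced work of its own ... */
    VT_finalize();
    return 0;
}

/* Variant 2: a preforked child contacts the already running server process
   directly; the server's contact string is passed with the "S" prefix. */
int run_as_child(int myrank, const char *server_contact, int argc, char **argv)
{
    char prefixed[256];
    const char *contact = prefixed;

    snprintf(prefixed, sizeof(prefixed), "S%s", server_contact);
    if (VT_clientinit(myrank, "client", &contact) != VT_OK)
        return 1;
    if (VT_initialize(&argc, &argv) != VT_OK)
        return 1;
    /* ... traced work ... */
    VT_finalize();
    return 0;
}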

Spawning Processes

Spawning new processes is expensive because it involves setting up TCP communication, clock synchronization, and configuration broadcasting, among other things. Its flexibility is also restricted because the new processes need to be mapped into the model of communicators that provide the context for all communication events. This model follows the one used in MPI and implies that only processes inside the same communicator can communicate at all.

For spawned processes, the following model is currently supported: one of the existing processes starts one or more new processes. These processes need to know the contact string of the spawning process and call VT_clientinit() with that information; in contrast to the startup model from the previous section, no prefix is used. Then, while all spawned processes are inside VT_clientinit(), the spawning process calls VT_attach(), which does all the work required to connect with the new processes.

The results of this operation are:

  • a new VT_COMM_WORLD which contains all of the spawned processes, but not the spawning process

  • a communicator which contains the spawning process and the spawned ones; the spawning process gets it as result from VT_attach() and the spawned processes by calling VT_get_parent()

The first of these communicators can be used to log communication among the spawned processes, and the second for communication with their parent. There is currently no way to log communication with other processes, even if the parent has a communicator that includes them.
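A sketch of this model is shown below. The VT_attach() and VT_get_parent() signatures are assumed from the VT API reference (for VT_attach(): root rank, communicator of the spawning process, number of children, and an output communicator), and parent_contact_string() is a hypothetical stand-in for however the child learns the parent's contact string (for example over a pipe):

#include <unistd.h>
#include <VT.h>

/* hypothetical: how the spawned child learns the parent's contact string */
extern const char *parent_contact_string(void);

void spawn_one_child(int *argc, char ***argv)
{
    int childcomm;

    if (fork() == 0) {
        /* child: contact the spawning process; no "S" prefix here */
        const char *contact = parent_contact_string();
        int parentcomm;

        VT_clientinit(0, "child", &contact);
        VT_initialize(argc, argv);
        VT_get_parent(&parentcomm);   /* communicator shared with the parent */

        /* ... log messages to the parent via parentcomm and to sibling
           processes via the new VT_COMM_WORLD ... */

        VT_finalize();
        _exit(0);
    }

    /* parent: connects to the child while the child waits in VT_clientinit();
       childcomm then contains the spawning process and the spawned one */
    VT_attach(0, VT_COMM_SELF, 1, &childcomm);
}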

Tracing Events

Once a process' call to VT_initialize() has completed successfully, it can start calling VT API functions that log events. These events will be associated with a time stamp generated by Intel® Trace Collector and with the thread that calls the function.

Should the need arise, VT API functions could be provided that allow one thread to log events from several different sources instead of just itself.

Event types supported at the moment are those also provided by the normal Intel Trace Collector, such as state changes (VT_enter(), VT_leave()) and sending and receiving of data (VT_log_sendmsg(), VT_log_recvmsg()). The resulting trace file is in a format that can be loaded and analyzed with the Intel Trace Analyzer GUI.
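For illustration, the following sketch logs a state change and a matching message pair; the state name, byte count, tag, and peer rank are made up, and VT_NOCLASS and VT_NOSCL are the default class and source code location constants of the VT API:

#include <VT.h>

void traced_exchange(int peer_rank)
{
    static int funchandle = 0;

    /* define the state once; VT_NOCLASS places it in the default class */
    if (!funchandle)
        VT_funcdef("exchange", VT_NOCLASS, &funchandle);

    VT_enter(funchandle, VT_NOSCL);                      /* state change: enter */

    /* the application exchanges 1024 bytes with tag 1; only the logging
       of the send and the matching receive is shown here */
    VT_log_sendmsg(peer_rank, 1024, 1, VT_COMM_WORLD, VT_NOSCL);
    VT_log_recvmsg(peer_rank, 1024, 1, VT_COMM_WORLD, VT_NOSCL);

    VT_leave(VT_NOSCL);                                  /* state change: leave */
}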

Usage

Executables in the application are linked with -lVTcs and $VT_ADD_LIBS. It is possible to have processes implemented in different languages, as long as they use the same version of libVTcs.

The VTserver has the following synopsis:

VTserver <contact infos> [config options]

Each contact info is guaranteed to be one word, and their order on the command line is irrelevant. A configuration option can be specified on the command line by adding the prefix -- to its keyword and listing its arguments after the keyword. This is an example for contacting two processes and writing into the file example.stf in STF format:

VTserver <contact1> <contact2> --logfile-name example.stf

All options can be given as environment variables. The format of the configuration file and the environment variables are described in more detail in the chapter about VT_CONFIG.
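For example, the log file name from the command line above could also be set through the environment before starting VTserver (assuming a POSIX shell; the environment name is the option keyword in upper case, with dashes replaced by underscores and the VT_ prefix added):

export VT_LOGFILE_NAME=example.stf
VTserver <contact1> <contact2>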

Signals

libVTcs uses the same techniques as fail-safe MPI tracing to handle failures inside the application; therefore, it will generate a trace even if the application segfaults or is aborted with Ctrl + C.

When only one process runs into a problem, libVTcs tries to notify the other processes, which should then stop their normal work and enter trace file writing mode. If this fails and the application hangs, it might still be possible to generate a trace by sending a SIGINT to all processes manually.

Examples

There are two examples that use MPI as the means of communication and process handling. However, since they are not linked against the normal Intel Trace Collector library, tracing of MPI has to be done with Intel Trace Collector API calls.

clientserver.c is a full-blown example that simulates and handles various error conditions. It uses threads and fork/exec to run API functions and VTserver concurrently. simplecs.c is a stripped down version that is easier to read, but does not check for errors.

The dynamic spawning of processes is demonstrated by forkcs.c. It first initializes one process as server with no clients, then forks to create new processes and connects to them with VT_attach(). This is repeated recursively. Communication is done through pipes and logged in the new communicators.

forkcs2.c is a variation of the previous example which also uses fork and pipes, but creates the additional processes at the beginning without relying on dynamic spawning.