Get Started with Intel® Trace Analyzer and Collector

ID 758510
Date 10/31/2024
Public

Improve Your Application Performance by Changing Communications

Improve the performance of the MPI application by replacing blocking communication with non-blocking communication.

  1. In your code, replace the serialized blocking MPI_Send and MPI_Recv calls with non-blocking communication: MPI_Isend and MPI_Irecv. For example:

    Original code snippet:

    // boundary exchange
    void exchange(para* p, grid* gr)
    {
      int i,j;
      MPI_Status status_100, status_200, status_300, status_400;
      // send down first row
      MPI_Send(gr->x_new[1], gr->lcol+2, MPI_DOUBLE, gr->down, 100, MPI_COMM_WORLD);
      MPI_Recv(gr->x_new[gr->lrow+1], gr->lcol+2, MPI_DOUBLE, gr->up, 100, MPI_COMM_WORLD, &status_100);
      // send up last row
      MPI_Send(gr->x_new[gr->lrow], gr->lcol+2, MPI_DOUBLE, gr->up, 200, MPI_COMM_WORLD);
      MPI_Recv(gr->x_new[0], gr->lcol+2, MPI_DOUBLE, gr->down, 200, MPI_COMM_WORLD, &status_200);
      // copy left column to tmp arrays
      if(gr->left != MPI_PROC_NULL)
      {
        for(i=0; i< gr->lrow+2; i++)
        {
          left_col[i] = gr->x_new[i][1];
        }
        MPI_Send(left_col, gr->lrow+2, MPI_DOUBLE, gr->left, 300, MPI_COMM_WORLD);
      }
      if(gr->right != MPI_PROC_NULL)
      {
        MPI_Recv(right_col, gr->lrow+2, MPI_DOUBLE, gr->right, 300, MPI_COMM_WORLD, &status_300);
        // copy received left column to ghost cells
        // copy right column to tmp
        for(i=0; i< gr->lrow+2; i++)
        {
          gr->x_new[i][gr->lcol+1] = right_col[i];
          right_col[i] = gr->x_new[i][gr->lcol];
        }
        // send right
        MPI_Send(right_col, gr->lrow+2, MPI_DOUBLE, gr->right, 400, MPI_COMM_WORLD);
      }
      if(gr->left != MPI_PROC_NULL)
      {
        MPI_Recv(left_col, gr->lrow+2, MPI_DOUBLE, gr->left, 400, MPI_COMM_WORLD, &status_400);
        for(i=0; i< gr->lrow+2; i++)
        {
          gr->x_new[i][0] = left_col[i];
        }
      }
    }
    

    Updated code snippet:

    MPI_Request req[7];
    // send down first row
    MPI_Isend(gr->x_new[1], gr->lcol+2, MPI_DOUBLE, gr->down, 100, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(gr->x_new[gr->lrow+1], gr->lcol+2, MPI_DOUBLE, gr->up, 100, MPI_COMM_WORLD, &req[1]);
    .....
    MPI_Waitall(7, req, MPI_STATUSES_IGNORE);
    
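    For reference, the following sketch shows one possible shape of the complete non-blocking exchange. It is illustrative only, not the exact code of the sample: it assumes the same grid layout as above, introduces hypothetical staging buffers (send_left, send_right, recv_left, recv_right, assumed to be allocated elsewhere with at least lrow+2 elements each) so that column data packed for sending is not overwritten by data arriving while the requests are in flight, and it sizes the request array for the maximum of eight requests, waiting only on those actually posted.

    // Sketch only: send_left, send_right, recv_left, recv_right are
    // hypothetical staging buffers declared elsewhere.
    void exchange(para* p, grid* gr)
    {
      int i, nreq = 0;
      MPI_Request req[8];
      // rows are contiguous in memory, so they are sent and received in place
      MPI_Isend(gr->x_new[1], gr->lcol+2, MPI_DOUBLE, gr->down, 100, MPI_COMM_WORLD, &req[nreq++]);
      MPI_Irecv(gr->x_new[gr->lrow+1], gr->lcol+2, MPI_DOUBLE, gr->up, 100, MPI_COMM_WORLD, &req[nreq++]);
      MPI_Isend(gr->x_new[gr->lrow], gr->lcol+2, MPI_DOUBLE, gr->up, 200, MPI_COMM_WORLD, &req[nreq++]);
      MPI_Irecv(gr->x_new[0], gr->lcol+2, MPI_DOUBLE, gr->down, 200, MPI_COMM_WORLD, &req[nreq++]);
      // columns are strided, so pack them into send buffers before posting
      if(gr->left != MPI_PROC_NULL)
      {
        for(i=0; i< gr->lrow+2; i++)
          send_left[i] = gr->x_new[i][1];
        MPI_Isend(send_left, gr->lrow+2, MPI_DOUBLE, gr->left, 300, MPI_COMM_WORLD, &req[nreq++]);
        MPI_Irecv(recv_left, gr->lrow+2, MPI_DOUBLE, gr->left, 400, MPI_COMM_WORLD, &req[nreq++]);
      }
      if(gr->right != MPI_PROC_NULL)
      {
        for(i=0; i< gr->lrow+2; i++)
          send_right[i] = gr->x_new[i][gr->lcol];
        MPI_Isend(send_right, gr->lrow+2, MPI_DOUBLE, gr->right, 400, MPI_COMM_WORLD, &req[nreq++]);
        MPI_Irecv(recv_right, gr->lrow+2, MPI_DOUBLE, gr->right, 300, MPI_COMM_WORLD, &req[nreq++]);
      }
      // wait for all posted requests before touching the buffers
      MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
      // unpack the received columns into the ghost cells
      if(gr->left != MPI_PROC_NULL)
        for(i=0; i< gr->lrow+2; i++)
          gr->x_new[i][0] = recv_left[i];
      if(gr->right != MPI_PROC_NULL)
        for(i=0; i< gr->lrow+2; i++)
          gr->x_new[i][gr->lcol+1] = recv_right[i];
    }

    Because every send and receive is posted before the single MPI_Waitall, the row and column exchanges can proceed concurrently instead of completing one blocking call at a time.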

    Once corrected, a single iteration of the revised application looks like the following example:

  2. Use the Intel Trace Analyzer Comparison View to compare the trace of the serialized application with the trace of the revised one (a sample command for collecting the new trace is shown below). To compare the two traces, go to View > Compare. The Comparison View looks similar to:

    In the Comparison View, you can see that using non-blocking communication removes the serialization and reduces the time the processes spend communicating.
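
    To produce the second trace for this comparison, collect a trace of the revised application in the same way you collected the original one. For example, with the Intel MPI Library you can pass the -trace option to mpirun to enable trace collection; the process count and application name below are placeholders:

    # collect a trace of the revised application; this produces an .stf trace file
    mpirun -trace -n 4 ./your_revised_app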

NOTE:
For more information about node-level performance of your application, see documentation for the respective tools: Intel® VTune™ Profiler MPI Code Analysis and Analyzing Intel® MPI applications using Intel® Advisor.