[mpich-discuss] Problems with computation-communication overlap in non-blocking mode

Kenneth Raffenetti raffenet at mcs.anl.gov
Mon Mar 10 11:30:02 CDT 2014


Hi Nikola,

I've added my responses inline.

On 03/07/2014 08:50 AM, Velickovic Nikola wrote:
>
> Dear all,
>
> I have a simple MPI program with two processes using non-blocking communication, illustrated below:
>
> process 0:         process 1:
>
> MPI_Isend          MPI_Irecv
>
> compute stage      compute stage
>
> MPI_Wait           MPI_Wait
>
> The actual communication is performed by offloading it to another thread, or by using DMA (the KNEM module is used for this).
> Ideally, what should happen is that process 0 issues a non-blocking send, process 1 receives the data,
> and in the meantime (in parallel) the CPU cores where the processes run are doing the compute stage.
> When the compute stage is completed, calling MPI_Wait wraps up the communication.
>
> When I profile my application, it turns out that the actual communication is initiated by MPI_Wait (a significant amount of time is spent there), which prevents overlapping
> communication and computation, since MPI_Wait is called after the compute stage.
> Computation in my test case takes more time than communication, so MPI_Wait should not consume a significant amount of time; the communication should be over by then.
>

This is allowed by the MPI standard. Non-blocking communication calls do 
not guarantee asynchronous progress.
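
For reference, the pattern you describe boils down to something like the sketch below (the message size, tag, and dummy compute loop are placeholders, not taken from your program):

#include <mpi.h>

#define COUNT 1000000   /* placeholder message size */

/* Stands in for the compute stage. */
static void compute(void)
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += 1.0;
}

int main(int argc, char **argv)
{
    static double buf[COUNT];
    int rank;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Isend(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    compute();   /* without progress calls, the transfer may not advance here */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* so much of the transfer can end up here */

    MPI_Finalize();
    return 0;
}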

> I also confirmed this by using MPI_Test instead of MPI_Wait.
> MPI_Test has the same effect as MPI_Wait (to the best of my knowledge) but is non-blocking.
> When I place MPI_Test strategically in the compute stage, it initiates the communication and a certain degree of communication-computation overlap is achieved.

Using MPI_Test is a valid way to invoke progress. MPICH also provides an 
environment variable (MPIR_CVAR_ASYNC_PROGRESS) that you can set to true 
in your program env to cause MPICH to make progress on your non-blocking 
communication, without having to modify your code. Yet another way would 
be to run as a multi-threaded MPI program and use one thread for 
communication and one for computation.
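
As an illustration of the MPI_Test route, the compute stage could be chunked and the request tested between chunks, roughly like this (compute_piece() and NCHUNKS are placeholders for however your compute stage is actually structured):

#include <mpi.h>

#define NCHUNKS 100                  /* placeholder: number of compute slices */

static void compute_piece(int i)     /* stands in for one slice of the compute stage */
{
    (void) i;
    /* ... real work ... */
}

/* Sketch: drive MPICH's progress engine from inside the compute stage. */
static void compute_with_progress(MPI_Request *req)
{
    int done = 0;
    for (int i = 0; i < NCHUNKS; i++) {
        compute_piece(i);
        if (!done)
            MPI_Test(req, &done, MPI_STATUS_IGNORE);   /* pokes the progress engine */
    }
    if (!done)
        MPI_Wait(req, MPI_STATUS_IGNORE);              /* finish whatever is left */
}

You would call compute_with_progress(&req) in place of the plain compute()/MPI_Wait pair in the earlier sketch. The asynchronous-progress route needs no code change at all; it is just a matter of the environment, e.g. something along the lines of

MPIR_CVAR_ASYNC_PROGRESS=1 mpiexec -n 2 ./your_app

(note that this typically dedicates an extra progress thread per process).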

Ken


