[mpich-discuss] How to use non-blocking send/receive without calling MPI_Wait
Huiwei Lu
huiweilu at mcs.anl.gov
Tue Apr 7 10:12:50 CDT 2015
Hi Lei,
Your profiling is correct. MPI_Isend/Irecv only start the communication;
the real progress is made in MPI_Waitall. If you want to overlap
communication with ResFromDivInvisFlux, you can put MPI_Test calls inside
ResFromDivInvisFlux to force MPI to make progress periodically.
Another option is to use an asynchronous progress thread. Check
MPIR_CVAR_ASYNC_PROGRESS for more information.
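
For example, a rough sketch of that kind of polling (the chunking of the
local work is only illustrative, and compute_with_progress is a placeholder
name; the request array and count would be your r_sol_req_ and
n_proc_exchange_):

    #include <mpi.h>

    /* Sketch: interleave chunks of local work with MPI_Testall so the
       library gets regular chances to advance the pending transfers. */
    void compute_with_progress(MPI_Request *reqs, int n_reqs)
    {
      const int n_chunk = 8;              /* assumed split of the local work */
      for (int c = 0; c < n_chunk; ++c) {
        /* ... do one chunk of the local residual computation here ... */

        int flag = 0;                     /* value ignored; the call only drives progress */
        if (n_reqs > 0)
          MPI_Testall(n_reqs, reqs, &flag, MPI_STATUSES_IGNORE);
      }
    }

The MPI_Waitall you already have still guarantees completion; the
MPI_Testall calls just keep the data moving while you compute.
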
Best regards,
--
Huiwei
On Tue, Apr 7, 2015 at 2:48 AM, Lei Shi <lshi at ku.edu> wrote:
> Here is my pure MPI overlap version. I used Intel Trace Analyzer, and the
> profiling shows that right now communication only proceeds when I call
> MPI_Waitall, on nodes with a 10G network.
>
> /** pure mpi overlap **/
> template<typename T>
> void CPR_NS_3D_Solver<T>::UpdateRes(T** q, T** res){
>   if(_n_proc>1)
>     SendInterfaceSol();          //call isend/irecv to send msg 1
>
>   ResFromDivInvisFlux(q,res);    //do local jobs
>
>   if(_n_proc>1){
>     RevInterfaceSol();           //mpi_waitall for msg 1
>     if(vis_mode_)
>       SendInterfaceCorrGrad();   //depends on msg 1, then send msg 2
>   }
>
>   if(vis_mode_)
>     ResFromDivVisFlux(q,res);    //computing, which depends on msg 1
>
>   if(_n_proc>1 && vis_mode_)
>     RevInterfaceCorrGrad();      //mpi_waitall for msg 2
>
>   ResFromFluxCorrection(q,res);  //computing, which depends on msg 1 and 2
> }
>
>
> On Tue, Apr 7, 2015 at 2:39 AM, Lei Shi <lshi at ku.edu> wrote:
>
>>
>>
>> On Tue, Apr 7, 2015 at 2:37 AM, Lei Shi <leishi at ku.edu> wrote:
>>
>>> Hi Huiwei and Jeff,
>>>
>>> I use hybrid OpenMP/MPI to overlap communication. I put all
>>> communication in one dedicated OpenMP thread and the computation in the
>>> other thread. For this case I'm using the Intel MPI library. Probably I
>>> made some mistakes.
>>>
>>> One version of my code, using one dedicated thread to do the messaging,
>>> is like this:
>>>
>>> /** hybrid mpi/openmp overlap **/
>>> template<typename T>
>>> void CPR_NS_3D_Solver<T>::UpdateRes(T** q, T** res){
>>>   int thread_id, n_thread;
>>>   int sol_rev_flag=0, grad_rev_flag=0;
>>>
>>>   // Explicitly disable dynamic teams
>>>   omp_set_dynamic(0);
>>>   // Use 2 threads for all consecutive parallel regions
>>>   omp_set_num_threads(2);
>>> #pragma omp parallel default(shared) private(thread_id)
>>>   {
>>>     thread_id=omp_get_thread_num();
>>>     n_thread=omp_get_num_threads();
>>>
>>>     /** communication thread **/
>>>     if(thread_id==1){
>>>       SendInterfaceSol();
>>>       RevInterfaceSol();
>>> #pragma omp flush
>>>       sol_rev_flag=1;
>>> #pragma omp flush(sol_rev_flag)
>>>     }
>>>
>>>     /** computation thread **/
>>>     if(thread_id==0){
>>>       ResFromDivInvisFlux(q,res);    //local computation
>>> #pragma omp flush(sol_rev_flag)
>>>       while(sol_rev_flag!=1){        //spin until the comm thread sets the flag
>>> #pragma omp flush(sol_rev_flag)
>>>       }
>>> #pragma omp flush
>>>       ResFromFluxCorrection(q,res);  //depends on interface sol
>>>     }
>>>   }//end of omp
>>> }
>>>
>>> template<typename T>
>>>
>>> void CPR_NS_3D_Solver<T>::SendInterfaceSol(){
>>>   uint *n_if_to_proc=this->grid_->num_iface_proc;
>>>   uint **if_to_proc=this->grid_->snd_iface_proc;
>>>   uint **rev_if_to_f=this->grid_->rev_iface_proc;
>>>
>>>   int tag=52;
>>>   for(int p2=0;p2<_n_proc;++p2){
>>>     if(p2!=_proc_id){
>>>       int nif=n_if_to_proc[p2];
>>>       //pack data to send ....
>>>     }
>>>   }
>>>
>>>   /** Exchange interface sol **/
>>>   n_proc_exchange_=0;   //member counter, also used later by RevInterfaceSol
>>>   for(int z=0;z<_n_proc;++z){
>>>     int nif=n_if_to_proc[z];
>>>
>>>     //send data
>>>     if(nif>0){
>>>       MPI_Isend(&snd_buf_[z][0],n_buf_[z],MPI_DOUBLE,z,tag,
>>>                 MPI_COMM_WORLD,&s_sol_req_[n_proc_exchange_]);
>>>       MPI_Irecv(&rev_buf_[z][0],n_buf_[z],MPI_DOUBLE,z,tag,
>>>                 MPI_COMM_WORLD,&r_sol_req_[n_proc_exchange_]);
>>>       n_proc_exchange_++;
>>>     }
>>>   }
>>> }
>>>
>>> template<typename T>
>>> void CPR_NS_3D_Solver<T>::RevInterfaceSol(){
>>>   uint *n_if_to_proc=this->grid_->num_iface_proc;
>>>   uint **if_to_proc=this->grid_->snd_iface_proc;
>>>   uint **rev_if_to_f=this->grid_->rev_iface_proc;
>>>
>>>   //wait for the pending sends/receives posted in SendInterfaceSol
>>>   if(n_proc_exchange_>0){
>>>     MPI_Waitall(n_proc_exchange_,s_sol_req_,MPI_STATUSES_IGNORE);
>>>     MPI_Waitall(n_proc_exchange_,r_sol_req_,MPI_STATUSES_IGNORE);
>>>   }
>>>
>>>   /** store to local data structure **/
>>>   for(int z=0;z<_n_proc;++z){
>>>     int nif=n_if_to_proc[z];
>>>
>>>     if(nif>0){
>>>       //unpacking ....
>>>     }
>>>   }
>>> }
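>>>
>>> (Note: this scheme assumes MPI was initialized with at least
>>> MPI_THREAD_SERIALIZED, since the MPI calls come from an OpenMP worker
>>> thread rather than the thread that initialized MPI; MPI_THREAD_MULTIPLE
>>> is safer still. A minimal, purely illustrative initialization:)
>>>
>>> #include <mpi.h>
>>> #include <cstdio>
>>>
>>> int main(int argc, char **argv){
>>>   int provided = MPI_THREAD_SINGLE;
>>>   // Request full thread support so the communication thread may call MPI freely.
>>>   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>>   if(provided < MPI_THREAD_SERIALIZED){
>>>     std::fprintf(stderr, "MPI thread support too low for a dedicated comm thread\n");
>>>     MPI_Abort(MPI_COMM_WORLD, 1);
>>>   }
>>>   // ... set up the solver and call UpdateRes as above ...
>>>   MPI_Finalize();
>>>   return 0;
>>> }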
>>>
>>> Sincerely Yours,
>>>
>>> Lei Shi
>>> ---------
>>>
>>> On Fri, Apr 3, 2015 at 4:37 PM, Jeff Hammond <jeff.science at gmail.com>
>>> wrote:
>>>
>>>> As far as I know, Ethernet is not good at making asynchronous progress
>>>> in hardware the way e.g. InfiniBand is. I would have thought that a
>>>> dedicated progress thread would help, but it seems you tried that. Did you
>>>> use your own progress thread or MPICH_ASYNC_PROGRESS=1?
>>>>
>>>> Jeff
>>>>
>>>> On Fri, Apr 3, 2015 at 10:10 AM, Lei Shi <lshi at ku.edu> wrote:
>>>>
>>>>> Huiwei,
>>>>>
>>>>> Thanks for your email. Your answer leads to another question of mine
>>>>> about asynchronous MPI communication.
>>>>>
>>>>> I'm trying to overlap communication with computation to speed up my
>>>>> MPI code. I read some papers comparing different approaches to
>>>>> overlapped communication: the "naive" implementation, which only uses
>>>>> non-blocking MPI Isend/Irecv, and the hybrid approach using OpenMP and
>>>>> MPI together. In the hybrid approach, a separate thread is used to do
>>>>> all non-blocking communication. Exactly as you said, the results
>>>>> indicate that current MPI implementations do not support true
>>>>> asynchronous communication.
>>>>>
>>>>> If I use the naive approach, my code with non-blocking or blocking
>>>>> send/recv gives me almost the same performance in terms of Wtime. All
>>>>> communication is postponed to MPI_Wait.
>>>>>
>>>>> I have tried calling MPI_Test to push the library to make progress
>>>>> during the iterations, and I have also tried using a dedicated thread
>>>>> for communication while the other thread only computes. However, the
>>>>> performance gains are very small or there is no gain at all. I'm
>>>>> wondering whether it is due to the hardware. The cluster I tested on
>>>>> uses 10G Ethernet cards.
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Lei Shi
>>>>>
>>>>> On Fri, Apr 3, 2015 at 8:49 AM, Huiwei Lu <huiweilu at mcs.anl.gov>
>>>>> wrote:
>>>>>
>>>>>> Hi Lei,
>>>>>>
>>>>>> As far as I know, no current MPI implementation supports true
>>>>>> asynchronous communication for now; i.e., if there are no MPI calls in
>>>>>> your iterations, MPICH will not be able to make progress on the
>>>>>> communication.
>>>>>>
>>>>>> One solution is to poll the MPI runtime regularly to make progress by
>>>>>> inserting MPI_Test into your iteration (even though you do not want to
>>>>>> check the data).
>>>>>>
>>>>>> Another solution is to enable MPI's asynchronous progress thread to
>>>>>> make progress for you.
>>>>>>
>>>>>> --
>>>>>> Huiwei
>>>>>>
>>>>>> On Thu, Apr 2, 2015 at 11:44 PM, Lei Shi <lshi at ku.edu> wrote:
>>>>>>
>>>>>>> Hi Junchao,
>>>>>>>
>>>>>>> Thanks for your reply. In my case, I don't want to check whether the
>>>>>>> data have been received or not, so I don't want to call MPI_Test or any
>>>>>>> other function to verify it. But the problem is that if I skip calling
>>>>>>> MPI_Wait and just call Isend/Irecv, my program freezes for several
>>>>>>> seconds and then continues to run. My guess is that I messed up the MPI
>>>>>>> library's internal buffers by doing this.
>>>>>>>
>>>>>>> On Thu, Apr 2, 2015 at 7:25 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Does MPI_Test fit your needs?
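>>>>>>>>
>>>>>>>> (For reference, a minimal sketch of such a check; poll_once and req
>>>>>>>> are placeholder names:)
>>>>>>>>
>>>>>>>> #include <mpi.h>
>>>>>>>>
>>>>>>>> // 'req' would be a request returned by MPI_Isend/MPI_Irecv.
>>>>>>>> void poll_once(MPI_Request *req)
>>>>>>>> {
>>>>>>>>   int flag = 0;
>>>>>>>>   // Returns immediately; sets flag to 1 once the request has completed,
>>>>>>>>   // and in any case gives the library a chance to make internal progress.
>>>>>>>>   MPI_Test(req, &flag, MPI_STATUS_IGNORE);
>>>>>>>> }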
>>>>>>>>
>>>>>>>> --Junchao Zhang
>>>>>>>>
>>>>>>>> On Thu, Apr 2, 2015 at 7:16 PM, Lei Shi <lshi at ku.edu> wrote:
>>>>>>>>
>>>>>>>>> I want to use non-blocking send/recv (MPI_Isend/MPI_Irecv) for
>>>>>>>>> communication. In my case, I don't really care what data I get or
>>>>>>>>> whether it is ready to use or not, so I don't want to waste time on
>>>>>>>>> synchronization by calling MPI_Wait or similar APIs.
>>>>>>>>>
>>>>>>>>> But when I avoid calling MPI_Wait, my program freezes for several
>>>>>>>>> seconds after running some iterations (after multiple MPI_Isend/Irecv
>>>>>>>>> calls), then continues. It takes even more time than the case with
>>>>>>>>> MPI_Wait. So my question is how to do "true" non-blocking communication
>>>>>>>>> without waiting for the data to be ready. Thanks.
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> jeff.science at gmail.com
>>>> http://jeffhammond.github.io/
>>>>
>>>
>>>
>>
>