[mpich-discuss] How to use non-blocking send/receive without calling MPI_Wait

Lei Shi lshi at ku.edu
Tue Apr 7 14:17:05 CDT 2015


Hi Huiwei and Jeff,

Thanks a lot. I will try the ideas you both suggested on some test code
first and profile it at the same time to verify them. When I get a chance
to run my code on our InfiniBand nodes, I will repeat those tests and let
you know the difference. Thanks again!
Best,

Lei Shi

On Tue, Apr 7, 2015 at 10:12 AM, Huiwei Lu <huiweilu at mcs.anl.gov> wrote:

> Hi Lei,
>
> Your profiling is correct. MPI_Isend/Irecv only start the communication;
> the real progress is made in MPI_Waitall. If you want to overlap
> communication with ResFromDivInvisFlux, you can call MPI_Test inside
> ResFromDivInvisFlux to force MPI to make progress periodically.
>
> Another option is to use an asynchronous progress thread. Check MPIR_CVAR_ASYNC_PROGRESS
> for more information.
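>
> A minimal sketch of the MPI_Test idea, assuming the local residual
> computation can be split into chunks (the chunk loop, n_chunks, and the
> do_local_chunk() helper below are hypothetical, not part of your code):
>
>   #include <mpi.h>
>
>   void do_local_chunk(int chunk);   // hypothetical slice of ResFromDivInvisFlux
>
>   // Overlap one neighbor exchange with chunked local work.
>   void exchange_with_overlap(double *sendbuf, double *recvbuf, int count,
>                              int peer, int tag, MPI_Comm comm, int n_chunks) {
>     MPI_Request req[2];
>     MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, tag, comm, &req[0]);
>     MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, tag, comm, &req[1]);
>
>     int done = 0;
>     for (int c = 0; c < n_chunks; ++c) {
>       do_local_chunk(c);                                  // local computation
>       if (!done)
>         MPI_Testall(2, req, &done, MPI_STATUSES_IGNORE);  // drives MPI progress
>     }
>     if (!done)
>       MPI_Waitall(2, req, MPI_STATUSES_IGNORE);           // finish what is left
>   }
>
> Each MPI_Testall call gives the library a chance to move data while the
> compute work is going on, so the final MPI_Waitall only pays for whatever
> has not completed yet.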
>
> Best regards,
> --
> Huiwei
>
> On Tue, Apr 7, 2015 at 2:48 AM, Lei Shi <lshi at ku.edu> wrote:
>
>>  Here is my pure MPI overlap version. I used Intel Trace Analyzer, and the
>> profiling shows that right now communication only proceeds when I call
>> MPI_Waitall on nodes with a 10G network.
>>
>> /** pure mpi overlap  **/
>>   template<typename T>
>>   void CPR_NS_3D_Solver<T>::UpdateRes(T**q, T**res){
>>     if(_n_proc>1)
>>       SendInterfaceSol(); //call isend/irecv to send msg 1
>>
>>     ResFromDivInvisFlux(q,res); //do local jobs
>>
>>     if(_n_proc>1){
>>       RevInterfaceSol(); //mpi_waitall for msg 1
>>       if(vis_mode_)
>>         SendInterfaceCorrGrad(); //depends on msg 1 then snd msg 2
>>     }
>>
>>     if(vis_mode_)
>>       ResFromDivVisFlux(q,res); //computing, which depends on msg 1
>>
>>     if(_n_proc>1 && vis_mode_)
>>       RevInterfaceCorrGrad(); //mpi_waitall for msg 2
>>
>>     ResFromFluxCorrection(q,res); //computing, which depends on msg 1 and 2
>>   }
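>>
>> A possible refinement here (a hedged sketch with hypothetical names, not
>> your actual member variables): instead of one MPI_Waitall over all
>> neighbors, MPI_Waitany lets you unpack each neighbor's buffer as soon as
>> its message arrives, so unpacking overlaps with the remaining transfers.
>>
>>   #include <mpi.h>
>>
>>   // recv_reqs[i] / recv_bufs[i]: hypothetical request and buffer for the
>>   // i-th neighbor that actually exchanges data (n_exchange of them).
>>   void unpack_neighbor(int i, double **recv_bufs);   // hypothetical helper
>>
>>   void complete_receives(int n_exchange, MPI_Request *recv_reqs,
>>                          MPI_Request *send_reqs, double **recv_bufs) {
>>     for (int k = 0; k < n_exchange; ++k) {
>>       int idx;
>>       MPI_Waitany(n_exchange, recv_reqs, &idx, MPI_STATUS_IGNORE);
>>       unpack_neighbor(idx, recv_bufs);                // unpack as it arrives
>>     }
>>     MPI_Waitall(n_exchange, send_reqs, MPI_STATUSES_IGNORE); // then the sends
>>   }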
>>
>>
>> On Tue, Apr 7, 2015 at 2:39 AM, Lei Shi <lshi at ku.edu> wrote:
>>
>>>
>>>
>>> On Tue, Apr 7, 2015 at 2:37 AM, Lei Shi <leishi at ku.edu> wrote:
>>>
>>>> Hi Huiwei and Jeff,
>>>>
>>>> I use hybrid OpenMP/MPI to overlap communication: I put all
>>>> communication in one dedicated OpenMP thread and the computation in the
>>>> other thread. For this case I'm using the Intel MPI library. I probably
>>>> made some mistakes.
>>>>
>>>> One version of my code, using one dedicated thread to do the messaging,
>>>> looks like this:
>>>>
>>>> /** hybrid mpi/openmp overlap **/
>>>> template<typename T>
>>>> void CPR_NS_3D_Solver<T>::UpdateRes(T**q, T**res){
>>>>   int thread_id,n_thread;
>>>>   int sol_rev_flag=0,grad_rev_flag=0;
>>>>
>>>>   // Explicitly disable dynamic teams
>>>>   omp_set_dynamic(0);
>>>>   // Use 2 threads for all consecutive parallel regions
>>>>   omp_set_num_threads(2);
>>>>
>>>> #pragma omp parallel default(shared) private(thread_id)
>>>>   {
>>>>     thread_id=omp_get_thread_num();
>>>>     n_thread=omp_get_num_threads();
>>>>
>>>>     /** communication thread **/
>>>>     if(thread_id==1){
>>>>       SendInterfaceSol();
>>>>       RevInterfaceSol();
>>>> #pragma omp flush
>>>>       sol_rev_flag=1;
>>>> #pragma omp flush(sol_rev_flag)
>>>>     }
>>>>
>>>>     /** computation thread **/
>>>>     if(thread_id==0){
>>>>       ResFromDivInvisFlux(q,res); //local computation
>>>> #pragma omp flush(sol_rev_flag)
>>>>       while(sol_rev_flag!=1){ //spin until the communication thread is done
>>>> #pragma omp flush(sol_rev_flag)
>>>>       }
>>>> #pragma omp flush
>>>>       ResFromFluxCorrection(q,res); //depends on interface sol
>>>>     }
>>>>   }//end of omp
>>>> }
>>>>
>>>>   template<typename T>
>>>>   void CPR_NS_3D_Solver<T>::SendInterfaceSol(){
>>>>     uint *n_if_to_proc=this->grid_->num_iface_proc;
>>>>     uint **if_to_proc=this->grid_->snd_iface_proc;
>>>>     uint **rev_if_to_f=this->grid_->rev_iface_proc;
>>>>
>>>>     int tag=52;
>>>>     for(int p2=0;p2<_n_proc;++p2){
>>>>       if(p2!=_proc_id){
>>>>         int nif=n_if_to_proc[p2];
>>>>         //pack data to send ....
>>>>
>>>>       }
>>>>     }
>>>>
>>>>     /** Exchange interface sol **/
>>>>     int n_proc_exchange=0;
>>>>     for(int z=0;z<_n_proc;++z){
>>>>       int nif=n_if_to_proc[z];
>>>>
>>>>       //send data
>>>>       if(nif>0){
>>>>         MPI_Isend(&snd_buf_[z][0],n_buf_[z],MPI_DOUBLE,z,tag,MPI_COMM_WORLD,&s_sol_req_[n_proc_exchange]);
>>>>         MPI_Irecv(&rev_buf_[z][0],n_buf_[z],MPI_DOUBLE,z,tag,MPI_COMM_WORLD,&r_sol_req_[n_proc_exchange]);
>>>>         n_proc_exchange++;
>>>>       }
>>>>     }
>>>>
>>>>   }
>>>>
>>>>   template<typename T>
>>>>   void CPR_NS_3D_Solver<T>::RevInterfaceSol(){
>>>>     uint *n_if_to_proc=this->grid_->num_iface_proc;
>>>>     uint **if_to_proc=this->grid_->snd_iface_proc;
>>>>     uint **rev_if_to_f=this->grid_->rev_iface_proc;
>>>>
>>>>     //wait for the sends and receives posted in SendInterfaceSol
>>>>     if(n_proc_exchange_>0){
>>>>       //MPI_Waitall takes an array of statuses, so the ignore constant
>>>>       //must be MPI_STATUSES_IGNORE, not MPI_STATUS_IGNORE
>>>>       MPI_Waitall(n_proc_exchange_,s_sol_req_,MPI_STATUSES_IGNORE);
>>>>       MPI_Waitall(n_proc_exchange_,r_sol_req_,MPI_STATUSES_IGNORE);
>>>>     }
>>>>
>>>>     /** store to local data structure **/
>>>>     for(int z=0;z<_n_proc;++z){
>>>>       int nif=n_if_to_proc[z];
>>>>
>>>>       if(nif>0){
>>>>         //unpacking ....
>>>>       }
>>>>     }
>>>>   }
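>>>>
>>>> One more thing worth checking for the dedicated-thread version (a hedged
>>>> note, not specific to your setup): since a non-main OpenMP thread issues
>>>> all of the MPI calls, the library has to be initialized with a sufficient
>>>> thread level, at least MPI_THREAD_SERIALIZED here, or MPI_THREAD_MULTIPLE
>>>> to be safe. A minimal sketch:
>>>>
>>>>   #include <mpi.h>
>>>>   #include <cstdio>
>>>>
>>>>   int main(int argc, char **argv){
>>>>     int provided = 0;
>>>>     // Ask for full thread support so any OpenMP thread may call MPI.
>>>>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>>>     if (provided < MPI_THREAD_MULTIPLE) {
>>>>       std::fprintf(stderr, "only thread level %d is provided\n", provided);
>>>>       MPI_Abort(MPI_COMM_WORLD, 1);
>>>>     }
>>>>
>>>>     // ... construct the solver and call UpdateRes() here ...
>>>>
>>>>     MPI_Finalize();
>>>>     return 0;
>>>>   }
>>>>
>>>> If the library only reports MPI_THREAD_FUNNELED, the MPI calls would have
>>>> to stay on the thread that initialized MPI.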
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Sincerely Yours,
>>>>
>>>> Lei Shi
>>>> ---------
>>>>
>>>> On Fri, Apr 3, 2015 at 4:37 PM, Jeff Hammond <jeff.science at gmail.com>
>>>> wrote:
>>>>
>>>>> As far as I know, Ethernet is not good at making asynchronous progress
>>>>> in hardware the way e.g. InfiniBand is.  I would have thought that a
>>>>> dedicated progress thread would help, but it seems you tried that.  Did you
>>>>> use your own progress thread or MPICH_ASYNC_PROGRESS=1?
>>>>>
>>>>> Jeff
>>>>>
>>>>> On Fri, Apr 3, 2015 at 10:10 AM, Lei Shi <lshi at ku.edu> wrote:
>>>>>
>>>>>> Huiwei,
>>>>>>
>>>>>> Thanks for your email. Your answer leads to another question of mine
>>>>>> about asynchronous MPI communication.
>>>>>>
>>>>>> I'm trying to overlap communication and computation to speed up my MPI
>>>>>> code. I read some papers comparing different approaches to overlapped
>>>>>> communication: the "naive" implementation, which only uses non-blocking
>>>>>> MPI_Isend/MPI_Irecv, and a hybrid approach using OpenMP and MPI
>>>>>> together, where a separate thread does all the non-blocking
>>>>>> communication. Exactly as you said, the results indicate that current
>>>>>> MPI implementations do not support true asynchronous communication.
>>>>>>
>>>>>> If I use the naive approach, my code gives almost the same performance
>>>>>> with non-blocking as with blocking send/recv in terms of wall time. All
>>>>>> communication is postponed to MPI_Wait.
>>>>>>
>>>>>> I have tried calling MPI_Test to push the library to make progress
>>>>>> during the iterations, and also using a dedicated thread for
>>>>>> communication with the other thread doing only computation. However,
>>>>>> the performance gains are very small, or there is no gain at all. I'm
>>>>>> wondering whether it is due to the hardware; the cluster I tested on
>>>>>> uses 10G Ethernet cards.
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Lei Shi
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 3, 2015 at 8:49 AM, Huiwei Lu <huiweilu at mcs.anl.gov>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Lei,
>>>>>>>
>>>>>>> As far as I know, current MPI implementations do not support true
>>>>>>> asynchronous communication by default; i.e., if there are no MPI calls
>>>>>>> in your iterations, MPICH will not be able to make progress on the
>>>>>>> communication.
>>>>>>>
>>>>>>> One solution is to poll the MPI runtime regularly to make progress by
>>>>>>> inserting MPI_Test into your iteration (even though you do not want to
>>>>>>> check the data).
>>>>>>>
>>>>>>> Another solution is to enable MPI's asynchronous progress thread to
>>>>>>> make progress for you.
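>>>>>>>
>>>>>>> To illustrate the first option at the level of the outer time loop (a
>>>>>>> hedged sketch with made-up names, not actual solver code): one
>>>>>>> MPI_Testall per iteration is enough to drive progress, and the flag can
>>>>>>> simply be ignored if the data is not needed yet.
>>>>>>>
>>>>>>>   #include <mpi.h>
>>>>>>>
>>>>>>>   void compute_one_iteration();   // hypothetical local work
>>>>>>>
>>>>>>>   void time_loop(MPI_Request *reqs, int n_reqs, int n_iters) {
>>>>>>>     for (int it = 0; it < n_iters; ++it) {
>>>>>>>       compute_one_iteration();
>>>>>>>       int flag = 0;               // not acted on; the call is only
>>>>>>>                                   // there to let MPICH make progress
>>>>>>>       MPI_Testall(n_reqs, reqs, &flag, MPI_STATUSES_IGNORE);
>>>>>>>     }
>>>>>>>     // The requests still have to be completed (or freed) eventually.
>>>>>>>     MPI_Waitall(n_reqs, reqs, MPI_STATUSES_IGNORE);
>>>>>>>   }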
>>>>>>>
>>>>>>> --
>>>>>>> Huiwei
>>>>>>>
>>>>>>> On Thu, Apr 2, 2015 at 11:44 PM, Lei Shi <lshi at ku.edu> wrote:
>>>>>>>
>>>>>>>> Hi Junchao,
>>>>>>>>
>>>>>>>> Thanks for your reply. In my case, I don't want to check whether the
>>>>>>>> data has been received or not, so I don't want to call MPI_Test or any
>>>>>>>> other function to verify it. But my problem is that if I skip calling
>>>>>>>> MPI_Wait and just call Isend/Irecv, my program freezes for several
>>>>>>>> seconds and then continues to run. My guess is that I probably messed
>>>>>>>> up the MPI library's internal buffers by doing this.
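>>>>>>>>
>>>>>>>> For completeness, a hedged note: if you truly never want to touch a
>>>>>>>> send request again, the standard-conforming way is MPI_Request_free
>>>>>>>> rather than dropping the handle; the send still completes in the
>>>>>>>> background, but the library may reclaim the request. It is generally
>>>>>>>> discouraged for receives, and the send buffer must not be reused until
>>>>>>>> you know by other means that the send has finished. A minimal sketch
>>>>>>>> with made-up names and values:
>>>>>>>>
>>>>>>>>   #include <mpi.h>
>>>>>>>>
>>>>>>>>   void fire_and_forget_send(double *buf, int n, int dest, MPI_Comm comm){
>>>>>>>>     MPI_Request req;
>>>>>>>>     MPI_Isend(buf, n, MPI_DOUBLE, dest, /*tag=*/52, comm, &req);
>>>>>>>>     MPI_Request_free(&req);  // handle becomes MPI_REQUEST_NULL; the
>>>>>>>>                              // send itself still completes later
>>>>>>>>     // 'buf' must not be reused until the send is known (by other
>>>>>>>>     // means) to have completed.
>>>>>>>>   }
>>>>>>>>
>>>>>>>> Note that this does not make the data move any sooner; progress still
>>>>>>>> only happens inside MPI calls, which is why the MPI_Test and
>>>>>>>> progress-thread suggestions elsewhere in this thread still matter.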
>>>>>>>>
>>>>>>>> On Thu, Apr 2, 2015 at 7:25 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Does MPI_Test fit your needs?
>>>>>>>>>
>>>>>>>>> --Junchao Zhang
>>>>>>>>>
>>>>>>>>> On Thu, Apr 2, 2015 at 7:16 PM, Lei Shi <lshi at ku.edu> wrote:
>>>>>>>>>
>>>>>>>>>> I want to use non-blocking send/recv, MPI_Isend/MPI_Irecv, to do
>>>>>>>>>> the communication. But in my case, I don't really care what kind of
>>>>>>>>>> data I get or whether it is ready to use or not, so I don't want to
>>>>>>>>>> waste time on synchronization by calling MPI_Wait or similar APIs.
>>>>>>>>>>
>>>>>>>>>> But when I avoid calling MPI_Wait, my program freezes for several
>>>>>>>>>> seconds after running some iterations (after multiple
>>>>>>>>>> MPI_Isend/Irecv calls), then continues. It takes even more time than
>>>>>>>>>> the case with MPI_Wait. So my question is how to do "true"
>>>>>>>>>> non-blocking communication without waiting for the data to be ready.
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jeff Hammond
>>>>> jeff.science at gmail.com
>>>>> http://jeffhammond.github.io/
>>>>>
>>>>
>>>>
>>>
>>
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss


More information about the discuss mailing list