[mpich-discuss] How to use non-blocking send/receive without calling MPI_Wait

Jeff Hammond jeff.science at gmail.com
Tue Apr 7 10:14:10 CDT 2015


I don't expect MPI to make async progress on Ethernet unless you have a
comm thread in MPI poking the network.  In MPICH, you would use
MPICH_ASYNC_PROGRESS=1.  There is something similar in Intel MPI, but I do
not know the name of the environment variable (this may be ironic, given I
work for Intel, but such is life).
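
For what it's worth, a minimal init sketch that should be compatible with an
implementation-internal progress thread (this assumes the variable is exported
in the job environment before launch, and that the progress thread wants
MPI_THREAD_MULTIPLE, which is my understanding for MPICH):

  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv) {
    // The progress thread is enabled from the environment, e.g. by exporting
    // MPICH_ASYNC_PROGRESS=1 before mpiexec; nothing else changes in the code.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
      std::fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not provided (%d)\n",
                   provided);
    // ... post MPI_Isend/MPI_Irecv, compute, MPI_Waitall as usual ...
    MPI_Finalize();
    return 0;
  }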

Your solution with OpenMP of dedicating a comm thread makes sense.  I'd
make thread 0 the comm thread and let the rest of the threads compute,
since this might have some advantages relative to NUMA or thread migration,
but it's probably not a significant effect on most platforms.  Also, this
is technically required for using MPI_THREAD_FUNNELED, but I know of no
implementation where it actually matters.
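
A rough sketch of what I mean, with the MPI work funneled through thread 0 and
placeholder names for the compute routine:

  #include <mpi.h>
  #include <omp.h>

  // Sketch only: thread 0 (the thread that initialized MPI) drives the
  // requests; the other threads compute. Helper names are hypothetical.
  void update_with_comm_thread(int nreq, MPI_Request *reqs) {
    #pragma omp parallel
    {
      if (omp_get_thread_num() == 0) {
        // Only the main thread touches MPI, as MPI_THREAD_FUNNELED requires.
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
      } else {
        // compute_local_part();  // hypothetical: work that needs no halo data
      }
      #pragma omp barrier
      // Past the barrier, every thread may safely use the received buffers.
    }
  }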

At this point, I don't know what more you want.  Your expectation of
perfect overlap is not justified on Ethernet and you appear to have written
reasonable code for your problem of interest.

Jeff

On Tue, Apr 7, 2015 at 12:48 AM, Lei Shi <lshi at ku.edu> wrote:

>  Here is my pure MPI overlap version. I use Intel Trace Analyzer; the
> profiling shows that right now, communication only proceeds when I call
> MPI_Waitall on nodes with a 10G network.
>
> /** pure mpi overlap  **/
>   template<typename T>
>   void CPR_NS_3D_Solver<T>::UpdateRes(T**q, T**res){
>     if(_n_proc>1)
>       SendInterfaceSol(); //call isend/irecv to send msg 1
>
>     ResFromDivInvisFlux(q,res); //do local jobs
>
>     if(_n_proc>1){
>       RevInterfaceSol(); //mpi_waitall for msg 1
>       if(vis_mode_)
>         SendInterfaceCorrGrad(); //depends on msg 1 then snd msg 2
>     }
>
>     if(vis_mode_)
>       ResFromDivVisFlux(q,res); //computing, which depends on msg 1
>
>     if(_n_proc>1 && vis_mode_)
>       RevInterfaceCorrGrad(); //mpi_waitall for msg 2
>
>     ResFromFluxCorrection(q,res); //computing, which depends on msg 1 and 2
>   }
>
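>
> For reference, a sketch of the kind of MPI_Test polling I tried to drive
> progress between the Isend/Irecv and the waitall (the chunked loop and
> n_chunks_ are made up for illustration; the request arrays are the ones
> used above):
>
>   /** sketch: drive progress while doing the local inviscid residual **/
>   template<typename T>
>   void CPR_NS_3D_Solver<T>::ResFromDivInvisFluxPolled(T**q, T**res){
>     int done = 0;
>     for(int chunk = 0; chunk < n_chunks_; ++chunk){ // n_chunks_: hypothetical
>       //... compute one chunk of the local residual ...
>       if(!done){
>         int flag = 0;
>         MPI_Testall(n_proc_exchange_, r_sol_req_, &flag, MPI_STATUSES_IGNORE);
>         if(flag) done = 1; // all receives for msg 1 have completed
>       }
>     }
>   }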
>
> On Tue, Apr 7, 2015 at 2:39 AM, Lei Shi <lshi at ku.edu> wrote:
>
>>
>>
>> On Tue, Apr 7, 2015 at 2:37 AM, Lei Shi <leishi at ku.edu> wrote:
>>
>>> Hi Huiwei and Jeff,
>>>
>>> I use hybrid OpenMP/MPI to overlap communication with computation. So I
>>> put all communication in one dedicated OpenMP thread and computation in
>>> the other thread. For this case, I'm using the Intel MPI library. I
>>> probably made some mistakes.
>>>
>>> One version of my code, using one dedicated thread to do the messaging, is
>>> like this:
>>>
>>> /** hybrid mpi/openmp overlap **/
>>> template<typename T>
>>> void CPR_NS_3D_Solver<T>::UpdateRes(T**q, T**res){
>>>   int thread_id,n_thread;
>>>   int sol_rev_flag=0,grad_rev_flag=0;
>>>
>>>   // Explicitly disable dynamic teams
>>>   omp_set_dynamic(0);
>>>   // Use 2 threads for all consecutive parallel regions
>>>   omp_set_num_threads(2);
>>>     #pragma omp parallel default(shared) private(thread_id)
>>>   {
>>>     thread_id=omp_get_thread_num();
>>>     n_thread=omp_get_num_threads();
>>>
>>>     /** communication thread   **/
>>>     if(thread_id==1){
>>>       SendInterfaceSol();
>>>       RevInterfaceSol();
>>>       #pragma omp flush
>>>       sol_rev_flag=1;
>>>       #pragma omp flush(sol_rev_flag)
>>>     }
>>>
>>>     /** computation thread **/
>>>     if(thread_id==0){
>>>       ResFromDivInvisFlux(q,res); //local computation
>>>         #pragma omp flush(sol_rev_flag)
>>>         while(sol_rev_flag!=1){
>>>           #pragma omp flush(sol_rev_flag)
>>>         }
>>>         #pragma omp flush
>>>         ResFromFluxCorrection(q,res); //depends on interface sol
>>>     }
>>>   }//end of omp
>>>     }
>>>
>>> template<typename T>
>>>
>>>   void CPR_NS_3D_Solver<T>::SendInterfaceSol(){
>>>     uint *n_if_to_proc=this->grid_->num_iface_proc;
>>>     uint **if_to_proc=this->grid_->snd_iface_proc;
>>>     uint **rev_if_to_f=this->grid_->rev_iface_proc;
>>>
>>>     int tag=52;
>>>     for(int p2=0;p2<_n_proc;++p2){
>>>       if(p2!=_proc_id){
>>>         int nif=n_if_to_proc[p2];
>>>         //pack data to send ....
>>>
>>>       }
>>>     }
>>>
>>>     /** Exchange interface sol **/
>>>     int n_proc_exchange=0;
>>>     for(int z=0;z<_n_proc;++z){
>>>       int nif=n_if_to_proc[z];
>>>
>>>       //send data
>>>       if(nif>0){
>>>         MPI_Isend(&snd_buf_[z][0],n_buf_[z],MPI_DOUBLE,z,tag, MPI_COMM_WORLD, &s_sol_req_[n_proc_exchange]);
>>>         MPI_Irecv(&rev_buf_[z][0],n_buf_[z],MPI_DOUBLE,z,tag, MPI_COMM_WORLD, &r_sol_req_[n_proc_exchange]);
>>>         n_proc_exchange++;
>>>       }
>>>     }
>>>
>>>   }
>>>
>>>   template<typename T>
>>>   void CPR_NS_3D_Solver<T>::RevInterfaceSol(){
>>>     uint *n_if_to_proc=this->grid_->num_iface_proc;
>>>     uint **if_to_proc=this->grid_->snd_iface_proc;
>>>     uint **rev_if_to_f=this->grid_->rev_iface_proc;
>>>
>>>     //wait
>>>     if(n_proc_exchange_>0){
>>>       MPI_Waitall(n_proc_exchange_,s_sol_req_,MPI_STATUSES_IGNORE);
>>>       MPI_Waitall(n_proc_exchange_,r_sol_req_,MPI_STATUSES_IGNORE);
>>>     }
>>>
>>>     /** store to local data structure **/
>>>     for(int z=0;z<_n_proc;++z){
>>>       int nif=n_if_to_proc[z];
>>>
>>>       if(nif>0){
>>>
>>>        //unpacking ....
>>>       }
>>>     }
>>>
>>>   }
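>>>
>>> (An alternative form of the flag handshake, sketch only and assuming
>>> OpenMP 4.0 seq_cst atomics are available, replaces the bare flushes with
>>> an atomic write/read pair:)
>>>
>>>     /** communication thread: publish the flag after the waits finish **/
>>>     #pragma omp atomic write seq_cst
>>>     sol_rev_flag = 1;
>>>
>>>     /** computation thread: spin until the flag becomes visible **/
>>>     int seen = 0;
>>>     while(!seen){
>>>       #pragma omp atomic read seq_cst
>>>       seen = sol_rev_flag;
>>>     }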
>>>
>>>
>>>
>>>
>>>
>>>
>>> Sincerely Yours,
>>>
>>> Lei Shi
>>> ---------
>>>
>>> On Fri, Apr 3, 2015 at 4:37 PM, Jeff Hammond <jeff.science at gmail.com>
>>> wrote:
>>>
>>>> As far as I know, Ethernet is not good at making asynchronous progress
>>>> in hardware the way e.g. InfiniBand is.  I would have thought that a
>>>> dedicated progress thread would help, but it seems you tried that.  Did you
>>>> use your own progress thread or MPICH_ASYNC_PROGRESS=1?
>>>>
>>>> Jeff
>>>>
>>>> On Fri, Apr 3, 2015 at 10:10 AM, Lei Shi <lshi at ku.edu> wrote:
>>>>
>>>>> Huiwei,
>>>>>
>>>>> Thanks for your email. Your answer leads to another question of mine
>>>>> about asynchronous MPI communication.
>>>>>
>>>>> I'm trying to overlap communication with computation to speed up my MPI
>>>>> code. I read some papers comparing different approaches to overlapped
>>>>> communication: the "naive" implementation, which only uses non-blocking
>>>>> MPI_Isend/MPI_Irecv, and the hybrid approach using OpenMP and MPI
>>>>> together, in which a separate thread is used to do all non-blocking
>>>>> communication. Exactly as you said, the results indicate that current
>>>>> MPI implementations do not support true asynchronous communication.
>>>>>
>>>>> If I use the naive approach, my code with non-blocking or blocking
>>>>> send/recv gives me almost the same performance in terms of wall time.
>>>>> All communication is postponed to MPI_Wait.
>>>>>
>>>>> I have tried calling MPI_Test to push the library to make progress on
>>>>> communication during the iterations, and I have also tried using a
>>>>> dedicated thread to do communication while the other thread does only
>>>>> computation. However, the performance gains are very small, or there is
>>>>> no gain at all. I wonder if it is due to the hardware; the cluster I
>>>>> tested on uses 10G Ethernet cards.
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Lei Shi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Apr 3, 2015 at 8:49 AM, Huiwei Lu <huiweilu at mcs.anl.gov>
>>>>> wrote:
>>>>>
>>>>>> Hi Lei,
>>>>>>
>>>>>> As far as I know, no current MPI implementation supports true
>>>>>> asynchronous communication for now; i.e., if there are no MPI calls in
>>>>>> your iterations, MPICH will not be able to make progress on
>>>>>> communication.
>>>>>>
>>>>>> One solution is to poll the MPI runtime regularly to make progress by
>>>>>> inserting MPI_Test into your iteration (even though you do not want to
>>>>>> check the data).
>>>>>>
>>>>>> Another solution is to enable MPI's asynchronous progress thread to
>>>>>> make progress for you.
>>>>>>
>>>>>> --
>>>>>> Huiwei
>>>>>>
>>>>>> On Thu, Apr 2, 2015 at 11:44 PM, Lei Shi <lshi at ku.edu> wrote:
>>>>>>
>>>>>>> Hi Junchao,
>>>>>>>
>>>>>>> Thanks for your reply. In my case, I don't want to check whether the
>>>>>>> data has been received or not, so I don't want to call MPI_Test or any
>>>>>>> function to verify it. But my problem is that if I skip calling
>>>>>>> MPI_Wait and just call Isend/Irecv, my program freezes for several
>>>>>>> seconds and then continues to run. My guess is that I probably messed
>>>>>>> up the MPI library's internal buffers by doing this.
>>>>>>>
>>>>>>> On Thu, Apr 2, 2015 at 7:25 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Does MPI_Test fit your needs?
>>>>>>>>
>>>>>>>> --Junchao Zhang
>>>>>>>>
>>>>>>>> On Thu, Apr 2, 2015 at 7:16 PM, Lei Shi <lshi at ku.edu> wrote:
>>>>>>>>
>>>>>>>>> I want to use non-blocking send/receive, MPI_Isend/MPI_Irecv, to do
>>>>>>>>> communication. But in my case, I don't really care what data I get
>>>>>>>>> or whether it is ready to use or not, so I don't want to waste time
>>>>>>>>> on synchronization by calling MPI_Wait or similar APIs.
>>>>>>>>>
>>>>>>>>> But when I avoid calling MPI_Wait, my program freezes for several
>>>>>>>>> seconds after running some iterations (after multiple
>>>>>>>>> MPI_Isend/MPI_Irecv calls), then continues. It takes even more time
>>>>>>>>> than the case with MPI_Wait. So my question is how to do "true"
>>>>>>>>> non-blocking communication without waiting for the data to be ready.
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> jeff.science at gmail.com
>>>> http://jeffhammond.github.io/
>>>>
>>>
>>>
>>
>


-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/