[mpich-discuss] How to use non-blocking send/receive without calling MPI_Wait
Huiwei Lu
huiweilu at mcs.anl.gov
Tue Apr 7 10:12:50 CDT 2015
Hi Lei,
Your profiling is correct. MPI_Isend/Irecv only start the communication;
the real progress is made in MPI_Waitall. If you want to overlap
communication with ResFromDivInvisFlux, you can put MPI_Test calls inside
ResFromDivInvisFlux to force MPI to make progress periodically.
Another option is to use an asynchronous progress thread. Check
MPIR_CVAR_ASYNC_PROGRESS for more information.
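
For example, a rough sketch of that kind of polling (the chunking of the
local work is only illustrative, and compute_with_progress is a placeholder
name; the request array and count would be your r_sol_req_ and
n_proc_exchange_):

    #include <mpi.h>

    /* Sketch: interleave chunks of local work with MPI_Testall so the
       library gets regular chances to advance the pending transfers. */
    void compute_with_progress(MPI_Request *reqs, int n_reqs)
    {
      const int n_chunk = 8;              /* assumed split of the local work */
      for (int c = 0; c < n_chunk; ++c) {
        /* ... do one chunk of the local residual computation here ... */

        int flag = 0;                     /* value ignored; the call only drives progress */
        if (n_reqs > 0)
          MPI_Testall(n_reqs, reqs, &flag, MPI_STATUSES_IGNORE);
      }
    }

The MPI_Waitall you already have still guarantees completion; the
MPI_Testall calls just keep the data moving while you compute.
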
Best regards,
--
Huiwei
On Tue, Apr 7, 2015 at 2:48 AM, Lei Shi <lshi at ku.edu> wrote:
> Here is my pure MPI overlap version. I used Intel Trace Analyzer, and the
> profiling shows that right now communication only proceeds when I call
> MPI_Waitall, on nodes with a 10G network.
>
> /** pure mpi overlap **/
> template<typename T>
> void CPR_NS_3D_Solver<T>::UpdateRes(T** q, T** res){
>   if(_n_proc>1)
>     SendInterfaceSol();          //call isend/irecv to send msg 1
>
>   ResFromDivInvisFlux(q,res);    //do local jobs
>
>   if(_n_proc>1){
>     RevInterfaceSol();           //mpi_waitall for msg 1
>     if(vis_mode_)
>       SendInterfaceCorrGrad();   //depends on msg 1, then send msg 2
>   }
>
>   if(vis_mode_)
>     ResFromDivVisFlux(q,res);    //computing, which depends on msg 1
>
>   if(_n_proc>1 && vis_mode_)
>     RevInterfaceCorrGrad();      //mpi_waitall for msg 2
>
>   ResFromFluxCorrection(q,res);  //computing, which depends on msg 1 and 2
> }
>
>
> On Tue, Apr 7, 2015 at 2:39 AM, Lei Shi <lshi at ku.edu> wrote:
>
>>
>>
>> On Tue, Apr 7, 2015 at 2:37 AM, Lei Shi <leishi at ku.edu> wrote:
>>
>>> Hi Huiwei and Jeff,
>>>
>>> I use hybrid OpenMP/MPI to overlap communication. I put all
>>> communication in one dedicated OpenMP thread and the computation in the
>>> other thread. For this case I'm using the Intel MPI library. Probably I
>>> made some mistakes.
>>>
>>> One version of my code, using one dedicated thread to do the messaging,
>>> is like this:
>>>
>>> /** hybrid mpi/openmp overlap **/
>>> template<typename T>
>>> void CPR_NS_3D_Solver<T>::UpdateRes(T** q, T** res){
>>>   int thread_id, n_thread;
>>>   int sol_rev_flag=0, grad_rev_flag=0;
>>>
>>>   // Explicitly disable dynamic teams
>>>   omp_set_dynamic(0);
>>>   // Use 2 threads for all consecutive parallel regions
>>>   omp_set_num_threads(2);
>>> #pragma omp parallel default(shared) private(thread_id)
>>>   {
>>>     thread_id=omp_get_thread_num();
>>>     n_thread=omp_get_num_threads();
>>>
>>>     /** communication thread **/
>>>     if(thread_id==1){
>>>       SendInterfaceSol();
>>>       RevInterfaceSol();
>>> #pragma omp flush
>>>       sol_rev_flag=1;
>>> #pragma omp flush(sol_rev_flag)
>>>     }
>>>
>>>     /** computation thread **/
>>>     if(thread_id==0){
>>>       ResFromDivInvisFlux(q,res);    //local computation
>>> #pragma omp flush(sol_rev_flag)
>>>       while(sol_rev_flag!=1){        //spin until the comm thread sets the flag
>>> #pragma omp flush(sol_rev_flag)
>>>       }
>>> #pragma omp flush
>>>       ResFromFluxCorrection(q,res);  //depends on interface sol
>>>     }
>>>   }//end of omp
>>> }
>>>
>>> template<typename T>
>>>
>>> void CPR_NS_3D_Solver<T>::SendInterfaceSol(){
>>>   uint *n_if_to_proc=this->grid_->num_iface_proc;
>>>   uint **if_to_proc=this->grid_->snd_iface_proc;
>>>   uint **rev_if_to_f=this->grid_->rev_iface_proc;
>>>
>>>   int tag=52;
>>>   for(int p2=0;p2<_n_proc;++p2){
>>>     if(p2!=_proc_id){
>>>       int nif=n_if_to_proc[p2];
>>>       //pack data to send ....
>>>     }
>>>   }
>>>
>>>   /** Exchange interface sol **/
>>>   n_proc_exchange_=0;   //member counter, also used later by RevInterfaceSol
>>>   for(int z=0;z<_n_proc;++z){
>>>     int nif=n_if_to_proc[z];
>>>
>>>     //send data
>>>     if(nif>0){
>>>       MPI_Isend(&snd_buf_[z][0],n_buf_[z],MPI_DOUBLE,z,tag,
>>>                 MPI_COMM_WORLD,&s_sol_req_[n_proc_exchange_]);
>>>       MPI_Irecv(&rev_buf_[z][0],n_buf_[z],MPI_DOUBLE,z,tag,
>>>                 MPI_COMM_WORLD,&r_sol_req_[n_proc_exchange_]);
>>>       n_proc_exchange_++;
>>>     }
>>>   }
>>> }
>>>
>>> template<typename T>
>>> void CPR_NS_3D_Solver<T>::RevInterfaceSol(){
>>>   uint *n_if_to_proc=this->grid_->num_iface_proc;
>>>   uint **if_to_proc=this->grid_->snd_iface_proc;
>>>   uint **rev_if_to_f=this->grid_->rev_iface_proc;
>>>
>>>   //wait for the pending sends/receives posted in SendInterfaceSol
>>>   if(n_proc_exchange_>0){
>>>     MPI_Waitall(n_proc_exchange_,s_sol_req_,MPI_STATUSES_IGNORE);
>>>     MPI_Waitall(n_proc_exchange_,r_sol_req_,MPI_STATUSES_IGNORE);
>>>   }
>>>
>>>   /** store to local data structure **/
>>>   for(int z=0;z<_n_proc;++z){
>>>     int nif=n_if_to_proc[z];
>>>
>>>     if(nif>0){
>>>       //unpacking ....
>>>     }
>>>   }
>>> }
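>>>
>>> (Note: this scheme assumes MPI was initialized with at least
>>> MPI_THREAD_SERIALIZED, since the MPI calls come from an OpenMP worker
>>> thread rather than the thread that initialized MPI; MPI_THREAD_MULTIPLE
>>> is safer still. A minimal, purely illustrative initialization:)
>>>
>>> #include <mpi.h>
>>> #include <cstdio>
>>>
>>> int main(int argc, char **argv){
>>>   int provided = MPI_THREAD_SINGLE;
>>>   // Request full thread support so the communication thread may call MPI freely.
>>>   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>>   if(provided < MPI_THREAD_SERIALIZED){
>>>     std::fprintf(stderr, "MPI thread support too low for a dedicated comm thread\n");
>>>     MPI_Abort(MPI_COMM_WORLD, 1);
>>>   }
>>>   // ... set up the solver and call UpdateRes as above ...
>>>   MPI_Finalize();
>>>   return 0;
>>> }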
>>>
>>> Sincerely Yours,
>>>
>>> Lei Shi
>>> ---------
>>>
>>> On Fri, Apr 3, 2015 at 4:37 PM, Jeff Hammond <jeff.science at gmail.com>
>>> wrote:
>>>
>>>> As far as I know, Ethernet is not good at making asynchronous progress
>>>> in hardware the way e.g. InfiniBand is. I would have thought that a
>>>> dedicated progress thread would help, but it seems you tried that. Did you
>>>> use your own progress thread or MPICH_ASYNC_PROGRESS=1?
>>>>
>>>> Jeff
>>>>
>>>> On Fri, Apr 3, 2015 at 10:10 AM, Lei Shi <lshi at ku.edu> wrote:
>>>>
>>>>> Huiwei,
>>>>>
>>>>> Thanks for your email. Your answer leads to another question of mine
>>>>> about asynchronous MPI communication.
>>>>>
>>>>> I'm trying to overlap communication with computation to speed up my
>>>>> MPI code. I read some papers comparing different approaches to
>>>>> overlapped communication: the "naive" implementation, which only uses
>>>>> non-blocking MPI Isend/Irecv, and the hybrid approach using OpenMP and
>>>>> MPI together. In the hybrid approach, a separate thread is used to do
>>>>> all non-blocking communication. Exactly as you said, the results
>>>>> indicate that current MPI implementations do not support true
>>>>> asynchronous communication.
>>>>>
>>>>> If I use the naive approach, my code with non-blocking or blocking
>>>>> send/recv gives me almost the same performance in terms of Wtime. All
>>>>> communication is postponed to MPI_Wait.
>>>>>
>>>>> I have tried calling MPI_Test to push the library to make progress
>>>>> during the iterations, and I have also tried using a dedicated thread
>>>>> for communication while the other thread only computes. However, the
>>>>> performance gains are very small or there is no gain at all. I'm
>>>>> wondering whether it is due to the hardware. The cluster I tested on
>>>>> uses 10G Ethernet cards.
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Lei Shi
>>>>>
>>>>> On Fri, Apr 3, 2015 at 8:49 AM, Huiwei Lu <huiweilu at mcs.anl.gov>
>>>>> wrote:
>>>>>
>>>>>> Hi Lei,
>>>>>>
>>>>>> As far as I know, no current MPI implementation supports true
>>>>>> asynchronous communication for now; i.e., if there are no MPI calls in
>>>>>> your iterations, MPICH will not be able to make progress on the
>>>>>> communication.
>>>>>>
>>>>>> One solution is to poll the MPI runtime regularly to make progress by
>>>>>> inserting MPI_Test into your iteration (even though you do not want to
>>>>>> check the data).
>>>>>>
>>>>>> Another solution is to enable MPI's asynchronous progress thread to
>>>>>> make progress for you.
>>>>>>
>>>>>> --
>>>>>> Huiwei
>>>>>>
>>>>>> On Thu, Apr 2, 2015 at 11:44 PM, Lei Shi <lshi at ku.edu> wrote:
>>>>>>
>>>>>>> Hi Junchao,
>>>>>>>
>>>>>>> Thanks for your reply. In my case, I don't want to check whether the
>>>>>>> data have been received or not, so I don't want to call MPI_Test or any
>>>>>>> other function to verify it. But the problem is that if I skip calling
>>>>>>> MPI_Wait and just call Isend/Irecv, my program freezes for several
>>>>>>> seconds and then continues to run. My guess is that I messed up the MPI
>>>>>>> library's internal buffers by doing this.
>>>>>>>
>>>>>>> On Thu, Apr 2, 2015 at 7:25 PM, Junchao Zhang <jczhang at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Does MPI_Test fit your needs?
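>>>>>>>>
>>>>>>>> (For reference, a minimal sketch of such a check; poll_once and req
>>>>>>>> are placeholder names:)
>>>>>>>>
>>>>>>>> #include <mpi.h>
>>>>>>>>
>>>>>>>> // 'req' would be a request returned by MPI_Isend/MPI_Irecv.
>>>>>>>> void poll_once(MPI_Request *req)
>>>>>>>> {
>>>>>>>>   int flag = 0;
>>>>>>>>   // Returns immediately; sets flag to 1 once the request has completed,
>>>>>>>>   // and in any case gives the library a chance to make internal progress.
>>>>>>>>   MPI_Test(req, &flag, MPI_STATUS_IGNORE);
>>>>>>>> }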
>>>>>>>>
>>>>>>>> --Junchao Zhang
>>>>>>>>
>>>>>>>> On Thu, Apr 2, 2015 at 7:16 PM, Lei Shi <lshi at ku.edu> wrote:
>>>>>>>>
>>>>>>>>> I want to use non-blocking send/recv (MPI_Isend/MPI_Irecv) for
>>>>>>>>> communication. In my case, I don't really care what data I get or
>>>>>>>>> whether it is ready to use or not, so I don't want to waste time on
>>>>>>>>> synchronization by calling MPI_Wait or similar APIs.
>>>>>>>>>
>>>>>>>>> But when I avoid calling MPI_Wait, my program freezes for several
>>>>>>>>> seconds after running some iterations (after multiple MPI_Isend/Irecv
>>>>>>>>> calls), then continues. It takes even more time than the case with
>>>>>>>>> MPI_Wait. So my question is how to do "true" non-blocking communication
>>>>>>>>> without waiting for the data to be ready. Thanks.
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> jeff.science at gmail.com
>>>> http://jeffhammond.github.io/
>>>>
>>>
>>>
>>
>