[mpich-discuss] MPI_Send/MPI_Recv - getting delayed indefinitely

Madhawa Bandara madawa911 at gmail.com
Wed Dec 25 20:50:32 CST 2013


Thanks for the response. It turns out I was using float as the data type in the program, whereas MPI_DOUBLE was used in the communication calls. That was the reason.
Does serializing mean assigning a tag?

Sent from my Windows phone

-----Original Message-----
From: "Pavan Balaji" <balaji at mcs.anl.gov>
Sent: 12/25/2013 22:14
To: "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPI_Send/MPI_Recv - getting delayed indefinitely


Conceptually, I don’t see anything wrong with the code snippet.  There is a performance problem, since you are serializing the MPI_Recv’s at the master: it cannot receive data from worker 2 before it receives data from worker 1, even if worker 2 is ready earlier.  That will cost some performance, but it should not affect correctness (and for a 3-node program, you might not care anyway).

The segmentation fault might be an issue with the serial part of your code.  Typically this means you didn’t allocate enough memory somewhere.  Did you try running the processes with a debugger?

Instead of doing:

% mpiexec -n 3 ./foo

try this:

% mpiexec -n 3 ddd ./foo

“ddd” is a frontend for gdb and will open a different debugging screen for each process.

  — Pavan

On Dec 23, 2013, at 8:04 PM, Madhawa Bandara <madawa911 at gmail.com> wrote:

> 
> Hi,
> 
> I use mpich2 on a small cluster of 3 nodes and each node has Ubuntu 12.04 installed. I use this cluster to do the following.
> 
> 1. The master node sends some matrices to the 2 workers.
> 2. The workers perform some calculations and send the resulting matrices back to the master.
> 3. The master performs some final calculations.
> 
> code snippet:
> 
> //master(taskid=0)
> 
>  MPI_Send(ha11, n / 2 * n / 2, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD); //to worker 1
>  MPI_Send(ha11, n / 2 * n / 2, MPI_DOUBLE, 2, 1, MPI_COMM_WORLD); //to worker 2
> 
> 
>  MPI_Recv(hM1, n / 2 * n / 2, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD,&status); //from worker 1
>  MPI_Recv(hM2, n / 2 * n / 2, MPI_DOUBLE, 2, 2, MPI_COMM_WORLD,&status);//from worker 2
> 
> //final calculations using hM1,hM2
> 
> //worker 1 (taskid=1)
> 
> MPI_Recv(ha11, n / 2 * n / 2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,&status);
> //does some calculations
> MPI_Send(hM1, n / 2 * n / 2, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD); //sends back
> 
>  
> //worker 2(taskid=2)
> MPI_Recv(ha11, n / 2 * n / 2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,&status);
> //does some calculations
> MPI_Send(hM2, n / 2 * n / 2, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD); //sends back
> 
> 
> This worked fine at first, for n=128 to n=2048. But after I pushed 'n' beyond 2048, I got a segmentation fault from worker 1.
> 
> Since then, the code works fine for small n values. But whenever I set n=128 or greater, worker 1 gets delayed indefinitely while the rest of the nodes work fine.
> 
> What could be the reason for this, and how can I resolve it? If I have made any mistakes, please point them out. Thanks in advance.
> -- 
> Regards,
> H.K. Madhawa Bandara
> 
> 
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji


