[mpich-discuss] MPI_Send/MPI_Recv - getting delayed indefinitely

Madhawa Bandara madawa911 at gmail.com
Mon Dec 23 06:04:02 CST 2013


Hi,

I use mpich2 on a small cluster of 3 nodes, each running Ubuntu 12.04. I use
this cluster to do the following:

1. The *master* node sends some matrices to the 2 *workers*.
2. The workers perform some calculations and send the resulting matrices back
to the master.
3. The master performs some final calculations.

code snippet:

//master (taskid=0)

MPI_Send(ha11, n / 2 * n / 2, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD); // to worker 1
MPI_Send(ha11, n / 2 * n / 2, MPI_DOUBLE, 2, 1, MPI_COMM_WORLD); // to worker 2

MPI_Recv(hM1, n / 2 * n / 2, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &status); // from worker 1
MPI_Recv(hM2, n / 2 * n / 2, MPI_DOUBLE, 2, 2, MPI_COMM_WORLD, &status); // from worker 2

// final calculations using hM1, hM2

//worker 1 (taskid=1)

MPI_Recv(ha11, n / 2 * n / 2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
// does some calculations
MPI_Send(hM1, n / 2 * n / 2, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD); // sends back

//worker 2 (taskid=2)

MPI_Recv(ha11, n / 2 * n / 2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
// does some calculations
MPI_Send(hM2, n / 2 * n / 2, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD); // sends back

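Below is a minimal, self-contained sketch of the same pattern, in case it helps
to see the structure end to end. It is not my actual program: the buffers are
heap-allocated with malloc, the worker calculation is just a placeholder
doubling loop, and the master reuses one result buffer (my real code uses
separate hM1/hM2 buffers).

/*
 * Sketch of the master/worker exchange described above.
 * Build with mpicc, run with: mpiexec -n 3 ./a.out
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int taskid;
    int n = 4096;                       /* illustrative problem size     */
    int count = (n / 2) * (n / 2);      /* elements per half-sized block */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

    double *block  = malloc((size_t)count * sizeof(double)); /* input   */
    double *result = malloc((size_t)count * sizeof(double)); /* result  */

    if (taskid == 0) {
        for (int i = 0; i < count; i++)         /* fill with dummy data  */
            block[i] = (double)i;

        /* master: send the block to both workers with tag 1 */
        MPI_Send(block, count, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
        MPI_Send(block, count, MPI_DOUBLE, 2, 1, MPI_COMM_WORLD);

        /* then receive each worker's result with tag 2 */
        MPI_Recv(result, count, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &status);
        MPI_Recv(result, count, MPI_DOUBLE, 2, 2, MPI_COMM_WORLD, &status);
        printf("master: received both results\n");
    } else if (taskid == 1 || taskid == 2) {
        /* worker: receive the block, compute, send the result back */
        MPI_Recv(block, count, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
        for (int i = 0; i < count; i++)
            result[i] = 2.0 * block[i];         /* placeholder calculation */
        MPI_Send(result, count, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    }

    free(block);
    free(result);
    MPI_Finalize();
    return 0;
}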

This worked fine at first, for n = 128 up to n = 2048. But after I pushed n
beyond 2048, I got a segmentation fault from worker 1.

Since then, the code works fine for small n values, but whenever I set n to
128 or greater, worker 1 is delayed indefinitely while the rest of the nodes
work fine.

What could be the reason for this, and how can I resolve it? If I have made
any mistakes, please point them out. Thanks in advance.
-- 
Regards,
*H.K. Madhawa Bandara*