[mpich-discuss] MPI_Send/MPI_Recv - getting delayed indefinitely
Madhawa Bandara
madawa911 at gmail.com
Mon Dec 23 06:04:02 CST 2013
Hi,
I use mpich2 on a small cluster of 3 nodes, each running Ubuntu 12.04. I use
this cluster to do the following:
1. The *master* node sends some matrices to the 2 *workers*.
2. The workers perform some calculations and send the resulting matrices back to
the master.
3. The master performs some final calculations.
code snippet:
//master (taskid=0)
MPI_Send(ha11, n / 2 * n / 2, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);          // to worker 1
MPI_Send(ha11, n / 2 * n / 2, MPI_DOUBLE, 2, 1, MPI_COMM_WORLD);          // to worker 2
MPI_Recv(hM1, n / 2 * n / 2, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &status);  // from worker 1
MPI_Recv(hM2, n / 2 * n / 2, MPI_DOUBLE, 2, 2, MPI_COMM_WORLD, &status);  // from worker 2
//final calculations using hM1, hM2

//worker 1 (taskid=1)
MPI_Recv(ha11, n / 2 * n / 2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
//does some calculations
MPI_Send(hM1, n / 2 * n / 2, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);           // sends back

//worker 2 (taskid=2)
MPI_Recv(ha11, n / 2 * n / 2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
//does some calculations
MPI_Send(hM2, n / 2 * n / 2, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);           // sends back
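
For reference, here is a minimal self-contained version of the pattern above,
compiled with mpicc and run on exactly 3 ranks. The example value of n, the use
of malloc, and the omission of the actual calculation are my assumptions for
illustration; only the send/receive structure matches my real code.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int taskid;
    int n = 4096;                      /* example matrix dimension */
    int count = (n / 2) * (n / 2);     /* doubles per quarter block */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

    double *ha11 = malloc((size_t)count * sizeof(double));
    if (ha11 == NULL) {
        fprintf(stderr, "rank %d: malloc of %d doubles failed\n", taskid, count);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (taskid == 0) {
        /* master: send the block to both workers, then collect the results */
        double *hM1 = malloc((size_t)count * sizeof(double));
        double *hM2 = malloc((size_t)count * sizeof(double));
        MPI_Send(ha11, count, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
        MPI_Send(ha11, count, MPI_DOUBLE, 2, 1, MPI_COMM_WORLD);
        MPI_Recv(hM1, count, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &status);
        MPI_Recv(hM2, count, MPI_DOUBLE, 2, 2, MPI_COMM_WORLD, &status);
        /* final calculations using hM1, hM2 */
        free(hM1);
        free(hM2);
    } else {
        /* workers 1 and 2: receive the block, compute, send the result back */
        MPI_Recv(ha11, count, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, &status);
        /* ... some calculations on ha11 ... */
        MPI_Send(ha11, count, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    }

    free(ha11);
    MPI_Finalize();
    return 0;
}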
This worked fine at first, for n=128 up to n=2048. But after I pushed 'n'
beyond 2048, I got a segmentation fault from worker 1.
Since then, the code still works fine for small n values, but whenever I set
n=128 or greater, worker 1 gets delayed indefinitely while the rest of the
nodes work fine.
What could be the reason for this, and how can I resolve it? If I have made
any mistakes, please point them out. Thanks in advance.
--
Regards,
*H.K. Madhawa Bandara*