[mpich-discuss] MPI_Send/MPI_Recv - getting delayed indefinitely
Pavan Balaji
balaji at mcs.anl.gov
Wed Dec 25 10:44:53 CST 2013
Conceptually, I don’t see anything wrong with the code snippet. There is a performance issue, though, since you are serializing the MPI_Recvs at the master: it cannot receive data from worker 2 before it has received data from worker 1, even if worker 2 is ready earlier. That costs some performance, but it should not affect correctness (and for a 3-node program you probably don’t care anyway).
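If that ordering ever becomes a real bottleneck, one option is to post both receives up front and wait on them together. A minimal sketch, assuming the same buffers and message sizes as in your snippet:

MPI_Request reqs[2];
MPI_Status  stats[2];

/* post both receives; whichever worker finishes first is matched first */
MPI_Irecv(hM1, n / 2 * n / 2, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(hM2, n / 2 * n / 2, MPI_DOUBLE, 2, 2, MPI_COMM_WORLD, &reqs[1]);

/* block until the data from both workers has arrived */
MPI_Waitall(2, reqs, stats);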
The segmentation fault might be an issue with the serial part of your code. Typically this means you didn’t allocate enough memory somewhere. Did you try running the processes with a debugger?
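Before reaching for the debugger, one quick thing to check: if the buffers are heap-allocated, make sure the allocation actually succeeded before handing the pointer to MPI. A rough sketch, assuming the buffers come from malloc, that taskid holds the rank, and the usual stdio.h/stdlib.h includes:

size_t count = (size_t)(n / 2) * (n / 2);
double *hM1 = malloc(count * sizeof(double));
if (hM1 == NULL) {
    fprintf(stderr, "rank %d: could not allocate %zu doubles\n", taskid, count);
    MPI_Abort(MPI_COMM_WORLD, 1);
}

If the allocations look fine, the debugger will show you exactly where the fault happens.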
Instead of doing:
% mpiexec -n 3 ./foo
try this:
% mpiexec -n 3 ddd ./foo
“ddd” is a frontend for gdb and will open a different debugging screen for each process.
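If ddd isn’t installed on your nodes, the same idea works with plain gdb in separate terminal windows, for example (assuming xterm is available where you want the windows to appear):

% mpiexec -n 3 xterm -e gdb ./foo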
— Pavan
On Dec 23, 2013, at 8:04 PM, Madhawa Bandara <madawa911 at gmail.com> wrote:
>
> Hi,
>
> I use mpich2 on a small cluster of 3 nodes, each running Ubuntu 12.04. I use this cluster to do the following.
>
> 1. The master node sends some matrices to the 2 workers.
> 2. The workers perform some calculations and send the resulting matrices back to the master.
> 3. The master performs some final calculations.
>
> code snippet:
>
> //master(taskid=0)
>
> MPI_Send(ha11, n / 2 * n / 2, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD); //to worker 1
> MPI_Send(ha11, n / 2 * n / 2, MPI_DOUBLE, 2, 1, MPI_COMM_WORLD); //to worker 2
>
>
> MPI_Recv(hM1, n / 2 * n / 2, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD,&status); //from worker 1
> MPI_Recv(hM2, n / 2 * n / 2, MPI_DOUBLE, 2, 2, MPI_COMM_WORLD,&status);//from worker 2
>
> //final calculations using hM1,hM2
>
> //worker 1 (taskid=1)
>
> MPI_Recv(ha11, n / 2 * n / 2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,&status);
> //does some calculations
> MPI_Send(hM1, n / 2 * n / 2, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD); //sends back
>
>
> //worker 2 (taskid=2)
> MPI_Recv(ha11, n / 2 * n / 2, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,&status);
> //does some calculations
> MPI_Send(hM2, n / 2 * n / 2, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD); //sends back
>
>
> This worked fine at first, for n=128 to n=2048. But after I pushed n beyond 2048, I got a segmentation fault from worker 1.
>
> Since then, the code works fine for small n values, but whenever I set n=128 or greater, worker 1 gets delayed indefinitely while the rest of the nodes work fine.
>
> What could be the reason for this, and how can I resolve it? If I have made any mistakes, please point them out. Thanks in advance.
> --
> Regards,
> H.K. Madhawa Bandara
>
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji