[mpich-discuss] Having a problem running MPICH on multiple nodes

"Antonio J. Peña" apenya at mcs.anl.gov
Tue Nov 25 14:46:24 CST 2014


Hi Amin,

Can you share with us a minimal piece of code with which you can 
reproduce this issue?
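
Something along these lines would be plenty (just a sketch of a minimal
Allreduce test, assuming your test_dup does roughly this; it is not your
actual code):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, in, sum;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* the single collective described in the report below:
       * an allreduce over MPI_COMM_WORLD */
      in = rank;
      MPI_Allreduce(&in, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

      printf("rank %d: sum = %d\n", rank, sum);

      MPI_Finalize();
      return 0;
  }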

Thanks,
   Antonio
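
P.S. Since you suspect the TCP network, it may also be worth forcing
Hydra onto a specific interface while you test, for example (eth0 here
is just a placeholder for whichever interface actually connects your
nodes):

  mpirun -iface eth0 -hostfile hosts-hydra -np 2 test_dup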


On 11/25/2014 12:52 PM, Amin Hassani wrote:
> Hi,
>
> I am having a problem running MPICH on multiple nodes. When I run
> multiple MPI processes on one node, it works fine, but when I try to
> run on multiple nodes, it fails with the error below.
> My machines run Debian and have both InfiniBand and TCP interconnects.
> I'm guessing it has something to do with the TCP network, but I can
> run Open MPI on these machines with no problem; for some reason I just
> cannot run MPICH on multiple nodes. Please let me know if more
> information is needed from my side. I suspect there is some
> configuration that I am missing. I used MPICH 3.1.3 for this test. I
> googled this problem but couldn't find a solution.
>
> In my MPI program, I am doing a simple MPI_Allreduce over MPI_COMM_WORLD.
>
> My host file (hosts-hydra) is something like this:
> oakmnt-0-a:1
> oakmnt-0-b:1
>
> I get this error:
>
> $ mpirun -hostfile hosts-hydra -np 2  test_dup
> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
> internal ABORT - process 1
> internal ABORT - process 0
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 30744 RUNNING AT oakmnt-0-b
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [mpiexec@vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor)
> [mpiexec@vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy
> [mpiexec@vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
> [mpiexec@vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
> [mpiexec@vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
>
> Thanks.
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss


-- 
Antonio J. Peña
Postdoctoral Appointee
Mathematics and Computer Science Division
Argonne National Laboratory
9700 South Cass Avenue, Bldg. 240, Of. 3148
Argonne, IL 60439-4847
apenya at mcs.anl.gov
www.mcs.anl.gov/~apenya

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

