[mpich-discuss] having problem running MPICH on multiple nodes

Amin Hassani ahassani at cis.uab.edu
Tue Nov 25 12:52:03 CST 2014


Hi,

I am having a problem running MPICH on multiple nodes. When I run
multiple MPI processes on one node, everything works, but when I try to
run across multiple nodes, it fails with the error below.
My machines run Debian and have both InfiniBand and TCP interconnects.
I'm guessing it has something to do with the TCP network, but Open MPI
runs on these machines with no problem; for some reason only MPICH fails
on multiple nodes. Please let me know if more info is needed from my side.
I suspect there is some configuration that I am missing. I used MPICH
3.1.3 for this test. I googled the problem but couldn't find a solution.

In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD.

My host file (hosts-hydra) is something like this:
oakmnt-0-a:1
oakmnt-0-b:1

I get this error:

$ mpirun -hostfile hosts-hydra -np 2 test_dup
Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
internal ABORT - process 1
internal ABORT - process 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 30744 RUNNING AT oakmnt-0-b
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[mpiexec@vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor)
[mpiexec@vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy
[mpiexec@vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion

Thanks.
Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.