[mpich-discuss] having problem running MPICH on multiple nodes

Amin Hassani ahassani at cis.uab.edu
Tue Nov 25 23:50:22 CST 2014


I tried with the new configure too. Same problem :(

$ mpirun -hostfile hosts-hydra -np 2  test_dup
Fatal error in MPI_Send: Unknown error class, error stack:
MPI_Send(174)..............: MPI_Send(buf=0x7fffd90c76c8, count=1, MPI_INT,
dest=1, tag=0, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1832): Communication error with rank 1: Connection
refused

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 5459 RUNNING AT oakmnt-0-a
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb
(../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event
(../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned
error status
[proxy:0:1 at oakmnt-0-b] main
(../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error
waiting for event
[mpiexec at oakmnt-0-a] HYDT_bscu_wait_for_completion
(../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the
processes terminated badly; aborting
[mpiexec at oakmnt-0-a] HYDT_bsci_wait_for_completion
(../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher
returned error waiting for completion
[mpiexec at oakmnt-0-a] HYD_pmci_wait_for_completion
(../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned
error waiting for completion
[mpiexec at oakmnt-0-a] main
(../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
waiting for completion
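
The "Connection refused" in the stack above appears to come from the TCP
connection between the two ranks. One way to check plain TCP reachability
between the nodes, independent of MPICH, is a small probe like the sketch
below. It is only an illustration: probe_tcp is a hypothetical helper, not
part of MPICH, and it assumes something is already listening on the chosen
port on the other node.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not an MPICH function): try one TCP connection to
 * host:port. Returns 0 if a connection is accepted, -1 if the host name
 * cannot be resolved, and otherwise the errno of the last failed attempt
 * (for example ECONNREFUSED). */
int probe_tcp(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int last_err = -1;

    assert(host != NULL && port != NULL);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    /* Try every address returned for the host until one connects. */
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) {
            last_err = errno;
            continue;
        }
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0) {
            close(fd);
            freeaddrinfo(res);
            return 0;
        }
        last_err = errno;  /* save before close() can overwrite errno */
        close(fd);
    }

    freeaddrinfo(res);
    return last_err;
}
```

For example, probe_tcp("oakmnt-0-b", "5000") could be called from oakmnt-0-a
after starting a listener by hand on oakmnt-0-b. Port 5000 is an arbitrary
choice for illustration, not a port MPICH is known to use. A result of
ECONNREFUSED with a listener running would point at a firewall or interface
problem rather than at MPICH itself.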


Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.

On Tue, Nov 25, 2014 at 11:44 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:

> So the error only happens when there is communication.
>
> It may be caused by IB, as you guessed before. Could you try to
> reconfigure MPICH using "./configure --with-device=ch3:nemesis:tcp" and try
> again?
>
> Huiwei
>
> > On Nov 25, 2014, at 11:23 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
> >
> > Yes it works.
> > output:
> >
> > $ mpirun -hostfile hosts-hydra -np 2  test
> > rank 1
> > rank 0
> >
> >
> > Amin Hassani,
> > CIS department at UAB,
> > Birmingham, AL, USA.
> >
> > On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
> > Could you try to run the following simple code to see if it works?
> >
> > #include <mpi.h>
> > #include <stdio.h>
> > int main(int argc, char** argv)
> > {
> >     int rank;
> >     /* Start MPI; no messages are exchanged in this test. */
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     /* Each process prints its own rank. */
> >     printf("rank %d\n", rank);
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > —
> > Huiwei
> >
> > > On Nov 25, 2014, at 11:11 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
> > >
> > > > No, I checked. Also, I always install my MPIs in
> > > > /nethome/students/ahassani/usr/mpi. I never install them in
> > > > /nethome/students/ahassani/usr, so MPI files will never get there. Even
> > > > if I put /usr/mpi/bin in front of /usr/bin, it won't affect anything.
> > > > There has never been any MPI installed in /usr/bin.
> > >
> > > Thank you.
> > > _______________________________________________
> > > discuss mailing list     discuss at mpich.org
> > > To manage subscription options or unsubscribe:
> > > https://lists.mpich.org/mailman/listinfo/discuss