[mpich-discuss] mpiexec crash

Pavan Balaji balaji at mcs.anl.gov
Mon Oct 28 21:31:58 CDT 2013


Rohit,

Can you send us the actual error messages?  What you sent are only the cleanup messages that Hydra printed after it noticed that your application had died and cleaned up the remaining processes.  They don’t tell us why the application failed.
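
One way to separate the two kinds of output is to filter on the prefix Hydra puts on its own diagnostics. The following is a minimal sketch, not part of MPICH: the function name split_output and the regular expression are assumptions based only on the "[proxy:0:0 at gretel]" and "[mpiexec at gretel]" prefixes visible in this thread (the list archive renders "@" as " at ", so both forms are accepted).

```python
import re

# Assumed prefix of Hydra's own diagnostics, e.g. "[proxy:0:0@gretel]" or
# "[mpiexec@gretel]". The " at " variant covers the archive's obfuscation.
HYDRA_PREFIX = re.compile(r"^\[(?:proxy:\d+:\d+|mpiexec)(?:@| at )[^\]]+\]")


def split_output(lines):
    """Split captured output into (hydra_lines, other_lines).

    hydra_lines are the process manager's own messages; other_lines is
    everything else, which is where the application's real error
    (if it printed one) should show up.
    """
    hydra, other = [], []
    for line in lines:
        if HYDRA_PREFIX.match(line.strip()):
            hydra.append(line)
        else:
            other.append(line)
    return hydra, other
```

Feeding it the combined stdout and stderr of a failed run, split into lines, leaves only the non-Hydra lines in the second list. If that list is empty, the application most likely died without printing anything, and a core dump or debugger would be the next place to look.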

Thanks,

  -- Pavan

On Oct 28, 2013, at 7:48 PM, Jain, Rohit <Rohit_Jain at mentor.com> wrote:

> Pavan,
> 
> We retried the runs. The ENOENT error is gone, but MPI still fails consistently with the same error:
> 
>> [proxy:0:0 at gretel] HYD_pmcd_pmip_control_cmd_cb (</PATH/TO>/src/pm/hydra/pm/pmiserv/pmip_cb.c:934): assert (!closed) failed
>> [proxy:0:0 at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>> [proxy:0:0 at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
>> [mpiexec at gretel] control_cb (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
>> [mpiexec at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>> [mpiexec at gretel] HYD_pmci_wait_for_completion (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
>> [mpiexec at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
> 
> We are running everything on the same machine, as:
> 	mpiexec -n 1 <exec> : -n 1 <exec> :.....
> 
> What would cause such an error to appear? How do we debug such issues?
> 
> Regards,
> Rohit
> 
> 
> -----Original Message-----
> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On Behalf Of Pavan Balaji
> Sent: Wednesday, October 23, 2013 7:22 PM
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] mpiexec crash
> 
> 
> On Oct 23, 2013, at 5:29 PM, Cherukumilli, Vasu <Vasu_Cherukumilli at mentor.com> wrote:
>> Crash that we are seeing:
>> 
>> [proxy:0:0 at gretel] HYD_pmcd_pmip_control_cmd_cb (</PATH/TO>/src/pm/hydra/pm/pmiserv/pmip_cb.c:934): assert (!closed) failed
>> [proxy:0:0 at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>> [proxy:0:0 at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
>> [mpiexec at gretel] control_cb (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
>> [mpiexec at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>> [mpiexec at gretel] HYD_pmci_wait_for_completion (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
>> [mpiexec at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
> 
> These are cleanup messages.  You should also have gotten output that says so.
> 
>> No such file or directory. (errno = ENOENT)
> 
> This is the real error message.  Did you make sure your executables are located on all the nodes in the same location?
> 
>  -- Pavan
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji



