[mpich-discuss] mpiexec crash

Pavan Balaji balaji at mcs.anl.gov
Mon Oct 28 21:56:25 CDT 2013


Hi Rohit,

In the cleanup message, did Hydra say what went wrong?  For example, did it report that the application had a segmentation fault?
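One quick way to confirm a segmentation fault is to let the crashing rank leave a core file behind. This is a minimal sketch, assuming Linux and a bash-like shell; `<exec>` stands in for the actual binary, exactly as in the command line quoted below:

```shell
# Raise the core-file size limit so a segfaulting process dumps core
# (this applies to the current shell and its child processes only).
ulimit -c unlimited
echo "core limit: $(ulimit -c)"

# Then re-run the failing job from this shell; if a rank crashes, look
# for a core file in the working directory and inspect it, e.g.:
#   mpiexec -n 1 <exec> : -n 1 <exec>
#   gdb <exec> core
```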

  -- Pavan

On Oct 28, 2013, at 9:43 PM, Jain, Rohit <Rohit_Jain at mentor.com> wrote:

> Hi Pavan,
> 
> I don't see any other messages from the application or Hydra. The run goes for a while and then ends abruptly with these messages.
> 
> Regards,
> Rohit
> 
> 
> -----Original Message-----
> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On Behalf Of Pavan Balaji
> Sent: Monday, October 28, 2013 7:32 PM
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] mpiexec crash
> 
> Rohit,
> 
> Can you send us the actual error messages?  What you sent are just the cleanup messages printed when Hydra noticed that your application died and cleaned up the remaining processes.  Those don't give us the information we need.
> 
> Thanks,
> 
>  -- Pavan
> 
> On Oct 28, 2013, at 7:48 PM, Jain, Rohit <Rohit_Jain at mentor.com> wrote:
> 
>> Pavan,
>> 
>> We retried the runs. There is no ENOENT error now, but MPI is still failing consistently with the same error:
>> 
>>> [proxy:0:0 at gretel] HYD_pmcd_pmip_control_cmd_cb (</PATH/TO>/src/pm/hydra/pm/pmiserv/pmip_cb.c:934): assert (!closed) failed
>>> [proxy:0:0 at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>> [proxy:0:0 at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
>>> [mpiexec at gretel] control_cb (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
>>> [mpiexec at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>> [mpiexec at gretel] HYD_pmci_wait_for_completion (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
>>> [mpiexec at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
>> 
>> We are running it on the same machine as:
>> 	mpiexec -n 1 <exec> : -n 1 <exec> :.....
>> 
>> What would cause such an error to appear? How do we debug such issues?
>> 
>> Regards,
>> Rohit
>> 
>> 
>> -----Original Message-----
>> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On 
>> Behalf Of Pavan Balaji
>> Sent: Wednesday, October 23, 2013 7:22 PM
>> To: discuss at mpich.org
>> Subject: Re: [mpich-discuss] mpiexec crash
>> 
>> 
>> On Oct 23, 2013, at 5:29 PM, Cherukumilli, Vasu <Vasu_Cherukumilli at mentor.com> wrote:
>>> Crash that we are seeing:
>>> 
>>> [proxy:0:0 at gretel] HYD_pmcd_pmip_control_cmd_cb (</PATH/TO>/src/pm/hydra/pm/pmiserv/pmip_cb.c:934): assert (!closed) failed
>>> [proxy:0:0 at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>> [proxy:0:0 at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
>>> [mpiexec at gretel] control_cb (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
>>> [mpiexec at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>> [mpiexec at gretel] HYD_pmci_wait_for_completion (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
>>> [mpiexec at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
>> 
>> These are cleanup messages.  You should have gotten an output that says so.
>> 
>>> No such file or directory. (errno = ENOENT)
>> 
>> This is the real error message.  Did you make sure your executables are located on all the nodes in the same location?
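A quick sanity check for that ENOENT is to verify the binary exists and is executable at the same path everywhere. In this hypothetical sketch, `/bin/ls` is only a stand-in for the application's full path, and the commented host loop uses placeholder hostnames, none of which come from the thread:

```shell
exe=/bin/ls   # stand-in path; substitute the full path to your application

# Check the local node first.
if [ -x "$exe" ]; then
    echo "found: $exe"
else
    echo "missing or not executable: $exe"
fi

# For a multi-node run, repeat the check on every host, e.g.:
#   for h in host1 host2; do ssh "$h" test -x "$exe" || echo "$h: missing"; done
```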
>> 
>> -- Pavan
>> 
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>> 
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji



