[mpich-discuss] mpiexec crash

Pavan Balaji balaji at mcs.anl.gov
Tue Oct 29 08:52:07 CDT 2013


These are typically cleanup messages, meaning something went wrong in the application.  Basically, the application terminated abruptly without correctly cleaning up its resources with hydra, so hydra went berserk and cleaned up all the remaining processes and associated resources.
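For instance (a contrived sketch, not from the failing run, launching a plain shell rather than a real MPI program), abruptly killing one launched process reproduces exactly this kind of cleanup:

	# One process dies with SIGSEGV; hydra notices and tears down the rest
	mpiexec -n 2 sh -c 'kill -SEGV $$'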

Perhaps there is room to add some more diagnostic messages to improve the error reporting.  But, for the time being, can you run your application through a debugger to see what’s going on?
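For example (a minimal sketch; ./app is a placeholder for your executable, and it assumes a working X11 session with xterm installed), you can start every process under gdb:

	# Launch each MPI process in its own xterm window running gdb
	mpiexec -n 2 xterm -e gdb ./app
	# In each window: "run", then "backtrace" once it crashes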

  -- Pavan

On Oct 29, 2013, at 12:56 AM, Jain, Rohit <Rohit_Jain at mentor.com> wrote:

> Unfortunately, there is no other indication of what went wrong. 
> 
> When do we expect these hydra messages to show up? 
> Does it point to application problem or issue in hydra itself?
> 
> Regards,
> Rohit
> 
> 
> -----Original Message-----
> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On Behalf Of Pavan Balaji
> Sent: Monday, October 28, 2013 7:57 PM
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] mpiexec crash
> 
> Hi Rohit,
> 
> In the cleanup message, did Hydra say what went wrong?  Maybe it said the application had a segmentation fault?
> 
>  -- Pavan
> 
> On Oct 28, 2013, at 9:43 PM, Jain, Rohit <Rohit_Jain at mentor.com> wrote:
> 
>> Hi Pavan,
>> 
>> I don't see any other messages from application or hydra. Run goes for a while and then ends abruptly with these messages.
>> 
>> Regards,
>> Rohit
>> 
>> 
>> -----Original Message-----
>> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On 
>> Behalf Of Pavan Balaji
>> Sent: Monday, October 28, 2013 7:32 PM
>> To: discuss at mpich.org
>> Subject: Re: [mpich-discuss] mpiexec crash
>> 
>> Rohit,
>> 
>> Can you send us the error messages?  What you sent are just the cleanup messages printed when Hydra noticed that your application died and cleaned up the remaining processes.  Those don't give us the information we need.
>> 
>> Thanks,
>> 
>> -- Pavan
>> 
>> On Oct 28, 2013, at 7:48 PM, Jain, Rohit <Rohit_Jain at mentor.com> wrote:
>> 
>>> Pavan,
>>> 
>>> We retried the runs. There is no ENOENT error now, but MPI is still failing consistently with the same error:
>>> 
>>>> [proxy:0:0 at gretel] HYD_pmcd_pmip_control_cmd_cb (</PATH/TO>/src/pm/hydra/pm/pmiserv/pmip_cb.c:934): assert (!closed) failed
>>>> [proxy:0:0 at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>>> [proxy:0:0 at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
>>>> [mpiexec at gretel] control_cb (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
>>>> [mpiexec at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>>> [mpiexec at gretel] HYD_pmci_wait_for_completion (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
>>>> [mpiexec at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
>>> 
>>> We are running it on the same machine as:
>>> 	mpiexec -n 1 <exec> : -n 1 <exec> :.....
>>> 
>>> What would cause such an error to appear? How do we debug such issues?
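>>> 
>>> One generic approach (a sketch; assumes a bash shell, with ./app standing in for the real executable) is to enable core dumps before the run and pull a backtrace from any core file afterwards:
>>> 
>>> 	# Allow core files to be written, then rerun the failing command
>>> 	ulimit -c unlimited
>>> 	mpiexec -n 1 ./app : -n 1 ./app
>>> 	# After the crash, inspect the core file
>>> 	gdb ./app core    # then: backtrace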
>>> 
>>> Regards,
>>> Rohit
>>> 
>>> 
>>> -----Original Message-----
>>> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On 
>>> Behalf Of Pavan Balaji
>>> Sent: Wednesday, October 23, 2013 7:22 PM
>>> To: discuss at mpich.org
>>> Subject: Re: [mpich-discuss] mpiexec crash
>>> 
>>> 
>>> On Oct 23, 2013, at 5:29 PM, Cherukumilli, Vasu <Vasu_Cherukumilli at mentor.com> wrote:
>>>> Crash that we are seeing:
>>>> 
>>>> [proxy:0:0 at gretel] HYD_pmcd_pmip_control_cmd_cb (</PATH/TO>/src/pm/hydra/pm/pmiserv/pmip_cb.c:934): assert (!closed) failed
>>>> [proxy:0:0 at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>>> [proxy:0:0 at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
>>>> [mpiexec at gretel] control_cb (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
>>>> [mpiexec at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>>> [mpiexec at gretel] HYD_pmci_wait_for_completion (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
>>>> [mpiexec at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
>>> 
>>> These are cleanup messages.  You should have gotten output that says so.
>>> 
>>>> No such file or directory. (errno = ENOENT)
>>> 
>>> This is the real error message.  Did you make sure your executables are located on all the nodes in the same location?
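>>> 
>>> For example (a sketch with hypothetical hostnames and path, assuming passwordless ssh), a quick check that the binary exists at the same path everywhere:
>>> 
>>> 	# Verify the executable is present on each node
>>> 	for h in node1 node2; do
>>> 	    ssh "$h" ls -l /path/to/exec || echo "missing on $h"
>>> 	done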
>>> 
>>> -- Pavan
>>> 
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>> 
>>> _______________________________________________
>>> discuss mailing list     discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>> 
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>> 
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji



