[mpich-discuss] mpiexec crash

Jain, Rohit Rohit_Jain at mentor.com
Tue Oct 29 00:56:07 CDT 2013


Unfortunately, there is no other indication of what went wrong. 

When should we expect these Hydra messages to show up?
Do they point to an application problem or to an issue in Hydra itself?

Regards,
Rohit
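
Editorial sketch: as explained further down this thread, the bracketed [proxy:...] and [mpiexec ...] lines are Hydra's cleanup messages, and the real error, when there is one, is a separate line (the ENOENT message in the earlier run). One way to confirm that nothing else was printed is to capture the run's output and separate the two kinds of lines. The Python below is a minimal sketch under assumptions: the prefix pattern is inferred only from the log quoted in this thread (with "@" where the archive shows " at "), each message is assumed to sit on one line, and split_hydra_lines is a hypothetical helper name, not part of MPICH.

```python
import re

# Prefix inferred from the log in this thread, e.g. "[proxy:0:0@gretel]" or
# "[mpiexec@gretel]". The archive obfuscates "@" as " at ", so accept both.
# This is an assumption, not a documented Hydra output format.
HYDRA_PREFIX = re.compile(r"^\[(?:proxy:\d+:\d+|mpiexec)(?:@| at )[^\]]+\]")


def split_hydra_lines(lines):
    """Split log lines into (hydra_lines, other_lines), skipping blank lines."""
    hydra, other = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if HYDRA_PREFIX.match(text):
            hydra.append(text)
        else:
            other.append(text)
    return hydra, other


if __name__ == "__main__":
    import sys

    # Usage: python split_log.py captured_output.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        _, other_lines = split_hydra_lines(fh)
    for text in other_lines:
        print(text)
```

If the second list comes back empty, the application (or the system) printed nothing besides Hydra's own messages, which matches what is described above.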


-----Original Message-----
From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On Behalf Of Pavan Balaji
Sent: Monday, October 28, 2013 7:57 PM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] mpiexec crash

Hi Rohit,

In the cleanup message, did Hydra say what went wrong?  Maybe it said the application had a segmentation fault?

  -- Pavan
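
Editorial sketch: if Hydra's output does not name a cause, one way to check for a crash such as a segmentation fault is to run a single rank's executable directly and inspect how it exited. The Python below is a minimal sketch under assumptions: it runs on a POSIX system, the executable can start on its own outside mpiexec, and describe_exit and run_and_describe are hypothetical helper names. It says nothing about how mpiexec itself reports exit codes.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short description.

    On POSIX, subprocess reports a child killed by signal N as -N.
    """
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        return "killed by signal " + signal.Signals(-returncode).name
    return "exited with status " + str(returncode)


def run_and_describe(argv):
    """Run argv as a child process and describe how it ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Usage: python check_exit.py ./my_app arg1 arg2
    print(run_and_describe(sys.argv[1:]))
```

A "killed by signal SIGSEGV" result would point to the application rather than to Hydra.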

On Oct 28, 2013, at 9:43 PM, Jain, Rohit <Rohit_Jain at mentor.com> wrote:

> Hi Pavan,
> 
> I don't see any other messages from the application or Hydra. The run goes for a while and then ends abruptly with these messages.
> 
> Regards,
> Rohit
> 
> 
> -----Original Message-----
> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On Behalf Of Pavan Balaji
> Sent: Monday, October 28, 2013 7:32 PM
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] mpiexec crash
> 
> Rohit,
> 
> Can you send us the error messages?  What you sent are just the cleanup messages printed when Hydra noticed that your application died and cleaned up the remaining processes.  They don't give us the information we need.
> 
> Thanks,
> 
>  -- Pavan
> 
> On Oct 28, 2013, at 7:48 PM, Jain, Rohit <Rohit_Jain at mentor.com> wrote:
> 
>> Pavan,
>> 
>> We retried the runs. There is no ENOENT error now, but MPI is still failing consistently with the same error:
>> 
>>> [proxy:0:0 at gretel] HYD_pmcd_pmip_control_cmd_cb (</PATH/TO>/src/pm/hydra/pm/pmiserv/pmip_cb.c:934): assert (!closed) failed
>>> [proxy:0:0 at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>> [proxy:0:0 at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
>>> [mpiexec at gretel] control_cb (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
>>> [mpiexec at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>> [mpiexec at gretel] HYD_pmci_wait_for_completion (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
>>> [mpiexec at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
>> 
>> We are running all processes on the same machine as:
>> 	mpiexec -n 1 <exec> : -n 1 <exec> :.....
>> 
>> What would cause such an error to appear? How do we debug such issues?
>> 
>> Regards,
>> Rohit
>> 
>> 
>> -----Original Message-----
>> From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On Behalf Of Pavan Balaji
>> Sent: Wednesday, October 23, 2013 7:22 PM
>> To: discuss at mpich.org
>> Subject: Re: [mpich-discuss] mpiexec crash
>> 
>> 
>> On Oct 23, 2013, at 5:29 PM, Cherukumilli, Vasu <Vasu_Cherukumilli at mentor.com> wrote:
>>> Crash that we are seeing:
>>> 
>>> [proxy:0:0 at gretel] HYD_pmcd_pmip_control_cmd_cb (</PATH/TO>/src/pm/hydra/pm/pmiserv/pmip_cb.c:934): assert (!closed) failed
>>> [proxy:0:0 at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>> [proxy:0:0 at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmip.c:210): demux engine error waiting for event
>>> [mpiexec at gretel] control_cb (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:201): assert (!closed) failed
>>> [mpiexec at gretel] HYDT_dmxu_poll_wait_for_event (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>> [mpiexec at gretel] HYD_pmci_wait_for_completion (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
>>> [mpiexec at gretel] main (</PATH/TO>/src/mpich2-1.5/src/pm/hydra/ui/mpich/mpiexec.c:325): process manager error waiting for completion
>> 
>> These are cleanup messages.  You should have gotten output that says so.
>> 
>>> No such file or directory. (errno = ENOENT)
>> 
>> This is the real error message.  Did you make sure your executables are located on all the nodes in the same location?
>> 
>> -- Pavan
>> 
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>> 
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> 

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji



