[mpich-discuss] Parallel test hanging with mpich on rhel7

Balaji, Pavan balaji at anl.gov
Mon Feb 10 22:56:41 CST 2014


On 2/10/14, 10:47 PM, "Orion Poplawski" <orion at cora.nwra.com> wrote:

>On 02/06/2014 09:10 PM, Balaji, Pavan wrote:
>> 
>> Thanks.  That’s very useful analysis.  Would you be willing to try the
>> attached patch to see if it solves this issue?
>> 
>>   — Pavan
>
>Well, it seems to prevent a hang (although I'm also updating from 3.0.4
>to 3.1rc3 so not sure what is all changing here), but it does not run:

It might be easier to use the nightly snapshots to make sure you are not
missing some fixes:

http://www.mpich.org/static/tarballs/nightly/master/mpich/


The patch I sent, as well as a few other patches after 3.1rc3, are all
included in the nightly snapshots.

>============================
>Fatal error in MPI_Init: Other MPI error, error stack:
>MPIR_Init_thread(467)..............:
>MPID_Init(177).....................: channel initialization failed
>MPIDI_CH3_Init(70).................:
>MPID_nem_init(319).................:
>MPID_nem_tcp_init(171).............:
>MPID_nem_tcp_get_business_card(418):
>MPID_nem_tcp_init(377).............: gethostbyname failed, i-00001ff8
>(errno 1)
>Fatal error in MPI_Init: Other MPI error, error stack:
>MPIR_Init_thread(467)..............:
>MPID_Init(177).....................: channel initialization failed
>MPIDI_CH3_Init(70).................:
>MPID_nem_init(319).................:
>MPID_nem_tcp_init(171).............:
>MPID_nem_tcp_get_business_card(418):
>MPID_nem_tcp_init(377).............: gethostbyname failed, i-00001ff8
>(errno 1)

That’s really weird.  Errno 1 is EPERM ("Operation not permitted").  I don’t
know how that’s happening with gethostbyname.
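
One way to narrow this down is to check whether the node's hostname resolves at all outside of MPI. gethostbyname reports failures through h_errno rather than errno, and on glibc an h_errno value of 1 is HOST_NOT_FOUND, so one possibility (an assumption, not something confirmed in this thread) is that the name "i-00001ff8" simply does not resolve on that machine. The following is a minimal sketch in Python; the helper name check_hostname is hypothetical, and Python's resolver path is not necessarily identical to the call MPICH makes, so treat the result as a hint only.

```python
import socket


def check_hostname(name=None):
    """Try to resolve `name` (default: this machine's hostname) to an IPv4 address.

    Returns (hostname, address, error); exactly one of address/error is None.
    """
    host = name if name is not None else socket.gethostname()
    try:
        return host, socket.gethostbyname(host), None
    except OSError as exc:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return host, None, exc


if __name__ == "__main__":
    host, addr, err = check_hostname()
    if err is None:
        print(f"{host} resolves to {addr}")
    else:
        print(f"lookup of {host} failed: {err!r}")
```

If the lookup fails here as well, the problem is likely in the node's name resolution (for example /etc/hosts or DNS) rather than in MPICH itself.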

Can you send your mpiexec command line and a small program that reproduces
this error?  E.g., if a program that just does MPI_INIT/MPI_FINALIZE
reproduces this error, that’ll be best.

  — Pavan


