[mpich-discuss] MPICH 3.2 failing in MPI_Init

Andrew Wood andrew at fluidgravity.co.uk
Thu Apr 7 07:49:50 CDT 2016


Hi,

I'm trying to get MPICH 3.2 working on our cluster, but jobs are failing in
MPI_Init with the following output if they are run on two or more nodes (4
processes per node):


Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(490).................:
MPID_Init(201)........................: channel initialization failed
MPIDI_CH3_Init(93)....................:
MPID_nem_init(285)....................:
MPIDI_CH3I_Seg_commit(366)............:
MPIU_SHMW_Hnd_deserialize(324)........:
MPIU_SHMW_Seg_open(867)...............:
MPIU_SHMW_Seg_create_attach_templ(638): open failed - No such file or directory
mpiexec: Error: handle_pmi: unknown cmd abort.



The full output above appears only intermittently. Sometimes only the last line
is printed (perhaps the job is aborted before stderr is flushed?).



Our cluster uses Torque 2.5.13 and Maui 3.3.1, and the jobs are launched with
mpiexec 0.84.



I've configured MPICH as follows.

./configure --enable-error-checking=all --enable-error-messages=all \
    --enable-g=all --disable-fast --enable-check-compiler-flags --enable-fortran=all \
    --enable-cxx --enable-romio --enable-debuginfo --enable-versioning --enable-strict



I've found the problem goes away if I include the option
'--enable-nemesis-dbg-nolocal', but presumably that could have an impact on
performance.
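
For reference, the workaround build can be sketched as below. This is only an
illustration: the variable names BASE_FLAGS and WORKAROUND_FLAG are made up for
the example, while the options themselves are the ones quoted in this message.
The script runs configure only when it is started from the top of an MPICH
source tree; anywhere else it just prints the command it would run.

```shell
#!/bin/sh
# Options copied from the configure line quoted in this message.
# (BASE_FLAGS and WORKAROUND_FLAG are illustrative names, not MPICH settings.)
BASE_FLAGS="--enable-error-checking=all --enable-error-messages=all \
--enable-g=all --disable-fast --enable-check-compiler-flags \
--enable-fortran=all --enable-cxx --enable-romio --enable-debuginfo \
--enable-versioning --enable-strict"

# The extra option reported above to make the MPI_Init failure go away.
WORKAROUND_FLAG="--enable-nemesis-dbg-nolocal"

# Run configure if we are in a source tree; otherwise only show the command.
if [ -x ./configure ]; then
    ./configure $BASE_FLAGS $WORKAROUND_FLAG
else
    echo "./configure $BASE_FLAGS $WORKAROUND_FLAG"
fi
```

Keeping the workaround option in its own variable makes it easy to drop again
when comparing against a build without it.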


The problem doesn't occur with MPICH 3.1.4, configured with the same options.


I've found this message in the mailing list archives, reporting the same problem:
http://lists.mpich.org/pipermail/discuss/2015-December/004352.html
However, that was on a system using SLURM, and the replies suggest that the
problem was with SLURM rather than MPICH. We're not using SLURM on our system.

Can anyone help?


Regards,
Andy.


-- 
Dr Andrew Wood
Fluid Gravity Engineering Ltd.
83 Market Street
St Andrews
Fife KY16 9NX
Tel: +44 (0)1334 460805
Fax: +44 (0)1334 460813

Fluid Gravity Engineering Ltd is registered in the UK with registration number
1674369. The registered address is Fluid Gravity Engineering Ltd, Unit 1, The
Old Coach House, 1 West Street, Emsworth, Hampshire, PO10 7DX.