[mpich-discuss] MPICH 3.2 failing in MPI_Init

Andrew Wood andrew at fluidgravity.co.uk
Thu Apr 7 10:59:21 CDT 2016


Thanks for the response.

Using the MPICH-built mpiexec gives this error message:

ssh_askpass: exec(/usr/lib/ssh/ksshaskpass): No such file or directory
Host key verification failed.


which presumably means I'd have to sort out the ssh host-key configuration on the
nodes (a sketch of what I have in mind is below). The same error occurs with MPICH
3.1.4, but as I said, MPICH 3.1.4 works fine with mpiexec 0.84.
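
The ssh_askpass line seems to just be ssh falling back to an askpass helper
because it wants to confirm a host key and has no terminal to ask on. Assuming
passwordless ssh keys are already set up, and with 'node01'/'node02' standing in
for our actual node names, pre-populating known_hosts on each node ought to
avoid the prompt:

   ssh-keyscan -H node01 node02 >> ~/.ssh/known_hosts

or, alternatively, the check could be relaxed for the cluster nodes in
~/.ssh/config:

   Host node*
       StrictHostKeyChecking no

Once 'ssh node01 hostname' completes from each node without prompting, the
MPICH-built mpiexec (Hydra), which starts remote processes over ssh by default,
should get past this error.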

Andy.



On 07/04/16 15:55, Kenneth Raffenetti wrote:
> Just to be sure, can you use the mpiexec that is built/installed with MPICH 3.2?
> You mention mpiexec version 0.84 below, so that's the first thing I would try.
> 
> Ken
> 
> On 04/07/2016 05:49 AM, Andrew Wood wrote:
>> Hi,
>>
>> I'm trying to get MPICH 3.2 working on our cluster, but jobs are failing in
>> MPI_Init with the following output if they are run on two or more nodes (4
>> processes per node):
>>
>>
>> Fatal error in MPI_Init: Other MPI error, error stack:
>> MPIR_Init_thread(490).................:
>> MPID_Init(201)........................: channel initialization failed
>> MPIDI_CH3_Init(93)....................:
>> MPID_nem_init(285)....................:
>> MPIDI_CH3I_Seg_commit(366)............:
>> MPIU_SHMW_Hnd_deserialize(324)........:
>> MPIU_SHMW_Seg_open(867)...............:
>> MPIU_SHMW_Seg_create_attach_templ(638): open failed - No such file or directory
>> mpiexec: Error: handle_pmi: unknown cmd abort.
>>
>>
>>
>> The full output above only occurs intermittently. Sometimes only the last line
>> appears (job aborted before stderr is flushed?).
>>
>>
>>
>> Our cluster uses Torque 2.5.13 and Maui 3.3.1, and the jobs are launched with
>> mpiexec 0.84.
>>
>>
>>
>> I've configured MPICH as follows.
>>
>> ./configure --enable-error-checking=all --enable-error-messages=all \
>>   --enable-g=all --disable-fast --enable-check-compiler-flags \
>>   --enable-fortran=all --enable-cxx --enable-romio --enable-debuginfo \
>>   --enable-versioning --enable-strict
>>
>>
>>
>> I've found the problem goes away if I include the option
>> '--enable-nemesis-dbg-nolocal', but presumably that could have an impact on
>> performance.
>>
>>
>> The problem doesn't occur with MPICH 3.1.4, configured with the same options.
>>
>>
>> I've found a message in the mailing list archives reporting the same problem:
>> http://lists.mpich.org/pipermail/discuss/2015-December/004352.html
>> However, that was on a system using SLURM, and the replies suggest that the
>> problem was with SLURM rather than MPICH; we're not using SLURM on our system.
>>
>> Can anyone help?
>>
>>
>> Regards,
>> Andy.
>>
>>
> 


-- 
Dr Andrew Wood
Fluid Gravity Engineering Ltd.
83 Market Street
St Andrews
Fife KY16 9NX
Tel: +44 (0)1334 460805
Fax: +44 (0)1334 460813


