[mpich-discuss] MPICH 3.2 failing in MPI_Init

Andrew Wood andrew at fluidgravity.co.uk
Thu Apr 7 12:03:30 CDT 2016


I've now tried using the MPICH-built version of mpiexec, but using our desktops
as nodes, instead of our cluster. The job runs successfully in that case.

Andy.

On 07/04/16 16:59, Andrew Wood wrote:
> Thanks for the response.
> 
> Using the MPICH-built mpiexec gives this error message:
> 
> ssh_askpass: exec(/usr/lib/ssh/ksshaskpass): No such file or directory
> Host key verification failed.
> 
> 
> which presumably means I'd have to do something with the ssh configuration on
> the nodes. The same error occurs with MPICH 3.1.4, but like I said, MPICH 3.1.4
> works fine with mpiexec 0.84.
> 
> Andy.
> 
> 
> 
> On 07/04/16 15:55, Kenneth Raffenetti wrote:
>> Just to be sure, can you use the mpiexec that is built/installed with MPICH 3.2?
>> You mention mpiexec version 0.84 below, so that's the first thing I would try.
>>
>> Ken
>>
>> On 04/07/2016 05:49 AM, Andrew Wood wrote:
>>> Hi,
>>>
>>> I'm trying to get MPICH 3.2 working on our cluster, but jobs are failing in
>>> MPI_Init with the following output if they are run on two or more nodes (4
>>> processes per node):
>>>
>>>
>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>> MPIR_Init_thread(490).................:
>>> MPID_Init(201)........................: channel initialization failed
>>> MPIDI_CH3_Init(93)....................:
>>> MPID_nem_init(285)....................:
>>> MPIDI_CH3I_Seg_commit(366)............:
>>> MPIU_SHMW_Hnd_deserialize(324)........:
>>> MPIU_SHMW_Seg_open(867)...............:
>>> MPIU_SHMW_Seg_create_attach_templ(638): open failed - No such file or directory
>>> mpiexec: Error: handle_pmi: unknown cmd abort.
>>>
>>>
>>>
>>> The full output above only occurs intermittently. Sometimes only the last line
>>> appears (job aborted before stderr is flushed?).
>>>
>>>
>>>
>>> Our cluster uses Torque 2.5.13 and Maui 3.3.1, and the jobs are launched with
>>> mpiexec 0.84.
>>>
>>>
>>>
>>> I've configured MPICH as follows.
>>>
>>> ./configure --enable-error-checking=all --enable-error-messages=all
>>> --enable-g=all --disable-fast --enable-check-compiler-flags --enable-fortran=all
>>> --enable-cxx --enable-romio --enable-debuginfo --enable-versioning
>>> --enable-strict
>>>
>>>
>>>
>>> I've found the problem goes away if I include the option
>>> '--enable-nemesis-dbg-nolocal', but presumably that could have an impact on
>>> performance.
>>>
>>>
>>> The problem doesn't occur with MPICH 3.1.4, configured with the same options.
>>>
>>>
>>> I've found this message in the mailing list archives, reporting the same problem
>>> http://lists.mpich.org/pipermail/discuss/2015-December/004352.html
>>> However, that was on a system using SLURM, and the replies suggest that the
>>> problem was with SLURM rather than MPICH, and we're not using SLURM on our
>>> system.
>>>
>>> Can anyone help?
>>>
>>>
>>> Regards,
>>> Andy.
>>>
>>>
>> .
>>
> 
> 


-- 
Dr Andrew Wood
Fluid Gravity Engineering Ltd.
83 Market Street
St Andrews
Fife KY16 9NX
Tel: +44 (0)1334 460805
Fax: +44 (0)1334 460813

Fluid Gravity Engineering Ltd is registered in the UK with registration number
1674369. The registered address is Fluid Gravity Engineering Ltd, Unit 1, The
Old Coach House, 1 West Street, Emsworth, Hampshire, PO10 7DX.


More information about the discuss mailing list