[mpich-discuss] MPICH 3.2 failing in MPI_Init

Andrew Wood andrew at fluidgravity.co.uk
Fri Apr 8 04:50:21 CDT 2016


That fixes it, and the job now runs successfully. So it looks like the problem
is associated with mpiexec 0.84.

Thanks,
Andy.

On 07/04/16 18:52, Kenneth Raffenetti wrote:
> The MPICH-built mpiexec is probably missing support for your job launcher on
> the cluster. You can provide a flag (--with-pbs=<install_dir>) at configure
> time to tell the build where to look.
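> 
> For example, assuming Torque is installed under /opt/torque (that path is
> just a placeholder for your actual install directory), the configure step
> would look something like:
> 
>     ./configure --with-pbs=/opt/torque <other options>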
> 
> Ken
> 
> On 04/07/2016 10:03 AM, Andrew Wood wrote:
>> I've now tried the MPICH-built version of mpiexec, using our desktops as
>> nodes instead of our cluster. The job runs successfully in that case.
>>
>> Andy.
>>
>> On 07/04/16 16:59, Andrew Wood wrote:
>>> Thanks for the response.
>>>
>>> Using the MPICH-built mpiexec gives this error message:
>>>
>>> ssh_askpass: exec(/usr/lib/ssh/ksshaskpass): No such file or directory
>>> Host key verification failed.
>>>
>>>
>>> which presumably means I'd have to do something with the ssh configuration
>>> on the nodes. The same error occurs with MPICH 3.1.4, but as I said, MPICH
>>> 3.1.4 works fine with mpiexec 0.84.
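>>>
>>> A minimal sketch of such a fix, with node01 and node02 standing in for
>>> the actual node hostnames, would be to pre-populate known_hosts so that
>>> ssh never falls back to an askpass prompt:
>>>
>>>     ssh-keyscan node01 node02 >> ~/.ssh/known_hosts
>>>
>>> run once from the submission host before launching the job.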
>>>
>>> Andy.
>>>
>>>
>>>
>>> On 07/04/16 15:55, Kenneth Raffenetti wrote:
>>>> Just to be sure, can you use the mpiexec that is built/installed with MPICH 3.2?
>>>> You mention mpiexec version 0.84 below, so that's the first thing I would try.
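>>>>
>>>> A quick way to check which mpiexec a job picks up, assuming MPICH 3.2 was
>>>> installed under /opt/mpich-3.2 (that prefix is just an example), is:
>>>>
>>>>     which mpiexec
>>>>     /opt/mpich-3.2/bin/mpiexec -n 8 ./your_app
>>>>
>>>> where ./your_app stands in for your application binary.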
>>>>
>>>> Ken
>>>>
>>>> On 04/07/2016 05:49 AM, Andrew Wood wrote:
>>>>> Hi,
>>>>>
>>>>> I'm trying to get MPICH 3.2 working on our cluster, but jobs are failing in
>>>>> MPI_Init with the following output if they are run on two or more nodes (4
>>>>> processes per node):
>>>>>
>>>>>
>>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>>> MPIR_Init_thread(490).................:
>>>>> MPID_Init(201)........................: channel initialization failed
>>>>> MPIDI_CH3_Init(93)....................:
>>>>> MPID_nem_init(285)....................:
>>>>> MPIDI_CH3I_Seg_commit(366)............:
>>>>> MPIU_SHMW_Hnd_deserialize(324)........:
>>>>> MPIU_SHMW_Seg_open(867)...............:
>>>>> MPIU_SHMW_Seg_create_attach_templ(638): open failed - No such file or directory
>>>>> mpiexec: Error: handle_pmi: unknown cmd abort.
>>>>>
>>>>>
>>>>>
>>>>> The full output above only occurs intermittently. Sometimes only the last line
>>>>> appears (job aborted before stderr is flushed?).
>>>>>
>>>>>
>>>>>
>>>>> Our cluster uses Torque 2.5.13 and Maui 3.3.1, and the jobs are launched with
>>>>> mpiexec 0.84.
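>>>>>
>>>>> For reference, the job script is essentially of this shape, where
>>>>> ./my_app is a placeholder and the resource request matches the failing
>>>>> case of two nodes with four processes each:
>>>>>
>>>>>     #PBS -l nodes=2:ppn=4
>>>>>     cd $PBS_O_WORKDIR
>>>>>     mpiexec ./my_app
>>>>>
>>>>> (With mpiexec 0.84 the process count is taken from the PBS node file, so
>>>>> no -n flag is needed.)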
>>>>>
>>>>>
>>>>>
>>>>> I've configured MPICH as follows.
>>>>>
>>>>> ./configure --enable-error-checking=all --enable-error-messages=all \
>>>>>             --enable-g=all --disable-fast --enable-check-compiler-flags \
>>>>>             --enable-fortran=all --enable-cxx --enable-romio \
>>>>>             --enable-debuginfo --enable-versioning --enable-strict
>>>>>
>>>>>
>>>>>
>>>>> I've found that the problem goes away if I include the configure option
>>>>> '--enable-nemesis-dbg-nolocal', but presumably that could have an impact
>>>>> on performance.
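>>>>>
>>>>> (That workaround amounts to appending the flag to the configure line
>>>>> above, i.e. something like
>>>>>
>>>>>     ./configure <options as above> --enable-nemesis-dbg-nolocal
>>>>>
>>>>> and, as I understand it, it forces nemesis to route even on-node traffic
>>>>> through the network module instead of shared memory, hence the
>>>>> performance concern.)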
>>>>>
>>>>>
>>>>> The problem doesn't occur with MPICH 3.1.4, configured with the same options.
>>>>>
>>>>>
>>>>> I've found this message in the mailing list archives, reporting the same
>>>>> problem:
>>>>> http://lists.mpich.org/pipermail/discuss/2015-December/004352.html
>>>>> However, that was on a system using SLURM, and the replies suggest the
>>>>> problem was with SLURM rather than MPICH; we're not using SLURM on our
>>>>> system.
>>>>>
>>>>> Can anyone help?
>>>>>
>>>>>
>>>>> Regards,
>>>>> Andy.
>>>>>
>>>>>


-- 
Dr Andrew Wood
Fluid Gravity Engineering Ltd.
83 Market Street
St Andrews
Fife KY16 9NX
Tel: +44 (0)1334 460805
Fax: +44 (0)1334 460813

Fluid Gravity Engineering Ltd is registered in the UK with registration number
1674369. The registered address is Fluid Gravity Engineering Ltd, Unit 1, The
Old Coach House, 1 West Street, Emsworth, Hampshire, PO10 7DX.

