[mpich-discuss] MPICH 3.2 failing in MPI_Init

Kenneth Raffenetti raffenet at mcs.anl.gov
Thu Apr 7 12:52:13 CDT 2016


Your build is probably missing support for the job launcher on your cluster.
You can pass a flag (--with-pbs=<install_dir>) to tell configure where to
look.
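
For example (the install path below is only a guess; point it at wherever
Torque is actually installed on your system):

  ./configure --with-pbs=/usr/local/torque [...the rest of your options...]

Hydra should then be able to start the remote processes through the PBS TM
interface instead of falling back to ssh.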

Ken
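
PS: You can also tell Hydra which launcher to use at run time, e.g. (rsh here
is just an example; use whatever remote-shell mechanism actually works on
your nodes):

  mpiexec -launcher rsh -n 8 ./a.out

That might sidestep the ssh_askpass/host-key trouble you saw with the default
ssh launcher.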

On 04/07/2016 10:03 AM, Andrew Wood wrote:
> I've now tried the MPICH-built mpiexec, but with our desktops as nodes
> instead of our cluster. The job runs successfully in that case.
>
> Andy.
>
> On 07/04/16 16:59, Andrew Wood wrote:
>> Thanks for the response.
>>
>> Using the MPICH-built mpiexec gives this error message:
>>
>> ssh_askpass: exec(/usr/lib/ssh/ksshaskpass): No such file or directory
>> Host key verification failed.
>>
>>
>> which presumably means I'd have to change the ssh configuration on the
>> nodes. The same error occurs with MPICH 3.1.4's mpiexec, but, as I said,
>> MPICH 3.1.4 works fine when launched with mpiexec 0.84.
>>
>> Andy.
>>
>>
>>
>> On 07/04/16 15:55, Kenneth Raffenetti wrote:
>>> Just to be sure, can you use the mpiexec that is built/installed with MPICH 3.2?
>>> You mention mpiexec version 0.84 below, so that's the first thing I would try.
>>>
>>> Ken
>>>
>>> On 04/07/2016 05:49 AM, Andrew Wood wrote:
>>>> Hi,
>>>>
>>>> I'm trying to get MPICH 3.2 working on our cluster, but jobs are failing in
>>>> MPI_Init with the following output if they are run on two or more nodes (4
>>>> processes per node):
>>>>
>>>>
>>>> Fatal error in MPI_Init: Other MPI error, error stack:
>>>> MPIR_Init_thread(490).................:
>>>> MPID_Init(201)........................: channel initialization failed
>>>> MPIDI_CH3_Init(93)....................:
>>>> MPID_nem_init(285)....................:
>>>> MPIDI_CH3I_Seg_commit(366)............:
>>>> MPIU_SHMW_Hnd_deserialize(324)........:
>>>> MPIU_SHMW_Seg_open(867)...............:
>>>> MPIU_SHMW_Seg_create_attach_templ(638): open failed - No such file or directory
>>>> mpiexec: Error: handle_pmi: unknown cmd abort.
>>>>
>>>>
>>>>
>>>> The full output above appears only intermittently; sometimes only the last
>>>> line is printed (perhaps the job is aborted before stderr is flushed?).
>>>>
>>>>
>>>>
>>>> Our cluster uses Torque 2.5.13 and Maui 3.3.1, and the jobs are launched with
>>>> mpiexec 0.84.
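>>>>
>>>> The job script looks roughly like this (simplified; 'my_app' stands in
>>>> for the real executable):
>>>>
>>>>   #PBS -l nodes=2:ppn=4
>>>>   mpiexec ./my_app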
>>>>
>>>>
>>>>
>>>> I've configured MPICH as follows.
>>>>
>>>> ./configure --enable-error-checking=all --enable-error-messages=all \
>>>>     --enable-g=all --disable-fast --enable-check-compiler-flags \
>>>>     --enable-fortran=all --enable-cxx --enable-romio --enable-debuginfo \
>>>>     --enable-versioning --enable-strict
>>>>
>>>>
>>>>
>>>> I've found that the problem goes away if I add the option
>>>> '--enable-nemesis-dbg-nolocal', but presumably that hurts performance,
>>>> since it disables shared-memory communication between processes on the
>>>> same node.
>>>>
>>>>
>>>> The problem doesn't occur with MPICH 3.1.4, configured with the same options.
>>>>
>>>>
>>>> I've found a message in the mailing-list archives reporting the same problem:
>>>> http://lists.mpich.org/pipermail/discuss/2015-December/004352.html
>>>> However, that was on a system using SLURM, and the replies suggest the
>>>> problem lay with SLURM rather than MPICH; we're not using SLURM on our
>>>> system.
>>>>
>>>> Can anyone help?
>>>>
>>>>
>>>> Regards,
>>>> Andy.
>>>>
>>>>
>>>
>>
>>
>
>