[mpich-discuss] MPI_Abort not working with multinode jobs launched by hydra-3.2

Doug Johnson djohnson at osc.edu
Tue Dec 13 14:13:28 CST 2016


Hi Min,

Thanks for the clarification about avoiding version mismatch.
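
For anyone checking their own setup, here is a minimal sketch of a
mismatch check. The function names are hypothetical, and treating
"same major.minor" as a match is only my assumption, inferred from the
versions reported in this thread (hydra-3.1.4 worked with the
3.1.2-based library; hydra-3.2 and hydra-3.3a2 did not). It is not an
official MPICH compatibility rule, and mismatched versions remain
unsupported either way.

```python
import re


def major_minor(version):
    """Return (major, minor) parsed from a string such as '3.1.2' or '3.3a2'."""
    match = re.match(r"\s*v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_release_series(library_base, launcher):
    """Assumption: two versions 'match' when major and minor numbers agree."""
    return major_minor(library_base) == major_minor(launcher)


if __name__ == "__main__":
    # Intel MPI 5.1.3.210 is based on MPICH v3.1.2 (per Min's message).
    library_base = "3.1.2"
    for hydra in ("3.1.4", "3.2", "3.3a2"):
        status = "same series" if same_release_series(library_base, hydra) else "mismatch"
        print(f"library {library_base} with hydra {hydra}: {status}")
```

The version strings have to be supplied by hand here; the sketch does
not query the MPI library or mpiexec.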

Looking more closely at the mpiexec hydra included with Intel MPI, I see
that it ships with a library named libjmi_pbs.so.1.0 that contains the TM
calls needed to fully integrate with Torque for job launch.  So there is
no need for an out-of-tree mpiexec hydra.

Doug


Min Si <msi at anl.gov> writes:

> Hi Doug,
>
> We changed the internal implementation of MPI_Abort in both the MPICH
> code and hydra in 3.2. However, Intel MPI 5.1.3.210 is based on an
> older MPICH version, v3.1.2. So if you compile a program with that
> Intel MPI release and execute the binary with hydra v3.2, processes
> on remote nodes might not be able to exit.
>
> We do not support using mismatched versions of MPICH and hydra.
> You should try a newer version of Intel MPI if you want to use hydra 3.2.
> AFAIK, Intel MPI 2017.0.064 is based on MPICH v3.2.
>
> Min
>
> On 12/12/16 7:16 AM, Doug Johnson wrote:
>> Hi,
>>
>> We've encountered a problem with hydra-3.2 and Intel MPI 5.1.3.210 with
>> multi-node MPI programs.  A call to MPI_Abort causes all MPI ranks
>> running on the same node as the rank that called MPI_Abort to exit,
>> but leaves the other ranks running.  The program hangs on the other
>> nodes indefinitely (at least until the time limit of the batch job is
>> reached).  The behavior is the same with hydra-3.3a2.  The problem
>> does not occur with hydra-3.1.4; all processes exit on all nodes.
>>
>> Reverting commit 9882227414439a4a79edd49ec10261742bb60108 fixes this
>> problem with 3.2.
>>
>> The hydra shipped with Intel MPI does not exhibit this problem, but we
>> are using an out-of-tree hydra because we want the pbs launcher enabled.
>> Is there a better mechanism for process cleanup than reverting the
>> patch above?  Let me know if other information is needed.
>>
>> Thanks,
>> Doug
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss

