[mpich-discuss] MPI_Abort not working with multinode jobs launched by hydra-3.2

Doug Johnson djohnson at osc.edu
Mon Dec 12 07:16:44 CST 2016


Hi,

We've encountered a problem with hydra-3.2 and Intel MPI 5.1.3.210 when
running multi-node MPI programs.  A call to MPI_Abort causes all MPI
ranks running on the same node as the calling rank to exit, but leaves
the ranks on other nodes running.  The program then hangs on those
nodes indefinitely (at least until the batch job's time limit is
reached).  The behavior is the same with hydra-3.3a2.  The problem
does not occur with hydra-3.1.4, where all processes exit on all nodes.

Reverting commit 9882227414439a4a79edd49ec10261742bb60108 fixes this
problem with 3.2.

The hydra shipped with Intel MPI does not exhibit this problem, but we
are using an out-of-tree hydra because we want the pbs launcher enabled.
Is there a better mechanism for process cleanup than reverting the
patch above?  Let me know if any other information is needed.

Thanks,
Doug