[mpich-devel] Fatal error in PMPI_Isend: Unknown error class

Brent Morgan brent.taylormorgan at gmail.com
Thu Feb 18 01:02:31 CST 2021


Hi all,

As usual, we resolved our own issue.  It turns out we were overloading
MPI_Isend/MPI_Irecv when sending/receiving a vector of ~100 elements.
We improved this by adjusting the grid configuration of our
sub-communicators, but that failed further down the road.  We then switched
to MPI_Send/MPI_Recv, which we found to be more robust, but our experiments
show this is slower (for reasons that differ between MPI_Send and
MPI_Isend).  The reason we are using send/recv and not MPI_Bcast is that
MPI_Bcast led to undesired timing behavior.  It turns out (as we learned on
our own) that a broadcast sends data down a tree-like arrangement of ranks
(when messages are small and communication time is dominated by network
latency) or down a ring of ranks (with large messages). If a rank on an
intermediate tree level is late, it delays the broadcast to all other ranks
beneath it in the sub-tree. Similarly, in the ring arrangement, a late rank
delays all ranks after it in the ring. This can lead to delays in subgroups
of ranks if the ranks are not aligned in time (as is the case for us).
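
To make the tree case concrete, here is a minimal Python sketch (a toy
model, not MPICH code). It assumes a binomial tree rooted at rank 0 in which
each rank's parent is found by clearing its lowest set bit, a single fixed
per-hop latency, and no serialization of a parent's sends. The function
names and these simplifications are our own assumptions, and the tree MPICH
actually builds may differ.

```python
def parent(rank):
    """Parent in the assumed binomial tree: clear the lowest set bit."""
    return rank & (rank - 1)


def bcast_finish_times(ready, latency=1.0):
    """Time at which each rank holds the data.

    ready[r] is when rank r enters the broadcast; rank 0 is the root.
    A rank can only receive once both it and its parent are ready.
    """
    finish = [0.0] * len(ready)
    finish[0] = ready[0]
    for r in range(1, len(ready)):
        # parent(r) < r, so finish[parent(r)] is already final here
        finish[r] = max(ready[r], finish[parent(r)]) + latency
    return finish


def delayed_ranks(n, late_rank, delay, latency=1.0):
    """Ranks that finish later when late_rank enters `delay` late."""
    on_time = bcast_finish_times([0.0] * n, latency)
    ready = [0.0] * n
    ready[late_rank] = delay
    late = bcast_finish_times(ready, latency)
    return [r for r in range(n) if late[r] > on_time[r]]
```

With 8 ranks and rank 4 arriving 10 time units late, delayed_ranks(8, 4,
10.0) reports ranks 4 through 7, i.e. rank 4 and its sub-tree, while ranks
1 to 3 finish on time.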

Our team is wondering why there isn't *any* support on these email
discussion lists. Is this the official place to get help for MPICH?  Are
there any other places, like GitHub discussions?  I would imagine NASA,
Argonne National Laboratory, or other government laboratories need to
communicate issues that arise, or need to send/receive much more data than
a vector of 100 elements without issues. The more we implement MPICH on our
end, the less confident we are about this MPI implementation.  The errors
we have observed have no documented error codes or error traces. We have
encountered many issues thus far (and reported them, with no responses),
which led to us scanning the MPICH source code and trying random things to
arrive at solutions, which is far from ideal in any software development
environment.

We've reached the point where, if we continue to experience more issues,
we will be forced to switch to another implementation (Open MPI), because
none of the questions we've submitted were answered or considered at all
on either discuss at mpich.org or devel at mpich.org, and I see others'
questions go unanswered as well.  We don't understand why there isn't any
documentation (at all) on some of the things we've encountered (and others
have as well).

Best,
Brent

On Sun, Feb 14, 2021 at 12:20 AM Brent Morgan <brent.taylormorgan at gmail.com>
wrote:

> Hello all,
>
> I am seeing this:
>
> "Abort(105) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Isend:
> Unknown error class"
>
> This happens after the float vector I am sending to the nodes grew to a
> size of ~70 elements.  I cannot find any documentation about what this
> means.  I tracked down where this fails: upon sending to the 300th
> process (of 600), the MPI_Isend() call dies and this error shows.
>
> Is there anything I can do to further diagnose the issue?
>
> Best,
> Brent
>