[mpich-devel] Fatal error in PMPI_Isend: Unknown error class

Zhou, Hui zhouh at anl.gov
Thu Feb 18 08:47:17 CST 2021


Dear Brent Morgan,

I am glad you resolved your own issue. But I guess you are left with a similar aftertaste, that your resolution is less than ideal. Is that the reason for your ranting? It appears the original issue was with the MPI_Bcast algorithm: in your particular situation, neither the tree algorithm nor the ring algorithm performs well. For the record, this is the first time I have had a clue about where the issue lies. As I recall, the past emails only complained about Isend and Irecv.

Two key steps must happen before good communication or a resolution is possible. First, all the key information needs to be communicated. I am not shifting the blame to users. This can be very difficult, especially when we don’t know which information is critical, and it is very difficult or impossible before we understand what the issue is: a chicken-and-egg problem. So the best we can do is, first, provide as much detail as we (meaning you) think is relevant; second, keep looking for new details that we omitted; and third, be understandable.

The second key step is, of course, that we (whoever you are asking for help) need to know the solution. Trust me, if we *understand* the issue and *know* the solution, we can’t wait to help you. But when we don’t understand your issue (for lack of information or probes) or don’t know the solution (research problems), you may feel that we are completely ignoring your request for help. What response would you suggest when we have no clue?
--
Hui Zhou


From: Brent Morgan via devel <devel at mpich.org>
Date: Thursday, February 18, 2021 at 1:02 AM
To: discuss at mpich.org <discuss at mpich.org>, devel at mpich.org <devel at mpich.org>, Robert Katona <robert.katona at hotmail.com>
Cc: Brent Morgan <brent.taylormorgan at gmail.com>
Subject: Re: [mpich-devel] Fatal error in PMPI_Isend: Unknown error class

Hi all,

As usual, we resolved our own issue.  It turns out that MPI_Isend/MPI_Irecv were being overloaded when sending/receiving a vector of ~100 elements.  We improved this by adjusting the grid configuration of sub-communicators, but that failed further down the road.  We then switched to MPI_Send/MPI_Recv, which we found to be more robust, although our experiments show it is slower (for reasons that differ between MPI_Send and MPI_Isend).  The reason we are using send/recv rather than MPI_Bcast is that MPI_Bcast led to undesired timing behavior.  It turns out (as we learned on our own) that broadcast sends data down a tree-like arrangement of ranks (when messages are small and communication time is dominated by network latency) or down a ring of ranks (with large messages). If a rank on an intermediate tree level is late, it delays the broadcast to all other ranks beneath it in the sub-tree. Similarly, in the ring arrangement, a late rank delays all ranks after it in the ring. This can lead to delays in subgroups of ranks if the ranks are not aligned in time (as is the case for us).
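
The delay behavior described above can be illustrated with a small model. This is a minimal sketch, not MPICH's actual implementation: it assumes one common binomial-tree layout (a rank's parent is the rank with its lowest set bit cleared), a simple chain for the ring, a fixed per-hop latency, and it ignores that a parent sends to its children one after another. The names `tree_parent`, `ring_parent`, and `arrival_times` are hypothetical.

```python
def tree_parent(rank):
    """Parent in a binomial tree rooted at rank 0 (assumed layout):
    clear the lowest set bit of the rank."""
    return rank & (rank - 1)


def ring_parent(rank):
    """Predecessor in a simple chain 0 -> 1 -> 2 -> ... (assumed layout)."""
    return rank - 1


def arrival_times(ready, parent_of, hop=1.0):
    """Return the time at which each rank holds the broadcast data.

    ready[r] is the time rank r enters the broadcast. A rank holds the
    data once it has entered the call AND its parent's copy has
    travelled one hop. Both layouts guarantee parent_of(r) < r, so one
    pass in increasing rank order is enough.
    """
    have = [0.0] * len(ready)
    have[0] = ready[0]  # the root holds the data as soon as it enters
    for rank in range(1, len(ready)):
        have[rank] = max(ready[rank], have[parent_of(rank)] + hop)
    return have


# Eight ranks, all on time except rank 2, which enters 10 time units late.
ready = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print("tree:", arrival_times(ready, tree_parent))
print("ring:", arrival_times(ready, ring_parent))
```

Under these assumptions, the late rank 2 holds back only its own sub-tree (rank 3) in the tree layout, while in the ring layout every rank from 2 onward is delayed.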

Our team is wondering why there isn't any support on these email discussion boards. Is this the official place to get help for MPICH?  Are there any other places, like GitHub discussions?  I would imagine NASA, Argonne National Laboratory, or other government laboratories need to communicate issues that arise, or need to send/receive much more data than a vector of 100 elements without issues. The more we implement MPICH on our end, the less confident we are about this MPI implementation.  The errors we have observed have no documented error codes or error traces; we have encountered many issues thus far (and reported them, with no responses), which led to us scanning the MPICH source code and trying random things to arrive at solutions, which is far from ideal in any software development environment.

We've arrived at a position where, if we continue to experience more issues, we will be forced to switch to another implementation (Open MPI), because none of the questions we've submitted were answered or considered at all on either discuss at mpich.org or devel at mpich.org, and I see others' questions go unanswered as well.  We don't understand why there isn't any documentation (at all) on some of the things we (and others) have encountered.

Best,
Brent

On Sun, Feb 14, 2021 at 12:20 AM Brent Morgan <brent.taylormorgan at gmail.com> wrote:
Hello all,

I am seeing this:

"Abort(105) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Isend: Unknown error class"

This happens after the float vector I am sending to the nodes has grown to a size of ~70 elements.  I cannot find any documentation about what this means.  I tracked down where this fails: upon sending to the 300th process (of 600), the MPI_Isend() call dies and this error shows.

Is there anything I can do to further diagnose the issue?

Best,
Brent