[mpich-discuss] Getting an error in MPI_Comm_accept - ideas?

Alexander Rast alex.rast.technical at gmail.com
Fri Sep 21 10:39:33 CDT 2018


All,

I'm running MPI_Comm_accept in a separate thread whose purpose is to allow
connections from other (hitherto unknown) MPI universes. A well-known issue
with such configurations is that, because MPI_Comm_accept is blocking, a
connection MUST be made for the application to exit, even if no other
universe ever attempted to connect. In this situation (see Gropp et al.,
"Using Advanced MPI", section 6.5) the 'customary' solution seems to be to
have the local universe connect to itself and then shut down.
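
For reference, the self-connect idiom in question looks roughly like this (a
minimal sketch with placeholder names, not my actual code, which follows
further below):

// Minimal sketch of the self-connect shutdown idiom (placeholder names).
// The Accept side blocks in MPI_Comm_accept; at shutdown the same universe
// connects to its own published port, so the accept returns and the thread
// can exit.
char port[MPI_MAX_PORT_NAME];
MPI_Open_port(MPI_INFO_NULL, port);
MPI_Publish_name("my_service", MPI_INFO_NULL, port); // "my_service" is a placeholder

// Accept thread: blocks here until someone (possibly ourselves) connects
MPI_Comm newcomm;
MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &newcomm);

// Main thread, at shutdown: unblock the accept by connecting to ourselves
MPI_Comm dcomm;
MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &dcomm);
MPI_Comm_disconnect(&dcomm);
MPI_Unpublish_name("my_service", MPI_INFO_NULL, port);
MPI_Close_port(port);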

However, I'm getting the following error on exit:

Fatal error in PMPI_Comm_accept: Message truncated, error stack:
PMPI_Comm_accept(129).............: MPI_Comm_accept(port="tag#0$description#
aesop.cl.cam.ac.uk$port#43337$ifname#128.232.98.176$", MPI_INFO_NULL,
root=6, MPI_COMM_WORLD, newcomm=0x1473f24) failed
MPID_Comm_accept(153).............:
MPIDI_Comm_accept(1005)...........:
MPIR_Bcast_intra(1249)............:
MPIR_SMP_Bcast(1088)..............:
MPIR_Bcast_binomial(239)..........:
MPIDI_CH3U_Receive_data_found(131): Message from rank 0 and tag 2
truncated; 260 bytes received but buffer size is 12

You typically get several of these messages; the number seems to vary from
trial to trial. I'm guessing they come from different MPI processes
(although there is no fixed relationship between the number of processes
started and the number of error messages).

Does anyone have any suggestions on what types of problems might cause this
error? (I'm not expecting you to identify and debug the problem
specifically, unless this error happens to be indicative of some particular
mistake; I'd just like some hints on where to look.)

If it helps, here are the three main functions involved. There is, of
course, a lot more going on in the application besides these, but the error
occurs at shutdown, and by that point the rest of the application is
quiescent. Also note that the Connect routine is NOT called at shutdown;
it's invoked when a 'real' universe wants to connect.
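
In case it matters, MPI is initialised along these lines (a sketch, not the
literal code; the point is that both the main thread and the Accept thread
make MPI calls, which requires MPI_THREAD_MULTIPLE):

// Sketch of the initialisation this setup depends on (not verbatim).
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE)
   printf("Warning: MPI library does not support MPI_THREAD_MULTIPLE\n");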

//------------------------------------------------------------------------------

void* CommonBase::Accept(void* par)
/* Blocking routine to connect to another MPI universe by publishing a port.
   This operates in a separate thread to avoid blocking the whole process.
*/
{
CommonBase* parent = static_cast<CommonBase*>(par);

while (parent->AcceptConns.load(std::memory_order_relaxed))
{
   // run the blocking accept itself.
   if (MPI_Comm_accept(parent->MPIPort.load(std::memory_order_seq_cst),
                       MPI_INFO_NULL,
                       parent->Lrank.load(std::memory_order_relaxed),
                       MPI_COMM_WORLD,
                       parent->Tcomm.load(std::memory_order_seq_cst)))
   {
      printf("Error: attempt to connect to another MPI universe failed\n");
      parent->AcceptConns.store(false,std::memory_order_relaxed);
      break;
   }
   // Now trigger the Connect process in the main thread to complete the setup
   PMsg_p Creq;
   string N("");             // zero-length string indicates a server-side connection
   Creq.Put(0,&N);
   Creq.Key(Q::SYST,Q::CONN);
   Creq.Src(0);
   Creq.comm = MPI_COMM_SELF;
   Creq.Send(0);
   // block (spin) until the main thread's Connect has succeeded
   while (*(parent->Tcomm.load(std::memory_order_seq_cst)) != MPI_COMM_NULL);
}
pthread_exit(par);
return par;
}
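
For context, the Accept thread itself is started roughly like this in the
setup code (a sketch reconstructed from the pthread_join in OnExit below,
not the literal code; it assumes Accept is declared static so it matches
pthread_create's signature):

// Presumed launch of the Accept thread (reconstruction, not actual code).
// MPI_accept is the pthread_t member that OnExit joins at shutdown.
AcceptConns.store(true, std::memory_order_relaxed);
if (pthread_create(&MPI_accept, NULL, CommonBase::Accept, this) == 0)
   acpt_running = true;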

//------------------------------------------------------------------------------

unsigned CommonBase::Connect(string svc)
// connects this process' MPI universe to a remote universe that has published
// a name to access it by.
{
int error = MPI_SUCCESS;
// a server has its port already so can just open a comm
if (svc=="") Comms.push_back(*Tcomm.load(std::memory_order_seq_cst));
else // clients need to look up the service name
{
   MPI_Comm newcomm;
   char port[MPI_MAX_PORT_NAME];
   // Get the published port for the service name asked for.
   // Exit if we don't get a port, probably because the remote universe isn't
   // initialised yet (we can always retry).
   if (error = MPI_Lookup_name(svc.c_str(),MPI_INFO_NULL,port)) return error;
   // now try to establish the connection itself. Again, we can always retry.
   if (error = MPI_Comm_connect(port,MPI_INFO_NULL,0,MPI_COMM_WORLD,&newcomm)) return error;
   Comms.push_back(newcomm);   // as long as we succeeded, add to the list of comms
}
int rUsize;
MPI_Comm_remote_size(Comms.back(), &rUsize);
Usize.push_back(rUsize);             // record the size of the remote universe
FnMapx.push_back(new FnMap_t);       // give the new comm some function tables to use
pPmap.push_back(new ProcMap(this));  // and a new processor map for the remote group
PMsg_p prMsg;
SendPMap(Comms.back(), &prMsg);      // Send our process data to the remote group
int fIdx = FnMapx.size()-1;
// populate the new function table with the global functions
(*FnMapx[fIdx])[Msg_p::KEY(Q::EXIT                )] = &CommonBase::OnExit;
(*FnMapx[fIdx])[Msg_p::KEY(Q::PMAP                )] = &CommonBase::OnPmap;
(*FnMapx[fIdx])[Msg_p::KEY(Q::SYST,Q::PING,Q::ACK )] = &CommonBase::OnSystPingAck;
(*FnMapx[fIdx])[Msg_p::KEY(Q::SYST,Q::PING,Q::REQ )] = &CommonBase::OnSystPingReq;
(*FnMapx[fIdx])[Msg_p::KEY(Q::TEST,Q::FLOO        )] = &CommonBase::OnTestFloo;
if (svc=="") *Tcomm.load(std::memory_order_seq_cst) = MPI_COMM_NULL; // release any Accept comm.
return error;
}
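
Hypothetical usage, purely for illustration (the service name below is a
placeholder): the Accept thread's Q::SYST/Q::CONN message ends up driving
the server-side path, Connect(""), while a client that knows the remote
universe's published name would do something like:

// Illustrative client-side call (placeholder service name, not real code).
unsigned err = Connect("remote_universe_service");
if (err != MPI_SUCCESS)
{
   // remote universe probably not initialised yet; we can retry later
}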

//------------------------------------------------------------------------------

unsigned CommonBase::OnExit(PMsg_p * Z, unsigned cIdx)
// Do not post anything further here - the LogServer may have already gone
{
AcceptConns.store(false,std::memory_order_relaxed); // stop accepting connections
if (acpt_running)
{
   printf("(%s)::CommonBase::OnExit closing down Accept MPI request\n",Sderived.c_str());
   fflush(stdout);
   // We have to close the Accept thread via a matching MPI_Comm_connect,
   // because the MPI interface makes no provision for a nonblocking accept;
   // otherwise the Accept thread would block forever waiting for a message
   // that will never come because we are shutting down. See Gropp, et al.
   // "Using Advanced MPI".
   MPI_Comm dcomm;
   MPI_Comm_connect(MPIPort.load(std::memory_order_seq_cst),MPI_INFO_NULL,
                    Lrank.load(std::memory_order_relaxed),MPI_COMM_WORLD,&dcomm);
   pthread_join(MPI_accept,NULL);
   acpt_running = false;
}
if (Urank == Lrank.load(std::memory_order_relaxed))
{
   MPI_Unpublish_name(MPISvc,MPI_INFO_NULL,MPIPort.load(std::memory_order_seq_cst));
   MPI_Close_port(MPIPort.load(std::memory_order_seq_cst));
}
return 1;
}