[mpich-discuss] Getting an error in MPI_Comm_accept - ideas?

Amer, Abdelhalim aamer at anl.gov
Thu Sep 27 08:37:28 CDT 2018


Hi,

Can you pack the smallest test example into a single file so that we
can compile and run it? Also, please give us more information about the
MPICH version you are using and how it was built (run the `mpichversion`
binary to get this information). It would be better to upgrade to the
latest version (3.2.1) before reporting back. All of this is to help us
reproduce your problem, or simply solve it by moving to a newer version
of MPICH.
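
If it is easier, the same information can also be obtained
programmatically; here is a minimal sketch using the standard
MPI_Get_library_version call (MPI-3), whose string on MPICH includes the
version and build information:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len;
    // fills 'version' with the library's version/build description string
    MPI_Get_library_version(version, &len);
    printf("%s\n", version);
    MPI_Finalize();
    return 0;
}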

Halim
www.mcs.anl.gov/~aamer

On 9/21/18 10:39 AM, Alexander Rast wrote:
> All,
> 
> I'm running an MPI_Comm_accept in a separate thread whose purpose is to
> allow connections from other (hitherto unknown) MPI universes. A
> well-known issue with such configurations is that, because
> MPI_Comm_accept is blocking, a connection MUST be made before the
> application can exit, even if no other universe ever attempted to
> connect. In this situation (see Gropp et al., "Using Advanced MPI",
> section 6.5) the 'customary' solution seems to be to have the local
> universe connect to itself and then shut down.
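>
> For reference, here is the shape of that pattern as a minimal,
> self-contained sketch (illustrative only, not my actual code; it
> assumes a single rank and MPI_THREAD_MULTIPLE support):
>
> #include <mpi.h>
> #include <pthread.h>
>
> static char port[MPI_MAX_PORT_NAME];
>
> void* accept_thread(void* arg)
> {
>    // Blocks until some universe (possibly our own) connects to the port.
>    // MPI_Comm_accept is collective over the communicator, so with more
>    // than one rank every rank would have to participate.
>    MPI_Comm newcomm;
>    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &newcomm);
>    MPI_Comm_disconnect(&newcomm);
>    return arg;
> }
>
> int main(int argc, char** argv)
> {
>    int provided;
>    // a real program should check provided == MPI_THREAD_MULTIPLE
>    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>    MPI_Open_port(MPI_INFO_NULL, port);
>
>    pthread_t tid;
>    pthread_create(&tid, NULL, accept_thread, NULL);
>
>    /* ... normal application work ... */
>
>    // Shutdown: unblock the accept by connecting to our own port.
>    MPI_Comm dcomm;
>    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &dcomm);
>    MPI_Comm_disconnect(&dcomm); // matches the disconnect in the thread
>    pthread_join(tid, NULL);
>    MPI_Close_port(port);
>    MPI_Finalize();
>    return 0;
> }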
>  
> However, I'm getting the following error on exit:
> 
> Fatal error in PMPI_Comm_accept: Message truncated, error stack:
> PMPI_Comm_accept(129).............:
> MPI_Comm_accept(port="tag#0$description#aesop.cl.cam.ac.uk$port#43337$ifname#128.232.98.176$", MPI_INFO_NULL, root=6, MPI_COMM_WORLD, newcomm=0x1473f24) failed
> MPID_Comm_accept(153).............:
> MPIDI_Comm_accept(1005)...........:
> MPIR_Bcast_intra(1249)............:
> MPIR_SMP_Bcast(1088)..............:
> MPIR_Bcast_binomial(239)..........:
> MPIDI_CH3U_Receive_data_found(131): Message from rank 0 and tag 2
> truncated; 260 bytes received but buffer size is 12
> 
> You typically get several of these messages; the number seems to vary
> from trial to trial. I'm guessing they come from different MPI
> processes (although there is no fixed relationship between the number
> of processes started and the number of error messages).
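>
> For context on the error text itself: "Message truncated" is the
> generic MPI error for a posted receive buffer smaller than the
> incoming message; here it is apparently being raised inside the
> broadcast that MPIDI_Comm_accept performs internally. A minimal
> two-rank sketch of the same error class (tag and sizes chosen only to
> mirror the stack above):
>
> #include <mpi.h>
>
> int main(int argc, char** argv)
> {
>    MPI_Init(&argc, &argv);
>    int rank;
>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>    char big[260] = {0};
>    char small_buf[12];
>    if (rank == 0)
>       MPI_Send(big, 260, MPI_CHAR, 1, 2, MPI_COMM_WORLD);
>    else if (rank == 1)
>       // 12-byte buffer for a 260-byte message: fails with
>       // MPI_ERR_TRUNCATE, reported by MPICH as "Message truncated"
>       MPI_Recv(small_buf, 12, MPI_CHAR, 0, 2, MPI_COMM_WORLD,
>                MPI_STATUS_IGNORE);
>    MPI_Finalize();
>    return 0;
> }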
> 
> Does anyone have any suggestions on what types of problems might cause
> this error? I'm not expecting you to identify and debug the problem
> specifically (unless, perhaps, this error is indicative of some
> particular mistake); I would just like some hints on where to look.
> 
> If it helps, here are the 3 main relevant functions involved. There is,
> of course, a lot more going on in the application besides these, but
> the error occurs at shutdown, and by that point the rest of the
> application is quiescent. Also note that the Connect routine is NOT
> called at shutdown; it is invoked when a 'real' universe wants to connect.
> 
> //------------------------------------------------------------------------------
> 
> void* CommonBase::Accept(void* par)
> /* Blocking routine to connect to another MPI universe by publishing a port.
>    This operates in a separate thread to avoid blocking the whole process.
> */
> {
> CommonBase* parent=static_cast<CommonBase*>(par);
>
> while (parent->AcceptConns.load(std::memory_order_relaxed))
> {
> // run the blocking accept itself.
> if (MPI_Comm_accept(parent->MPIPort.load(std::memory_order_seq_cst),
>                     MPI_INFO_NULL,
>                     parent->Lrank.load(std::memory_order_relaxed),
>                     MPI_COMM_WORLD,
>                     parent->Tcomm.load(std::memory_order_seq_cst)))
> {
>    printf("Error: attempt to connect to another MPI universe failed\n");
>    parent->AcceptConns.store(false,std::memory_order_relaxed);
>    break;
> }
> // Now trigger the Connect process in the main thread to complete the setup
> PMsg_p Creq;
> string N("");     // zero-length string indicates a server-side connection
> Creq.Put(0,&N);
> Creq.Key(Q::SYST,Q::CONN);
> Creq.Src(0);
> Creq.comm = MPI_COMM_SELF;
> Creq.Send(0);
> // block until connect has succeeded
> while (*(parent->Tcomm.load(std::memory_order_seq_cst)) != MPI_COMM_NULL);
> }
> pthread_exit(par);
> return par;
> }
> 
> //------------------------------------------------------------------------------
> 
> unsigned CommonBase::Connect(string svc)
> // connects this process' MPI universe to a remote universe that has
> // published a name to access it by.
> {
> int error = MPI_SUCCESS;
> // a server has its port already so can just open a comm
> if (svc=="") Comms.push_back(*Tcomm.load(std::memory_order_seq_cst));
> else // clients need to look up the service name
> {
>    MPI_Comm newcomm;
>    char port[MPI_MAX_PORT_NAME];
>    // Get the published port for the service name asked for.
>    // Exit if we don't get a port, probably because the remote universe
>    // isn't initialised yet (we can always retry).
>    if (error = MPI_Lookup_name(svc.c_str(),MPI_INFO_NULL,port))
>       return error;
>    // now try to establish the connection itself. Again, we can always retry.
>    if (error = MPI_Comm_connect(port,MPI_INFO_NULL,0,MPI_COMM_WORLD,&newcomm))
>       return error;
>    Comms.push_back(newcomm); // as long as we succeeded, add to the list of comms
> }
> int rUsize;
> MPI_Comm_remote_size(Comms.back(), &rUsize);
> Usize.push_back(rUsize);       // record the size of the remote universe
> FnMapx.push_back(new FnMap_t); // give the new comm some function tables to use
> pPmap.push_back(new ProcMap(this)); // and a new processor map for the remote group
> PMsg_p prMsg;
> SendPMap(Comms.back(), &prMsg); // Send our process data to the remote group
> int fIdx=FnMapx.size()-1;
> // populate the new function table with the global functions
> (*FnMapx[fIdx])[Msg_p::KEY(Q::EXIT                )] = &CommonBase::OnExit;
> (*FnMapx[fIdx])[Msg_p::KEY(Q::PMAP                )] = &CommonBase::OnPmap;
> (*FnMapx[fIdx])[Msg_p::KEY(Q::SYST,Q::PING,Q::ACK )] = &CommonBase::OnSystPingAck;
> (*FnMapx[fIdx])[Msg_p::KEY(Q::SYST,Q::PING,Q::REQ )] = &CommonBase::OnSystPingReq;
> (*FnMapx[fIdx])[Msg_p::KEY(Q::TEST,Q::FLOO        )] = &CommonBase::OnTestFloo;
> // release any Accept comm.
> if (svc=="") *Tcomm.load(std::memory_order_seq_cst) = MPI_COMM_NULL;
> return error;
> }
> 
> //------------------------------------------------------------------------------
> 
> unsigned CommonBase::OnExit(PMsg_p * Z,unsigned cIdx)
> // Do not post anything further here - the LogServer may have already gone
> {
> AcceptConns.store(false,std::memory_order_relaxed); // stop accepting connections
> if (acpt_running)
> {
>    printf("(%s)::CommonBase::OnExit closing down Accept MPI request\n",
>           Sderived.c_str());
>    fflush(stdout);
>    // We have to close the Accept thread via a matching MPI_Comm_connect,
>    // because the MPI interface makes no provision for a nonblocking
>    // accept; otherwise the Accept thread would block forever waiting for
>    // a message that will never come, because we are shutting down. See
>    // Gropp et al., "Using Advanced MPI".
>    MPI_Comm dcomm;
>    MPI_Comm_connect(MPIPort.load(std::memory_order_seq_cst),
>                     MPI_INFO_NULL,
>                     Lrank.load(std::memory_order_relaxed),
>                     MPI_COMM_WORLD,&dcomm);
>    pthread_join(MPI_accept,NULL);
>    acpt_running = false;
> }
> if (Urank == Lrank.load(std::memory_order_relaxed))
> {
>    MPI_Unpublish_name(MPISvc,MPI_INFO_NULL,
>                       MPIPort.load(std::memory_order_seq_cst));
>    MPI_Close_port(MPIPort.load(std::memory_order_seq_cst));
> }
> return 1;
> }
> 
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

