[mpich-discuss] Unexplainable stall with MPI ports

Florian Lindner mailinglists at xgm.de
Thu Jul 26 08:30:23 CDT 2018


Hello,

I am seeing behavior I cannot explain in an application that uses MPI ports.

The entire test application is available at https://github.com/floli/MPI_Ports (dependencies: cmake, boost).

I am using MPICH 3.2.1 on Arch Linux.

The relevant code is all in mpiports.cpp, which is launched with

/mpirun -n 4 ./mpiports -p A --peers 2 --commType many  
and with -p B instead of -p A for the other participant.

Unfortunately, the issue does not occur with fewer ranks, and it does not appear on every run... :-/

  if (options.participant == A) { // receives connections
    if (options.commType == many) {
      portName = lookupPort(options, rank); // gets the port address, always the same for one rank
      for (auto r : comRanks) { // the ranks that connect to me
        MPI_Comm icomm;
        INFO << "Accepting connection on " << portName;
        DEBUG << "SIZE = " << portName.size();
        MPI_Comm_accept(portName.c_str(), MPI_INFO_NULL, 0, MPI_COMM_SELF, &icomm); // <- Rank 1, participant A stops here
        DEBUG << "Accepted connection on " << portName;
        DEBUG << "icomm size = " << getRemoteCommSize(icomm);
        int connectedRank = -1;
        MPI_Recv(&connectedRank, 1, MPI_INT, 0, MPI_ANY_TAG, icomm, MPI_STATUS_IGNORE);
        MPI_Send(&rank, 1, MPI_INT, 0, 0, icomm);
        DEBUG << "Received rank number " << connectedRank;
        comms[connectedRank] = icomm;
      }
    }
  }


  if (options.participant == B) { // connects to the intercomms
    sleep(1000);
    if (options.commType == many) {
      for (auto r : comRanks) {  // the ranks I connect to
        MPI_Comm icomm;
        portName = lookupPort(options, r);  // reads ports for connecting to rank r
        INFO << "Connecting to rank " << r << " on " << portName;
        MPI_Comm_connect(portName.c_str(), MPI_INFO_NULL, 0, MPI_COMM_SELF, &icomm);
        DEBUG << "icomm size = " << getRemoteCommSize(icomm);
        DEBUG << "Connected to rank " << r << " on " << portName;
        MPI_Send(&rank, 1, MPI_INT, 0, 0, icomm);
        int connectedRank = -1;
        MPI_Recv(&connectedRank, 1, MPI_INT, 0, MPI_ANY_TAG, icomm, MPI_STATUS_IGNORE); // <- Rank 0,1 participant B stop here
        comms[connectedRank] = icomm;        
      }
    }
  }


I wrote down my observations on where the ranks stop:

| Participant | Rank | Where              | comRanks |  Port | comms |
|-------------+------+--------------------+----------+-------+-------|
| A           |    0 | Start dataExchange | [0,1]    | 39741 | [0,1] |
| A           |    1 | Accept             | [0,1,2]  | 48187 | [2]   |
| A           |    2 | Start dataExchange | [2,3]    | 35863 | [2,3] |
| A           |    3 | Start dataExchange | [3]      | 39791 | [3]   |
|             |      |                    |          |       |       |
| B           |    0 | Recv connectedRank | [0,1]    | 39741 | []    |
| B           |    1 | Recv connectedRank | [0,1]    | 39741 | []    |
| B           |    2 | Start dataExchange | [1,2]    | 35863 | [1,2] |
| B           |    3 | Start dataExchange | [2,3]    | 39791 | [2,3] |

comRanks on A means: the ranks (on the other participant) that connect to me
comRanks on B means: the ranks I connect to
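
As a sanity check on these two definitions, here is a minimal standalone sketch (no MPI involved) that inverts the B-side comRanks from the table and yields the A-side comRanks. The names invertComRanks and comRanksB are made up for illustration and do not appear in mpiports.cpp.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical helper, not part of mpiports.cpp: connectsTo[b] lists the A ranks
// that B rank b connects to. The result maps each A rank to the set of B ranks
// that connect to it, i.e. the A-side view of the same relation.
std::map<int, std::set<int>> invertComRanks(const std::vector<std::vector<int>>& connectsTo)
{
  std::map<int, std::set<int>> acceptsFrom;
  for (int b = 0; b < static_cast<int>(connectsTo.size()); ++b) {
    for (int a : connectsTo[b]) {
      assert(a >= 0); // ranks are never negative
      acceptsFrom[a].insert(b);
    }
  }
  return acceptsFrom;
}

// comRanks of B ranks 0..3, copied from the table above.
const std::vector<std::vector<int>> comRanksB = {{0, 1}, {0, 1}, {1, 2}, {2, 3}};
```

Calling invertComRanks(comRanksB) gives {0,1} for A,0, {0,1,2} for A,1, {2,3} for A,2 and {3} for A,3, which agrees with the comRanks column for participant A.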

"Start dataExchange" is an MPI_Barrier located below the code I have pasted here.

It seems that rank A,0 successfully received connectedRank = 0 and 1 from B,0 and B,1, and that its Send operations also returned. However, B,0 and B,1, which both connected to rank 0 (as seen by the identical port number), are still stuck in their Recv.
Also, A,1 is still waiting for incoming connections; only B,2 has connected so far. That part is expected, since B,0 and B,1 are still waiting to receive connectedRank from A,0 and therefore never move on to A,1.
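
The same bookkeeping can be written down as a small standalone sketch (again no MPI involved): the peers a rank is still waiting for are its comRanks minus the ranks already stored in comms. The helper name pendingPeers is hypothetical and only illustrates the reasoning; it is not code from the test application.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical helper: returns the ranks listed in comRanks that have no entry
// in comms yet. Both inputs must be sorted ascending, as they are in the table.
std::vector<int> pendingPeers(const std::vector<int>& comRanks, const std::vector<int>& comms)
{
  assert(std::is_sorted(comRanks.begin(), comRanks.end()));
  assert(std::is_sorted(comms.begin(), comms.end()));
  std::vector<int> pending;
  std::set_difference(comRanks.begin(), comRanks.end(),
                      comms.begin(), comms.end(),
                      std::back_inserter(pending));
  return pending;
}
```

With the values from the table, pendingPeers({0, 1, 2}, {2}) for A,1 returns {0, 1}, which are exactly the two B ranks blocked in Recv, while pendingPeers({0, 1}, {0, 1}) for A,0 returns an empty list.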

I know that's a lot to ask, but I would be very grateful if you could have a look and provide hints on what is going wrong here.

Thanks!
Florian