[mpich-discuss] Unexplainable stall with MPI ports
Florian Lindner
mailinglists at xgm.de
Thu Jul 26 08:30:23 CDT 2018
Hello,
I am seeing behavior I cannot explain in an application that uses MPI ports.
The entire test application is available at https://github.com/floli/MPI_Ports (dependencies: cmake, boost).
I am using MPICH 3.2.1 on Arch Linux.
All the relevant code is in mpiports.cpp, launched with
/mpirun -n 4 ./mpiports -p A --peers 2 --commType many
and, for the second participant, with -p B instead of -p A.
Unfortunately, the issue does not occur with fewer ranks, and even with 4 ranks it does not appear on every run... :-/
if (options.participant == A) { // receives connections
  if (options.commType == many) {
    portName = lookupPort(options, rank); // gets the port address, always the same for one rank
    for (auto r : comRanks) { // the ranks that connect to me
      MPI_Comm icomm;
      INFO << "Accepting connection on " << portName;
      DEBUG << "SIZE = " << portName.size();
      MPI_Comm_accept(portName.c_str(), MPI_INFO_NULL, 0, MPI_COMM_SELF, &icomm); // <- Rank 1, participant A stops here
      DEBUG << "Accepted connection on " << portName;
      DEBUG << "icomm size = " << getRemoteCommSize(icomm);
      int connectedRank = -1;
      MPI_Recv(&connectedRank, 1, MPI_INT, 0, MPI_ANY_TAG, icomm, MPI_STATUS_IGNORE);
      MPI_Send(&rank, 1, MPI_INT, 0, 0, icomm);
      DEBUG << "Received rank number " << connectedRank;
      comms[connectedRank] = icomm;
    }
  }
}
if (options.participant == B) { // connects to the intercomms
  sleep(1000);
  if (options.commType == many) {
    for (auto r : comRanks) { // the ranks I connect to
      MPI_Comm icomm;
      portName = lookupPort(options, r); // reads ports for connecting to rank r
      INFO << "Connecting to rank " << r << " on " << portName;
      MPI_Comm_connect(portName.c_str(), MPI_INFO_NULL, 0, MPI_COMM_SELF, &icomm);
      DEBUG << "icomm size = " << getRemoteCommSize(icomm);
      DEBUG << "Connected to rank " << r << " on " << portName;
      MPI_Send(&rank, 1, MPI_INT, 0, 0, icomm);
      int connectedRank = -1;
      MPI_Recv(&connectedRank, 1, MPI_INT, 0, MPI_ANY_TAG, icomm, MPI_STATUS_IGNORE); // <- Rank 0,1 participant B stop here
      comms[connectedRank] = icomm;
    }
  }
}
I wrote down my observations on where the ranks stop:
| Participant | Rank | Where              | comRanks | Port  | comms |
|-------------+------+--------------------+----------+-------+-------|
| A           |    0 | Start dataExchange | [0,1]    | 39741 | [0,1] |
| A           |    1 | Accept             | [0,1,2]  | 48187 | [2]   |
| A           |    2 | Start dataExchange | [2,3]    | 35863 | [2,3] |
| A           |    3 | Start dataExchange | [3]      | 39791 | [3]   |
|             |      |                    |          |       |       |
| B           |    0 | Recv connectedRank | [0,1]    | 39741 | []    |
| B           |    1 | Recv connectedRank | [0,1]    | 39741 | []    |
| B           |    2 | Start dataExchange | [1,2]    | 35863 | [1,2] |
| B           |    3 | Start dataExchange | [2,3]    | 39791 | [2,3] |
comRanks on A means: the ranks (on the other participant) that connect to me.
comRanks on B means: the ranks I connect to.
"Start dataExchange" is an MPI_Barrier; it is located below the code I have pasted here.
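Since every A rank calls MPI_Comm_accept once per entry in its comRanks and every B rank calls MPI_Comm_connect once per entry in its own, the two lists have to be inverses of each other for the loops to finish. A minimal sketch of that check follows; invertComRanks and comRanksB are hypothetical names that do not appear in mpiports.cpp, and the data is copied from the table above.

```cpp
#include <map>
#include <set>
#include <vector>

// Hypothetical helper, not part of mpiports.cpp: given, for every B rank,
// the A ranks it connects to, derive for every A rank the set of B ranks
// it has to accept.
std::map<int, std::set<int>> invertComRanks(
    const std::map<int, std::vector<int>>& connectsTo)
{
  std::map<int, std::set<int>> acceptedBy;
  for (const auto& entry : connectsTo) {
    const int bRank = entry.first;
    for (int aRank : entry.second)
      acceptedBy[aRank].insert(bRank);
  }
  return acceptedBy;
}

// comRanks of participant B, copied from the table above.
const std::map<int, std::vector<int>> comRanksB = {
    {0, {0, 1}}, {1, {0, 1}}, {2, {1, 2}}, {3, {2, 3}}};
```

Calling invertComRanks(comRanksB) reproduces the comRanks column of the A rows, so the number of accepts and connects matches on both sides.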
It seems that A,0 successfully received connectedRank = 0 and 1 from B,0 and B,1, and that its Send operations returned as well. B,0 and B,1 have both connected to A,0 (as seen from the identical port number), yet both are still blocked in their Recv.
Also, A,1 is still waiting for incoming connections; only B,2 has connected so far. That is expected, since B,0 and B,1 are still waiting to receive connectedRank from A,0.
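To make the wait-for picture explicit, here is a small sketch that encodes the table and computes, for each row, which peers are still missing and which A rank owns the port a B rank is using. Row, table, missingPeers and ownerOfPort are hypothetical names for illustration only. It assumes that the Port column of an A row is that rank's own port and that the Port column of a B row is the port it last looked up.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One row of the observation table (hypothetical struct for illustration).
struct Row {
  char participant;           // 'A' or 'B'
  int rank;
  std::string where;          // where the rank currently sits
  std::vector<int> comRanks;  // peers this rank should end up connected to
  int port;                   // A: own port; B: last port looked up (assumed)
  std::vector<int> comms;     // peers for which a communicator is stored
};

// Data copied from the table: A,0..A,3 followed by B,0..B,3.
const std::vector<Row> table = {
    {'A', 0, "Start dataExchange", {0, 1},    39741, {0, 1}},
    {'A', 1, "Accept",             {0, 1, 2}, 48187, {2}},
    {'A', 2, "Start dataExchange", {2, 3},    35863, {2, 3}},
    {'A', 3, "Start dataExchange", {3},       39791, {3}},
    {'B', 0, "Recv connectedRank", {0, 1},    39741, {}},
    {'B', 1, "Recv connectedRank", {0, 1},    39741, {}},
    {'B', 2, "Start dataExchange", {1, 2},    35863, {1, 2}},
    {'B', 3, "Start dataExchange", {2, 3},    39791, {2, 3}},
};

// Peers listed in comRanks for which no communicator has been stored yet.
std::vector<int> missingPeers(const Row& row) {
  std::vector<int> missing;
  for (int r : row.comRanks)
    if (std::find(row.comms.begin(), row.comms.end(), r) == row.comms.end())
      missing.push_back(r);
  return missing;
}

// Rank of the A row that lists the given port, or -1 if none does.
int ownerOfPort(int port) {
  for (const Row& row : table)
    if (row.participant == 'A' && row.port == port)
      return row.rank;
  return -1;
}
```

With this data, missingPeers gives [0,1] for A,1 as well as for B,0 and B,1, and ownerOfPort maps the port of both B,0 and B,1 to A,0, which has already reached the barrier.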
I know that's a lot to ask, but I would be very grateful if you could have a look and provide hints on what is going wrong here...
Thanks!
Florian