[mpich-discuss] MPI_Info key host error in SLURM
Raffenetti, Kenneth J.
raffenet at mcs.anl.gov
Wed Mar 24 14:21:43 CDT 2021
On 3/24/21, 10:44 AM, "Iker Martín Álvarez" <martini at uji.es> wrote:
Sorry for the late reply,
I tried using that option, but the outcome was the same.
However, with the info key "hostfile" and a file containing the following text, it works as expected. In this example, it creates 10 processes on n00 and 5 on n01.
n00:10
n01:5
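For reference, a minimal sketch of the parent side using this info key (the hostfile path, child binary name, and process count here are placeholders, not the exact code from my program):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Place 15 children according to the hostfile above:
           10 on n00 and 5 on n01. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "hostfile", "hosts.txt");

        MPI_Comm intercomm;
        int errcodes[15];
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 15, info,
                       0, MPI_COMM_WORLD, &intercomm, errcodes);
        MPI_Info_free(&info);

        /* ... communicate with the children over intercomm ... */

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }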
Interesting. I'm glad you could get it working. There must be some bug with the other method.
When trying this key, I noticed that if the parent processes try to finish their execution while their children are still doing some work, the parents block in MPI_Finalize until the children call this function too.
This happens even if all the processes in both groups call MPI_Comm_disconnect. I think this happens because both groups are still connected, and therefore the parents wait until the children call this function.
Could it be that I am missing something?
I based my conclusion on what is stated here:
https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node226.htm
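As a minimal sketch of the pattern I mean (names are placeholders), the child side looks roughly like:

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        MPI_Comm parent;
        MPI_Comm_get_parent(&parent);

        /* ... long-running work while the parents try to finish ... */

        if (parent != MPI_COMM_NULL)
            MPI_Comm_disconnect(&parent); /* collective over both groups */

        /* Per the MPI-2.2 text linked above, after the disconnect the
           parents' MPI_Finalize should not have to wait for this call. */
        MPI_Finalize();
        return 0;
    }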
If both groups call MPI_Comm_disconnect, then theoretically they shouldn't block in MPI_Finalize as you describe. However, our implementation may not fully complete the disconnect until finalize is called. We recently began some work to overhaul this part of MPICH. Hopefully a future release will contain a more robust disconnect mechanism.
Ken