[mpich-discuss] MPI_Info key host error in SLURM

Raffenetti, Kenneth J. raffenet at mcs.anl.gov
Wed Mar 24 14:21:43 CDT 2021


On 3/24/21, 10:44 AM, "Iker Martín Álvarez" <martini at uji.es> wrote:

    Sorry for the late reply,
    I tried using that option, but the outcome was the same.
    However, with the info key "hostfile" and a file containing the following text, it works as expected. In this example, it creates 10 processes on n00 and 5 on n01.
    
    n00:10
    n01:5

Interesting. I'm glad you could get it working. There must be some bug with the other method.
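
For reference, spawning with that key would look roughly like this (a minimal sketch; "worker" and "hosts.txt" are placeholder names, with hosts.txt holding the two lines you listed):

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* hosts.txt is a placeholder filename containing the
       "n00:10" and "n01:5" lines from above */
    MPI_Info_create(&info);
    MPI_Info_set(info, "hostfile", "hosts.txt");

    /* spawn 15 children of a placeholder "worker" binary;
       placement follows the hostfile entries */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 15, info, 0,
                   MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}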
    
    While trying this key, I noticed that if the parent processes try to finish their execution while their children are still doing work, the parents block in MPI_Finalize until the children call it too.
    This happens even if all processes in both groups call MPI_Comm_disconnect. I think both groups are still considered connected, so the parents wait until the children call MPI_Finalize.
    Am I missing something?
    
    I based my conclusion on what is stated here:
    https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node226.htm
    
If both groups call MPI_Comm_disconnect, then in theory they shouldn't block in MPI_Finalize as you describe. However, our implementation may not fully complete the disconnect until finalize is called. We recently began work to overhaul this part of MPICH. Hopefully a future release will contain a more robust disconnect mechanism.
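
For what it's worth, the pattern you describe should be correct per the standard. A minimal sketch of the child side, assuming the parent disconnects its end of the intercommunicator as in the earlier snippet:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    /* ... child work happens here ... */

    if (parent != MPI_COMM_NULL)
        MPI_Comm_disconnect(&parent);

    /* Per the standard, once both sides have disconnected, the
       parents' MPI_Finalize need not wait for the children. */
    MPI_Finalize();
    return 0;
}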

Ken
 


