[mpich-discuss] MPI_Reduce on an inter-communicator hangs

Thakur, Rajeev thakur at anl.gov
Sun Apr 21 17:23:00 CDT 2024


I haven’t run the code, but shouldn’t the workers pass 0 (rank of master) as the root instead of MPI_PROC_NULL?
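On an inter-communicator the reduction is rooted in one group: the root itself passes MPI_ROOT, any other processes in the root's (local) group pass MPI_PROC_NULL, and every process in the remote group passes the rank of the root within the root's group. Since the manager spawned the workers from MPI_COMM_SELF, it is rank 0 of its group, so the worker-side call would look roughly like this (untested sketch, reusing the names from your worker code):

    /* Untested sketch: the workers are in the remote group, so the root
       argument is the manager's rank in its own group, i.e. 0, not
       MPI_PROC_NULL. */
    MPI_Reduce(array, NULL, 100, MPI_UNSIGNED, MPI_SUM, /*root=*/0,
        manager_intercom);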

Rajeev

From: "Mccall, Kurt E. (MSFC-EV41) via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Sunday, April 21, 2024 at 4:53 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <kurt.e.mccall at nasa.gov>
Subject: [mpich-discuss] MPI_Reduce on an inter-communicator hangs

I am calling MPI_Reduce on a set of inter-communicators created by MPI_Comm_spawn, each with one
process in the local group (the single manager) and two processes in the remote group (the workers).
The inter-communicators are visited one at a time in the manager.

All workers enter and exit MPI_Reduce without blocking, but the manager enters the first MPI_Reduce
for the first inter-communicator and never returns. What am I doing wrong? I am using MPICH 4.1.2.

Here is my manager code:

#include <mpi.h>
#include <unistd.h>   // gethostname, sleep
#include <cstring>    // strstr
#include <iostream>

using std::cout;
using std::endl;

#define N_PROC 4
#define N_IN_GROUP 2

int main(int argc, char *argv[])
{
    int rank, world_size, error_codes[N_PROC];
    MPI_Comm intercoms[N_PROC];
    char hostname[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    gethostname(hostname, sizeof(hostname));
    char *p = strstr(hostname, ".");
    if (p != NULL) *p = '\0';   // strip the domain part, if any

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", hostname);
    MPI_Info_set(info, "bind_to", "core");

    for (int i = 0; i < N_PROC; ++i)
    {
        MPI_Comm_spawn("test_reduce_work", argv, N_IN_GROUP, info,
            0, MPI_COMM_SELF, &intercoms[i], &error_codes[i]);
    }

    sleep(10);

    unsigned array[100]{0};

    for (int i = 0; i < N_PROC; ++i)
    {
        cout << "MANAGER: starting reduction " << i << "\n";

        MPI_Reduce(NULL, array, 100, MPI_UNSIGNED, MPI_SUM, MPI_ROOT,
            intercoms[i]);

        cout << "MANAGER: finished reduction " << i << "\n";   // we never reach this point
    }

    for (int i = 0; i < 100; ++i) cout << array[i] << " ";
    cout << endl;

    MPI_Finalize();
}


And here is my worker code:


#include <mpi.h>
#include <unistd.h>   // sleep
#include <iostream>

using std::cout;

int main(int argc, char *argv[])
{
    int rank, world_size;
    MPI_Comm manager_intercom;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    MPI_Comm_get_parent(&manager_intercom);

    unsigned array[100]{1};   // only array[0] is 1; the remaining elements are zero-initialized

    cout << "WORKER: starting reduction\n";

    MPI_Reduce(array, NULL, 100,  MPI_UNSIGNED, MPI_SUM, MPI_PROC_NULL,
        manager_intercom);

    cout << "WORKER: finishing reduction\n";

    sleep(10);

    MPI_Finalize();
}


Finally, here is the invocation:



$ mpiexec -launcher ssh -print-all-exitcodes -wdir /home/kmccall/test_dir -np 1 -ppn 1 test_reduce_man

Thanks,
Kurt


