[mpich-discuss] having problem running MPICH on multiple nodes

Bland, Wesley B. wbland at anl.gov
Tue Nov 25 21:02:38 CST 2014


Can you also provide your config.log and any CVARs or other relevant environment variables that you might be setting (for instance, in relation to fault tolerance)?

Thanks,
Wesley

On Nov 25, 2014, at 3:58 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:

This is the simplest code I have that doesn't run.


#include <mpi.h>
#include <stdio.h>
#include <malloc.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char** argv)
{
    int rank, size;
    int i, j, k;
    double t1, t2, t_avg;
    int rc;

    MPI_Init(&argc, &argv);
    MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2;
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &size);

    t2 = 1;
    MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
    t_avg = t_avg / size;

    MPI_Finalize();

    return 0;
}

Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.

On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Peña" <apenya at mcs.anl.gov> wrote:

Hi Amin,

Can you share with us a minimal piece of code with which you can reproduce this issue?

Thanks,
  Antonio



On 11/25/2014 12:52 PM, Amin Hassani wrote:
Hi,

I am having a problem running MPICH on multiple nodes. When I run multiple MPI processes on one node, it works fine, but when I try to run on multiple nodes, it fails with the error below.
My machines run Debian and have both InfiniBand and TCP interconnects. I'm guessing it has something to do with the TCP network, but I can run Open MPI on these machines with no problem, yet for some reason I cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there is some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution.

In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD.

My host file (hosts-hydra) looks like this:
oakmnt-0-a:1
oakmnt-0-b:1
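
For readers unfamiliar with this format: each line of a Hydra host file names a host, optionally followed by a colon and the number of processes to start there, so the file above asks for one process on each of the two nodes. A minimal sketch of how such a line splits, using a hypothetical helper parse_host_line that is not part of MPICH or Hydra, and assuming a count of 1 when none is given:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of MPICH or Hydra: split one host-file
 * line of the form "host" or "host:count" into its two parts.
 * Returns 0 on success, -1 if the line is malformed or the name
 * does not fit into the caller's buffer. */
static int parse_host_line(const char *line, char *host, size_t host_len,
                           int *slots)
{
    assert(line != NULL && host != NULL && slots != NULL);

    /* The host name ends at the first ':' or, if there is none,
     * at the first whitespace character. */
    const char *colon = strchr(line, ':');
    size_t name_len = colon ? (size_t)(colon - line)
                            : strcspn(line, " \t\r\n");

    if (name_len == 0 || name_len >= host_len)
        return -1;
    memcpy(host, line, name_len);
    host[name_len] = '\0';

    *slots = 1; /* assumed default when no ":count" is given */
    if (colon) {
        char *end;
        long n = strtol(colon + 1, &end, 10);
        /* Reject a missing, non-positive, or oversized count. */
        if (end == colon + 1 || n <= 0 || n > INT_MAX)
            return -1;
        *slots = (int)n;
    }
    return 0;
}
```

For example, calling parse_host_line("oakmnt-0-a:1", host, sizeof host, &slots) would fill host with "oakmnt-0-a" and set slots to 1.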

I get this error:

$ mpirun -hostfile hosts-hydra -np 2  test_dup
Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag
internal ABORT - process 1
internal ABORT - process 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 30744 RUNNING AT oakmnt-0-b
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[mpiexec@vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor)
[mpiexec@vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy
[mpiexec@vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion

Thanks.
Amin Hassani,
CIS department at UAB,
Birmingham, AL, USA.



_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss



--
Antonio J. Peña
Postdoctoral Appointee
Mathematics and Computer Science Division
Argonne National Laboratory
9700 South Cass Avenue, Bldg. 240, Of. 3148
Argonne, IL 60439-4847
apenya at mcs.anl.gov
http://www.mcs.anl.gov/~apenya
