[mpich-discuss] Increasing MPI ranks

Jed Brown jed at jedbrown.org
Wed Mar 12 18:52:43 CDT 2014


Jeffrey Larson <jmlarson at anl.gov> writes:

> I am not calling the cpi.py script directly. The master is spawning those
> processes. So I call
>
> $ mpiexec -n 30 python master.py
>
> Then each of the 30 ranks should spawn a cpi.py process. But with the
> attached master.py and cpi.py (directly from the mpi4py tutorial), you can
> see the errors I get:
>
> [jlarson at mintthinkpad tutorial_example]$ mpiexec -n 30 python master.py
> [mpiexec at mintthinkpad] control_cb (pm/pmiserv/pmiserv_cb.c:200): assert
> (!closed) failed
> [mpiexec at mintthinkpad] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:76): callback returned error status
> [mpiexec at mintthinkpad] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
> [mpiexec at mintthinkpad] main (ui/mpich/mpiexec.c:336): process manager error
> waiting for completion
>
> As was previously stated, this appears to be an mpi4py problem and not
> an MPICH question.

The problem seems to be related to passing arguments.  This works for
me:

diff --git a/master.py b/master.py
index 620c484..e103356 100755
--- a/master.py
+++ b/master.py
@@ -3,7 +3,7 @@ from mpi4py import MPI
 import numpy
 import sys
 
-comm = MPI.COMM_SELF.Spawn(sys.executable, args=['cpi.py'], maxprocs=5)
+comm = MPI.COMM_SELF.Spawn('./cpi.py', None, maxprocs=5)
 
 N = numpy.array(100, 'i')
 comm.Bcast([N, MPI.INT], root=MPI.ROOT)
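
Note that spawning './cpi.py' directly (instead of sys.executable with
args) requires the script to be executable (chmod +x cpi.py) and to
start with a "#!/usr/bin/env python" line.  For reference, here is a
rough sketch of what the tutorial-style cpi.py worker looks like; the
exact script in the mpi4py tutorial may differ in details:

#!/usr/bin/env python
# Sketch of the spawned worker (cpi.py): connect back to the master,
# receive N, sum this rank's slice of the pi integral, and reduce.
from mpi4py import MPI
import numpy

comm = MPI.Comm.Get_parent()          # intercommunicator to the master
size = comm.Get_size()
rank = comm.Get_rank()

N = numpy.array(0, dtype='i')
comm.Bcast([N, MPI.INT], root=0)      # N broadcast by the master
h = 1.0 / int(N)
s = 0.0
for i in range(rank, int(N), size):   # each worker handles a stride
    x = h * (i + 0.5)
    s += 4.0 / (1.0 + x**2)
PI = numpy.array(s * h, dtype='d')
comm.Reduce([PI, MPI.DOUBLE], None, op=MPI.SUM, root=0)

comm.Disconnect()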


The original also works fine with Open MPI.  I managed to reproduce with
pure C and MPICH.

diff --git i/demo/spawning/cpi-master.c w/demo/spawning/cpi-master.c
index 7bff4e3..6f8715b 100644
--- i/demo/spawning/cpi-master.c
+++ w/demo/spawning/cpi-master.c
@@ -6,6 +6,7 @@
 int main(int argc, char *argv[])
 {
   char cmd[32] = "cpi-worker-c.exe";
+  char *wargs[] = {"a",NULL};
   MPI_Comm worker;
   int n;
   double pi;
@@ -15,7 +16,7 @@ int main(int argc, char *argv[])
   if (argc > 1) strcpy(cmd, argv[1]);
   printf("%s -> %s\n", argv[0], cmd);
 
-  MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 5,
+  MPI_Comm_spawn(cmd, wargs, 5,
                  MPI_INFO_NULL, 0,
                  MPI_COMM_SELF, &worker,
                  MPI_ERRCODES_IGNORE);


The attached version includes a sleep(1) in the worker to make the
crash more reproducible.

$ make CC=/opt/mpich/bin/mpicc cpi-master cpi-worker              
/opt/mpich/bin/mpicc     cpi-master.c   -o cpi-master
/opt/mpich/bin/mpicc     cpi-worker.c   -o cpi-worker
$ /opt/mpich/bin/mpiexec -n 30 ./cpi-master
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
./cpi-master -> ./cpi-worker
Assertion failed in file ../src/mpi/coll/helper_fns.c at line 491: status->MPI_TAG == recvtag
internal ABORT - process 0
Assertion failed in file ../src/mpi/coll/helper_fns.c at line 491: status->MPI_TAG == recvtag
internal ABORT - process 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 7329 RUNNING AT batura
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:27:0 at batura] HYD_pmcd_pmip_control_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:27:0 at batura] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:27:0 at batura] main (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:28:0 at batura] HYD_pmcd_pmip_control_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:28:0 at batura] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:28:0 at batura] main (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:29:0 at batura] HYD_pmcd_pmip_control_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:29:0 at batura] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:29:0 at batura] main (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at batura] HYDT_bscu_wait_for_completion (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at batura] HYDT_bsci_wait_for_completion (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at batura] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at batura] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:336): process manager error waiting for completion


> Since you are curious about the application: the motivating example
> involves the numerical optimization of the output from an expensive
> simulation. I do not have access to the simulation code, so my master
> will tell the workers where to evaluate the expensive simulation. The
> simulation may itself depend heavily on MPI.

How will the code you don't "have access to" report back?  Presumably it
was not written to use MPI_Comm_get_parent?  When possible, it's better
for the simulation to take an MPI_Comm argument on which to run.
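
To illustrate that last point, here is a minimal mpi4py sketch of the
library-style pattern: the simulation exposes an entry point that takes
a communicator (run_simulation below is a hypothetical name standing in
for the real interface), and the master splits MPI_COMM_WORLD into
sub-communicators instead of spawning new processes:

from mpi4py import MPI

def run_simulation(comm, x):
    # Hypothetical entry point: run the expensive simulation on the
    # ranks of `comm` at design point `x`.  The allreduce is just a
    # stand-in for real parallel work.
    return comm.allreduce(x**2, op=MPI.SUM)

world = MPI.COMM_WORLD
group_size = 5                              # ranks per simulation
color = world.Get_rank() // group_size      # which group this rank joins
subcomm = world.Split(color, key=world.Get_rank())

result = run_simulation(subcomm, x=1.0 + color)
if subcomm.Get_rank() == 0:
    print("group", color, "->", result)
subcomm.Free()

Run with, e.g., "mpiexec -n 30 python driver.py": each group of 5 ranks
runs one simulation instance, results stay on the group's rank 0, and
no dynamic process management (and hence none of the spawn machinery
above) is involved.  The file name driver.py is just a placeholder.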


-------------- next part --------------
A non-text attachment was scrubbed...
Name: cpi-master.c
Type: text/x-csrc
Size: 737 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140312/d22dbe66/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cpi-worker.c
Type: text/x-csrc
Size: 609 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140312/d22dbe66/attachment-0001.bin>

