[mpich-discuss] Fatal error in PMPI_Reduce
Michael Colonno
mcolonno at stanford.edu
Fri Jan 11 17:47:11 CST 2013
Follow-up: I rebuilt MPICH2 3.0.1 without any link to SLURM and
repeated the experiment with the same result:
Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fffee6f1ca0, rbuf=0x7fffee6f1ca8, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(779)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
MPIR_Reduce_intra(799)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(206).......: Failure during collective
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1 at n2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1 at n2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1 at n2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at n1] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at n1] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at n1] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec at n1] main (./ui/mpich/mpiexec.c:330): process manager error waiting for completion
This is now a vanilla install trying to run the cxxcpi example
code. Any help is appreciated.
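For reference, below is a minimal standalone sketch (not the cxxcpi source, just a bare-bones test along the same lines) that exercises the same call shape shown in the error stack above: one MPI_DOUBLE reduced with MPI_SUM to root 0 over MPI_COMM_WORLD.

// Minimal reproducer sketch: each rank contributes 1.0 and rank 0
// should print the communicator size if the reduction succeeds.
#include <mpi.h>
#include <iostream>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 1.0;    // per-rank contribution
    double total = 0.0;

    // Same call shape as in the stack: count=1, MPI_DOUBLE, MPI_SUM, root=0
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::cout << "sum = " << total << " (expected " << size << ")" << std::endl;

    MPI_Finalize();
    return 0;
}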
Thanks,
~Mike C.
From: Michael Colonno [mailto:mcolonno at stanford.edu]
Sent: Friday, January 11, 2013 1:32 PM
To: 'discuss at mpich.org'
Subject: Fatal error in PMPI_Reduce
Hi All ~
I've compiled MPICH2 3.0 with the Intel compiler (v. 13) on a
CentOS 6.3 x64 system using SLURM as the process manager. My configure line was
simply:
./configure --with-pmi=slurm --with-pm=no --prefix=/usr/local/apps/MPICH2
There were no errors during build or install. When I compile and run the example
program cxxcpi, I get (truncated):
$ srun -n32 /usr/local/apps/cxxcpi
Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4ad18120, rbuf=0x7fff4ad18128, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(779)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
MPIR_Reduce_intra(799)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(206).......: Failure during collective
srun: error: task 0: Exited with exit code 1
I see this error with many of my MPI programs. A
different application yields:
PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff545be5fc, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1369).:
MPIR_Bcast_intra(1160):
MPIR_SMP_Bcast(1077)..: Failure during collective
Can anyone point me in the right direction?
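In case it helps with diagnosis, the simplest sanity check I can think of is sketched below (my own test, not an MPICH example): each task prints its rank, the communicator size, and its host, then hits a barrier. Compiled with the mpicxx wrapper and launched with srun the same way as above, I would expect ranks 0-31 all reporting size 32 if the SLURM PMI wiring is working; broken PMI wiring usually shows up as every task claiming to be rank 0 of 1.

// Sanity-check sketch: report rank/size/host, then exercise a simple
// collective (barrier) to confirm all tasks share one MPI_COMM_WORLD.
#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0, len = 0;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    std::printf("rank %d of %d on %s\n", rank, size, name);

    MPI_Barrier(MPI_COMM_WORLD);   // simple collective across all tasks
    MPI_Finalize();
    return 0;
}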
Thanks,
~Mike C.