[mpich-discuss] Fatal error in PMPI_Reduce

Michael Colonno mcolonno at stanford.edu
Fri Jan 11 17:47:11 CST 2013


            Follow-up: I rebuilt MPICH2 3.0.1 without any link to SLURM and
repeated the experiment, with the same result (output below).
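
            For reference, the rebuild dropped the SLURM PMI and fell back to
MPICH's default Hydra process manager; the commands below are a sketch from
memory (the exact flags, host file, and binary path are approximate):

./configure --prefix=/usr/local/apps/MPICH2
make && make install
mpiexec -f hosts -n 32 /usr/local/apps/cxxcpi     # "hosts" lists n1, n2, ...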

 

Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fffee6f1ca0, rbuf=0x7fffee6f1ca8, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(779)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
MPIR_Reduce_intra(799)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(206).......: Failure during collective

 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

[proxy:0:1@n2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1@n2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@n2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@n1] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@n1] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@n1] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@n1] main (./ui/mpich/mpiexec.c:330): process manager error waiting for completion

 

            This is now a vanilla install trying to run the cxxcpi example
code. Any help is appreciated. 

 

            Thanks,

            ~Mike C. 

 

From: Michael Colonno [mailto:mcolonno at stanford.edu] 
Sent: Friday, January 11, 2013 1:32 PM
To: 'discuss at mpich.org'
Subject: Fatal error in PMPI_Reduce

 

            Hi All ~

 

            I've compiled MPICH2 3.0 with the Intel compiler (v. 13) on a
CentOS 6.3 x64 system using SLURM as the process manager. My configure was
simply: 

 

./configure --with-pmi=slurm --with-pm=no --prefix=/usr/local/apps/MPICH2

 

No errors during build or install. When I compile and run the example
program cxxcpi I get (truncated): 

 

$ srun -n32 /usr/local/apps/cxxcpi

Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4ad18120, rbuf=0x7fff4ad18128, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(779)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
MPIR_Reduce_intra(799)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(206).......: Failure during collective
srun: error: task 0: Exited with exit code 1
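
            For context, the failing call itself is nothing exotic; the sketch
below (not the actual cxxcpi source, just the same call shape) reduces one
MPI_DOUBLE per rank to root 0 over MPI_COMM_WORLD, exactly as in the stack
above:

#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double mypart = 1.0;   /* stand-in for each rank's partial result */
    double total  = 0.0;

    /* Same shape as the call in the error stack: one MPI_DOUBLE, MPI_SUM, root 0. */
    MPI_Reduce(&mypart, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}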

 

            I see this error with many of my MPI programs. A different
application yields:

 

PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff545be5fc, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1369).:
MPIR_Bcast_intra(1160):
MPIR_SMP_Bcast(1077)..: Failure during collective
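
            That failure has the same flavor; a minimal sketch of the call shape
(again, not the application's actual code) is just a one-element integer
broadcast from rank 0:

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int value = 0;
    /* Same shape as the failing MPI_Bcast above: one MPI_INT from root 0. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}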

 

            Can anyone point me in the right direction? 

 

            Thanks,

            ~Mike C.  
