[mpich-discuss] Fatal error in PMPI_Reduce

Pavan Balaji balaji at mcs.anl.gov
Fri Jan 11 20:27:41 CST 2013


Ok, this doesn't seem to have anything to do with slurm.  Can you try
running simple programs from the mpich install to make sure it's
correctly installed?  For example, can you try examples/cpi in mpich?
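Something along these lines from the top of the build tree should do it
(the hostfile name and process count below are just placeholders):

  $ mpiexec -f hostfile -n 8 ./examples/cpi

cpi is compiled as part of the default "make", so it should already be
sitting in examples/.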

 -- Pavan

On 01/11/2013 05:47 PM US Central Time, Michael Colonno wrote:
>             Follow up: I rebuilt MPICH2 3.0.1 without any link to SLURM
> and repeated the experiment with the same result:
> 
> Fatal error in PMPI_Reduce: A process has failed, error stack:
> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fffee6f1ca0,
> rbuf=0x7fffee6f1ca8, count=1, MPI_DOUBLE, MPI_SUM, root=0,
> MPI_COMM_WORLD) failed
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(779)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(144).......:
> MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
> MPIR_Reduce_intra(799)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(206).......: Failure during collective
> 
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> 
> [proxy:0:1 at n2] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
> [proxy:0:1 at n2] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:1 at n2] main (./pm/pmiserv/pmip.c:206): demux engine error
> waiting for event
> [mpiexec at n1] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
> terminated badly; aborting
> [mpiexec at n1] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
> for completion
> [mpiexec at n1] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
> completion
> [mpiexec at n1] main (./ui/mpich/mpiexec.c:330): process manager error
> waiting for completion
> 
>             This is now a vanilla install trying to run the cxxcpi
> example code. Any help is appreciated.
> 
>             Thanks,
> 
>             ~Mike C.
> 
> From: Michael Colonno [mailto:mcolonno at stanford.edu]
> Sent: Friday, January 11, 2013 1:32 PM
> To: 'discuss at mpich.org'
> Subject: Fatal error in PMPI_Reduce
> 
>             Hi All ~
> 
>             I've compiled MPICH2 3.0 with the Intel compiler (v. 13) on
> a CentOS 6.3 x64 system using SLURM as the process manager. My configure
> was simply:
> 
> ./configure --with-pmi=slurm --with-pm=no --prefix=/usr/local/apps/MPICH2
> 
> No errors during build or install. When I compile and run the example
> program cxxcpi I get (truncated):
> 
> $ srun -n32 /usr/local/apps/cxxcpi
> Fatal error in PMPI_Reduce: A process has failed, error stack:
> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4ad18120,
> rbuf=0x7fff4ad18128, count=1, MPI_DOUBLE, MPI_SUM, root=0,
> MPI_COMM_WORLD) failed
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(779)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(144).......:
> MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
> MPIR_Reduce_intra(799)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(206).......: Failure during collective
> srun: error: task 0: Exited with exit code 1
> 
>             I see this error with many of my MPI programs. A different
> application yields:
> 
> PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff545be5fc, count=1, MPI_INT,
> root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1369).:
> MPIR_Bcast_intra(1160):
> MPIR_SMP_Bcast(1077)..: Failure during collective
> 
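> For reference, the failing call boils down to a single one-element
> MPI_SUM reduction to rank 0. A minimal standalone test along the lines
> of the sketch below (the file name reduce_test.c is just a placeholder)
> exercises the same path with no application code involved:
> 
> /* reduce_test.c: minimal MPI_Reduce check -- each rank contributes
>  * 1.0 and rank 0 should receive the process count as the sum.
>  * Build with mpicc and launch with srun/mpiexec as above. */
> #include <mpi.h>
> #include <stdio.h>
> 
> int main(int argc, char **argv)
> {
>     int rank, size;
>     double mine = 1.0, total = 0.0;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>     /* same shape as the call in the error stack: count=1,
>      * MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD */
>     MPI_Reduce(&mine, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
> 
>     if (rank == 0)
>         printf("sum = %f (expected %d)\n", total, size);
> 
>     MPI_Finalize();
>     return 0;
> }
> 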
>             Can anyone point me in the right direction?
> 
>  
> 
>             Thanks,
> 
>             ~Mike C.  
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji


