[mpich-discuss] Fatal error in PMPI_Reduce

Michael Colonno mcolonno at stanford.edu
Fri Jan 11 21:18:36 CST 2013


The first output below is from cxxcpi; I can also run cpi if it's helpful.

Thanks,
Mike C.

On Jan 11, 2013, at 6:27 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

> 
> Ok, this doesn't seem to have anything to do with slurm.  Can you try
> running simple programs from the mpich install to make sure it's
> correctly installed?  For example, can you try examples/cpi in mpich?
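
If useful, that check would go roughly like this (a sketch only; the source
path and host file name below are placeholders):

    cd mpich-3.0.1/examples
    mpicc -o cpi cpi.c
    mpiexec -f hosts -n 32 ./cpi    # hydra build; with the slurm-PMI build: srun -n 32 ./cpi
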
> 
> -- Pavan
> 
> On 01/11/2013 05:47 PM US Central Time, Michael Colonno wrote:
>>            Follow-up: I rebuilt MPICH2 3.0.1 without any link to SLURM
>> and repeated the experiment with the same result:
>> 
>> 
>> 
>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>> 
>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fffee6f1ca0,
>> rbuf=0x7fffee6f1ca8, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>> MPI_COMM_WORLD) failed
>> 
>> MPIR_Reduce_impl(1029)..........:
>> 
>> MPIR_Reduce_intra(779)..........:
>> 
>> MPIR_Reduce_impl(1029)..........:
>> 
>> MPIR_Reduce_intra(835)..........:
>> 
>> MPIR_Reduce_binomial(144).......:
>> 
>> MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
>> 
>> MPIR_Reduce_intra(799)..........:
>> 
>> MPIR_Reduce_impl(1029)..........:
>> 
>> MPIR_Reduce_intra(835)..........:
>> 
>> MPIR_Reduce_binomial(206).......: Failure during collective
>> 
>> 
>> 
>> ===================================================================================
>> 
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> 
>> =   EXIT CODE: 1
>> 
>> =   CLEANING UP REMAINING PROCESSES
>> 
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> 
>> ===================================================================================
>> 
>> [proxy:0:1 at n2] HYD_pmcd_pmip_control_cmd_cb
>> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
>> 
>> [proxy:0:1 at n2] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77): callback returned error status
>> 
>> [proxy:0:1 at n2] main (./pm/pmiserv/pmip.c:206): demux engine error
>> waiting for event
>> 
>> [mpiexec at n1] HYDT_bscu_wait_for_completion
>> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>> terminated badly; aborting
>> 
>> [mpiexec at n1] HYDT_bsci_wait_for_completion
>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
>> for completion
>> 
>> [mpiexec at n1] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
>> completion
>> 
>> [mpiexec at n1] main (./ui/mpich/mpiexec.c:330): process manager error
>> waiting for completion
>> 
>> 
>> 
>>            This is now a vanilla install trying to run the cxxcpi
>> example code. Any help is appreciated.
>> 
>> 
>> 
>>            Thanks,
>> 
>>            ~Mike C.
>> 
>> 
>> 
>> From: Michael Colonno [mailto:mcolonno at stanford.edu]
>> Sent: Friday, January 11, 2013 1:32 PM
>> To: 'discuss at mpich.org'
>> Subject: Fatal error in PMPI_Reduce
>> 
>> 
>> 
>>            Hi All ~
>> 
>> 
>> 
>>            I've compiled MPICH2 3.0 with the Intel compiler (v. 13) on
>> a CentOS 6.3 x64 system using SLURM as the process manager. My configure
>> was simply:
>> 
>> 
>> 
>> ./configure --with-pmi=slurm --with-pm=no --prefix=/usr/local/apps/MPICH2
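
Side note: if SLURM's headers and libraries aren't in a default location, I
believe configure can also be pointed at the SLURM prefix explicitly (the
/opt/slurm path below is a placeholder, and I'm not certain it's needed
here); the install's mpichversion utility then echoes back the configure
options that were actually recorded:

    ./configure --with-pmi=slurm --with-pm=no --with-slurm=/opt/slurm \
                --prefix=/usr/local/apps/MPICH2
    make && make install
    /usr/local/apps/MPICH2/bin/mpichversion
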
>> 
>> 
>> 
>> No errors during build or install. When I compile and run the example
>> program cxxcpi I get (truncated):
>> 
>> 
>> 
>> $ srun -n32 /usr/local/apps/cxxcpi
>> 
>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>> 
>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4ad18120,
>> rbuf=0x7fff4ad18128, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>> MPI_COMM_WORLD) failed
>> 
>> MPIR_Reduce_impl(1029)..........:
>> 
>> MPIR_Reduce_intra(779)..........:
>> 
>> MPIR_Reduce_impl(1029)..........:
>> 
>> MPIR_Reduce_intra(835)..........:
>> 
>> MPIR_Reduce_binomial(144).......:
>> 
>> MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
>> 
>> MPIR_Reduce_intra(799)..........:
>> 
>> MPIR_Reduce_impl(1029)..........:
>> 
>> MPIR_Reduce_intra(835)..........:
>> 
>> MPIR_Reduce_binomial(206).......: Failure during collective
>> 
>> srun: error: task 0: Exited with exit code 1
>> 
>> 
>> 
>>            I see this error with many of my MPI programs. A different
>> application yields:
>> 
>> 
>> 
>> PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff545be5fc, count=1, MPI_INT,
>> root=0, MPI_COMM_WORLD) failed
>> 
>> MPIR_Bcast_impl(1369).:
>> 
>> MPIR_Bcast_intra(1160):
>> 
>> MPIR_SMP_Bcast(1077)..: Failure during collective
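
In case a bare-bones reproducer is useful: the snippet below does nothing
but the two collectives from the stacks above (MPI_Bcast then MPI_Reduce on
MPI_COMM_WORLD with root 0); the file name and values are arbitrary.

    /* mini_collectives.c -- minimal test of the two failing collectives */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, flag = 42;
        double local, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* same call pattern as the failing stacks: root 0, MPI_COMM_WORLD */
        MPI_Bcast(&flag, 1, MPI_INT, 0, MPI_COMM_WORLD);
        local = (double)rank;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks=%d  bcast=%d  sum=%.1f\n", size, flag, total);

        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched the same way as cxxcpi, e.g.
srun -n 32 ./mini_collectives.
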
>> 
>> 
>> 
>>            Can anyone point me in the right direction?
>> 
>> 
>> 
>>            Thanks,
>> 
>>            ~Mike C.  
>> 
>> 
>> 
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>> 
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji


