[mpich-discuss] Fatal error in PMPI_Reduce
Pavan Balaji
balaji at mcs.anl.gov
Fri Jan 11 20:27:41 CST 2013
Ok, this doesn't seem to have anything to do with slurm. Can you try
running simple programs from the mpich install to make sure it's
correctly installed? For example, can you try examples/cpi in mpich?
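Something like the following should work (a sketch only; the source-tree and
install paths are assumptions, so adjust them for your setup):

$ cd mpich-3.0.1/examples
$ make cpi
$ /usr/local/apps/MPICH2/bin/mpiexec -n 4 ./cpi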
-- Pavan
On 01/11/2013 05:47 PM US Central Time, Michael Colonno wrote:
> Follow-up: I rebuilt MPICH2 3.0.1 without any link to SLURM
> and repeated the experiment, with the same result:
>
>
>
> Fatal error in PMPI_Reduce: A process has failed, error stack:
> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fffee6f1ca0,
> rbuf=0x7fffee6f1ca8, count=1, MPI_DOUBLE, MPI_SUM, root=0,
> MPI_COMM_WORLD) failed
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(779)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(144).......:
> MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
> MPIR_Reduce_intra(799)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(206).......: Failure during collective
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 1
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
>
> [proxy:0:1 at n2] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
> [proxy:0:1 at n2] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:1 at n2] main (./pm/pmiserv/pmip.c:206): demux engine error
> waiting for event
> [mpiexec at n1] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
> terminated badly; aborting
> [mpiexec at n1] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
> for completion
> [mpiexec at n1] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
> completion
> [mpiexec at n1] main (./ui/mpich/mpiexec.c:330): process manager error
> waiting for completion
>
>
> This is now a vanilla install trying to run the cxxcpi
> example code. Any help is appreciated.
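>
> For reference, the launch in this rebuilt configuration looks roughly like the
> following (a sketch; the hostfile contents and process count are assumptions
> based on the node names n1/n2 and the rank numbers in the output above):
>
> $ cat hosts
> n1
> n2
> $ /usr/local/apps/MPICH2/bin/mpiexec -f hosts -n 32 /usr/local/apps/cxxcpi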
>
>
>
> Thanks,
>
> ~Mike C.
>
>
>
> *From:* Michael Colonno [mailto:mcolonno at stanford.edu]
> *Sent:* Friday, January 11, 2013 1:32 PM
> *To:* 'discuss at mpich.org'
> *Subject:* Fatal error in PMPI_Reduce
>
>
>
> Hi All ~
>
>
>
> I've compiled MPICH2 3.0 with the Intel compiler (v. 13) on
> a CentOS 6.3 x64 system using SLURM as the process manager. My configure
> was simply:
>
>
>
> ./configure --with-pmi=slurm --with-pm=no --prefix=/usr/local/apps/MPICH2
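>
> followed by the usual build and install, and compiling the example (a sketch;
> the exact example source path is an assumption):
>
> $ make && make install
> $ /usr/local/apps/MPICH2/bin/mpicxx examples/cxxcpi.cxx -o /usr/local/apps/cxxcpi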
>
>
>
> There were no errors during the build or install. When I compile and run the
> example program cxxcpi I get the following (truncated):
>
>
>
> $ srun -n32 /usr/local/apps/cxxcpi
>
> Fatal error in PMPI_Reduce: A process has failed, error stack:
> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4ad18120,
> rbuf=0x7fff4ad18128, count=1, MPI_DOUBLE, MPI_SUM, root=0,
> MPI_COMM_WORLD) failed
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(779)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(144).......:
> MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
> MPIR_Reduce_intra(799)..........:
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(206).......: Failure during collective
> srun: error: task 0: Exited with exit code 1
>
>
>
> I see this error with many of my MPI programs. A
> different application yields:
>
>
>
> PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff545be5fc, count=1, MPI_INT,
> root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1369).:
> MPIR_Bcast_intra(1160):
> MPIR_SMP_Bcast(1077)..: Failure during collective
>
>
>
> Can anyone point me in the right direction?
>
>
>
> Thanks,
>
> ~Mike C.
>
>
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji