[mpich-discuss] Fatal error in PMPI_Reduce
Michael Colonno
mcolonno at stanford.edu
Sat Jan 12 12:29:33 CST 2013
Hi Pavan ~
Thanks for the reply. I did verify connectivity prior to trying
the program. In this test case I have two systems, each with 16 cores
(cv-hpcn1 and cv-hpcn2). ssh between the two systems appears to be working
as far as I can tell:
cv-hpcn1$ ssh cv-hpcn1 date
Sat Jan 12 10:20:01 PST 2013
cv-hpcn1$ ssh cv-hpcn2 date
Sat Jan 12 10:19:23 PST 2013
cv-hpcn1$ ssh cv-hpcn2
cv-hpcn2$ ssh cv-hpcn1 date
Sat Jan 12 10:20:43 PST 2013
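(As an additional sanity check of the Hydra launcher itself, one can run a plain command across both nodes with the same hostfile and confirm that both hostnames are reported; this is only a sketch, with -ppn 1 used to force one process per node:)

cv-hpcn1$ /usr/local/apps/MPICH2/standalone/bin/mpirun -ppn 1 -n 2 -f /usr/local/apps/hosts hostname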
Here's the full test run of cxxcpi:
cv-hpcn1$ /usr/local/apps/MPICH2/standalone/bin/mpirun -n 32 -f /usr/local/apps/hosts /usr/local/apps/cxxcpi
Process 17 of 32 is on cv-hpcn2
Process 19 of 32 is on cv-hpcn2
Process 20 of 32 is on cv-hpcn2
Process 21 of 32 is on cv-hpcn2
Process 22 of 32 is on cv-hpcn2
Process 23 of 32 is on cv-hpcn2
Process 24 of 32 is on cv-hpcn2
Process 26 of 32 is on cv-hpcn2
Process 28 of 32 is on cv-hpcn2
Process 29 of 32 is on cv-hpcn2
Process 30 of 32 is on cv-hpcn2
Process 31 of 32 is on cv-hpcn2
Process 16 of 32 is on cv-hpcn2
Process 18 of 32 is on cv-hpcn2
Process 25 of 32 is on cv-hpcn2
Process 27 of 32 is on cv-hpcn2
Process 1 of 32 is on cv-hpcn1
Process 2 of 32 is on cv-hpcn1
Process 10 of 32 is on cv-hpcn1
Process 12 of 32 is on cv-hpcn1
Process 14 of 32 is on cv-hpcn1
Process 15 of 32 is on cv-hpcn1
Process 0 of 32 is on cv-hpcn1
Process 5 of 32 is on cv-hpcn1
Process 8 of 32 is on cv-hpcn1
Process 13 of 32 is on cv-hpcn1
Process 11 of 32 is on cv-hpcn1
Process 3 of 32 is on cv-hpcn1
Process 9 of 32 is on cv-hpcn1
Process 7 of 32 is on cv-hpcn1
Process 4 of 32 is on cv-hpcn1
Process 6 of 32 is on cv-hpcn1
Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7ffff0ab2320, rbuf=0x7ffff0ab2328, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(779)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
MPIR_Reduce_intra(799)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(206).......: Failure during collective
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1 at cv-hpcn2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1 at cv-hpcn2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1 at cv-hpcn2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at cv-hpcn2] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at cv-hpcn2] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at cv-hpcn2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec at cv-hpcn2] main (./ui/mpich/mpiexec.c:330): process manager error waiting for completion
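For reference, the hostfile passed with -f above would list both nodes and their slot counts; a minimal sketch of what /usr/local/apps/hosts might contain (assuming 16 slots per node; the actual file may differ):

cv-hpcn1:16
cv-hpcn2:16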
Thanks for any advice. I've built many different versions of MPICH many times and this one has me scratching my head. I'm rebuilding everything identically on a pair of small VMs now to see if I can reproduce the behavior outside of these systems.
~Mike C.
-----Original Message-----
From: Pavan Balaji [mailto:balaji at mcs.anl.gov]
Sent: Friday, January 11, 2013 9:20 PM
To: Michael Colonno
Cc: discuss at mpich.org
Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce
On 01/11/2013 09:18 PM US Central Time, Michael Colonno wrote:
> The first output below is from cxxcpi; I can also run cpi if it's helpful.
Ah, sorry, I didn't realize you were in fact running one of the test
programs distributed in MPICH.
Based on the error, some of the nodes are not able to connect to each other
based on the hostnames they are publishing. Did you try the solutions
listed on the FAQ:
http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes
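(The usual first check from that FAQ entry is that each node resolves the other's hostname to a routable address rather than a loopback entry in /etc/hosts; a sketch of the check, with purely illustrative addresses:

cv-hpcn1$ getent hosts cv-hpcn1 cv-hpcn2
10.1.1.1        cv-hpcn1
10.1.1.2        cv-hpcn2

If either name resolves to 127.0.0.1, correcting that /etc/hosts entry, opening any firewalled ports between the nodes, or pinning the network interface with Hydra's -iface option are the typical remedies.)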
-- Pavan
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji