[mpich-discuss] Fatal error in PMPI_Reduce

Michael Colonno mcolonno at stanford.edu
Sat Jan 12 12:29:33 CST 2013


            Hi Pavan ~

 

            Thanks for the reply. I did verify connectivity before trying
the program. In this test case I have two systems, each with 16 cores
(cv-hpcn1 and cv-hpcn2). As far as I can tell, ssh works in both directions
between the two systems:

 

cv-hpcn1$ ssh cv-hpcn1 date
Sat Jan 12 10:20:01 PST 2013
cv-hpcn1$ ssh cv-hpcn2 date
Sat Jan 12 10:19:23 PST 2013
cv-hpcn1$ ssh cv-hpcn2
cv-hpcn2$ ssh cv-hpcn1 date
Sat Jan 12 10:20:43 PST 2013
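
            For what it's worth, I realize ssh only exercises the launcher
path; the MPI processes themselves open direct TCP connections using
whatever hostnames the nodes publish. A quick check of what each node
actually resolves (just a generic diagnostic sketch on my part, nothing
MPICH-specific):

cv-hpcn1$ getent hosts cv-hpcn1 cv-hpcn2
cv-hpcn1$ ssh cv-hpcn2 getent hosts cv-hpcn1 cv-hpcn2

If either name resolves to a loopback address (127.x.x.x), cross-node MPI
traffic will fail even though ssh succeeds.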

 

            Here's the full test run of cxxcpi: 

 

cv-hpcn1$ /usr/local/apps/MPICH2/standalone/bin/mpirun -n 32 -f /usr/local/apps/hosts /usr/local/apps/cxxcpi

Process 17 of 32 is on cv-hpcn2
Process 19 of 32 is on cv-hpcn2
Process 20 of 32 is on cv-hpcn2
Process 21 of 32 is on cv-hpcn2
Process 22 of 32 is on cv-hpcn2
Process 23 of 32 is on cv-hpcn2
Process 24 of 32 is on cv-hpcn2
Process 26 of 32 is on cv-hpcn2
Process 28 of 32 is on cv-hpcn2
Process 29 of 32 is on cv-hpcn2
Process 30 of 32 is on cv-hpcn2
Process 31 of 32 is on cv-hpcn2
Process 16 of 32 is on cv-hpcn2
Process 18 of 32 is on cv-hpcn2
Process 25 of 32 is on cv-hpcn2
Process 27 of 32 is on cv-hpcn2
Process 1 of 32 is on cv-hpcn1
Process 2 of 32 is on cv-hpcn1
Process 10 of 32 is on cv-hpcn1
Process 12 of 32 is on cv-hpcn1
Process 14 of 32 is on cv-hpcn1
Process 15 of 32 is on cv-hpcn1
Process 0 of 32 is on cv-hpcn1
Process 5 of 32 is on cv-hpcn1
Process 8 of 32 is on cv-hpcn1
Process 13 of 32 is on cv-hpcn1
Process 11 of 32 is on cv-hpcn1
Process 3 of 32 is on cv-hpcn1
Process 9 of 32 is on cv-hpcn1
Process 7 of 32 is on cv-hpcn1
Process 4 of 32 is on cv-hpcn1
Process 6 of 32 is on cv-hpcn1

Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7ffff0ab2320, rbuf=0x7ffff0ab2328, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(779)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
MPIR_Reduce_intra(799)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(206).......: Failure during collective

 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

[proxy:0:1 at cv-hpcn2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1 at cv-hpcn2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1 at cv-hpcn2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at cv-hpcn2] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at cv-hpcn2] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at cv-hpcn2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec at cv-hpcn2] main (./ui/mpich/mpiexec.c:330): process manager error waiting for completion
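
            One more data point: the stack shows the reduce dying when rank
0 (on cv-hpcn1) receives from rank 16, the first rank on cv-hpcn2, which
points at node-to-node traffic rather than the program itself. The failing
call is just a plain sum-reduce; a stripped-down stand-in (not the actual
cxxcpi source; the pi math is replaced by a dummy value) would be:

#include <mpi.h>
#include <cstdio>

// Minimal stand-in for the reduce that fails above: every rank
// contributes one double and rank 0 collects the sum.
int main(int argc, char *argv[])
{
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int size = MPI::COMM_WORLD.Get_size();

    double mine  = 1.0 / size;  // dummy per-rank contribution
    double total = 0.0;

    // Same shape as the failed call: count=1, MPI_DOUBLE, MPI_SUM, root=0
    MPI::COMM_WORLD.Reduce(&mine, &total, 1, MPI::DOUBLE, MPI::SUM, 0);

    if (rank == 0)
        std::printf("total = %f (expect 1.0)\n", total);

    MPI::Finalize();
    return 0;
}

If this fails the same way when built with mpicxx and launched across the
two nodes, that would confirm connectivity rather than anything in cxxcpi.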

 

            Thanks for any advice. I've built many versions of MPICH over
the years and this one has me scratching my head. I'm rebuilding everything
identically on a pair of small VMs now to see whether I can reproduce the
behavior outside of these systems.

 

            ~Mike C. 

 

-----Original Message-----
From: Pavan Balaji [mailto:balaji at mcs.anl.gov] 
Sent: Friday, January 11, 2013 9:20 PM
To: Michael Colonno
Cc: discuss at mpich.org
Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce

 

 

On 01/11/2013 09:18 PM US Central Time, Michael Colonno wrote:

> The first output below is from cxxcpi; I can also run cpi if it's helpful.

 

Ah, sorry, I didn't realize you were in fact running one of the test
programs distributed in MPICH.

 

The error indicates that some of the nodes cannot connect to each other
using the hostnames they are publishing.  Did you try the solutions listed
in the FAQ:

 

 
http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes
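
In particular, the most common cause is an /etc/hosts entry that maps a
node's own hostname to a loopback address.  Something like this (an
illustrative entry, not taken from your systems):

127.0.0.1    localhost
127.0.1.1    cv-hpcn1

would make cv-hpcn1 publish 127.0.1.1 as its address, which cv-hpcn2 can
never reach, even though ssh works fine.  Each hostname needs to resolve to
an address that is reachable from the other node.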

 

-- Pavan

 

--

Pavan Balaji

http://www.mcs.anl.gov/~balaji
