Hi Pavan ~

Thanks for the reply. I did verify connectivity before trying the program. In this test case I have two systems, each with 16 cores (cv-hpcn1 and cv-hpcn2). As far as I can tell, ssh works in both directions between the two systems:

cv-hpcn1$ ssh cv-hpcn1 date
Sat Jan 12 10:20:01 PST 2013
cv-hpcn1$ ssh cv-hpcn2 date
Sat Jan 12 10:19:23 PST 2013
cv-hpcn1$ ssh cv-hpcn2
cv-hpcn2$ ssh cv-hpcn1 date
Sat Jan 12 10:20:43 PST 2013
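For reference, the machinefile passed via -f is nothing exotic: just the two nodes with 16 slots each in Hydra's host:cores format, i.e. something like this (reconstructed here rather than copied from the file, so treat it as illustrative):

cv-hpcn1:16
cv-hpcn2:16

(A bare hostname with no :count defaults to a single slot in Hydra, so the :16 is what lets all 16 cores per node be used.)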
Here's the full test run of cxxcpi:

cv-hpcn1$ /usr/local/apps/MPICH2/standalone/bin/mpirun -n 32 -f /usr/local/apps/hosts /usr/local/apps/cxxcpi
Process 17 of 32 is on cv-hpcn2
Process 19 of 32 is on cv-hpcn2
Process 20 of 32 is on cv-hpcn2
Process 21 of 32 is on cv-hpcn2
Process 22 of 32 is on cv-hpcn2
Process 23 of 32 is on cv-hpcn2
Process 24 of 32 is on cv-hpcn2
Process 26 of 32 is on cv-hpcn2
Process 28 of 32 is on cv-hpcn2
Process 29 of 32 is on cv-hpcn2
Process 30 of 32 is on cv-hpcn2
Process 31 of 32 is on cv-hpcn2
Process 16 of 32 is on cv-hpcn2
Process 18 of 32 is on cv-hpcn2
Process 25 of 32 is on cv-hpcn2
Process 27 of 32 is on cv-hpcn2
Process 1 of 32 is on cv-hpcn1
Process 2 of 32 is on cv-hpcn1
Process 10 of 32 is on cv-hpcn1
Process 12 of 32 is on cv-hpcn1
Process 14 of 32 is on cv-hpcn1
Process 15 of 32 is on cv-hpcn1
Process 0 of 32 is on cv-hpcn1
Process 5 of 32 is on cv-hpcn1
Process 8 of 32 is on cv-hpcn1
Process 13 of 32 is on cv-hpcn1
Process 11 of 32 is on cv-hpcn1
Process 3 of 32 is on cv-hpcn1
Process 9 of 32 is on cv-hpcn1
Process 7 of 32 is on cv-hpcn1
Process 4 of 32 is on cv-hpcn1
Process 6 of 32 is on cv-hpcn1
Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7ffff0ab2320, rbuf=0x7ffff0ab2328, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(779)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
MPIR_Reduce_intra(799)..........:
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(206).......: Failure during collective

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@cv-hpcn2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1@cv-hpcn2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@cv-hpcn2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@cv-hpcn2] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@cv-hpcn2] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@cv-hpcn2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@cv-hpcn2] main (./ui/mpich/mpiexec.c:330): process manager error waiting for completion
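Since the reduce dies communicating with rank 16, which is the first rank on cv-hpcn2, my working theory is a hostname/address problem rather than ssh: ssh is only used to launch the proxies, and once up, the MPI processes open their own TCP connections using the hostnames each process publishes. Next I'll confirm that both nodes resolve both names consistently; a quick sketch of the check (assuming standard Linux name-resolution tools):

# Confirm each node's view of both hostnames (sketch; run from cv-hpcn1)
for node in cv-hpcn1 cv-hpcn2; do
    echo "--- $node ---"
    ssh "$node" 'hostname; hostname -f; getent hosts cv-hpcn1 cv-hpcn2'
done

If either node maps its own name to 127.0.0.1, or the two nodes disagree about an address, that would match the FAQ symptom.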
New"'>===================================================================================<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>= EXIT CODE: 1<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>= CLEANING UP REMAINING PROCESSES<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>===================================================================================<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>[proxy:0:1@cv-hpcn2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>[proxy:0:1@cv-hpcn2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>[proxy:0:1@cv-hpcn2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>[mpiexec@cv-hpcn2] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>[mpiexec@cv-hpcn2] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>[mpiexec@cv-hpcn2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion<o:p></o:p></span></p><p class=MsoPlainText><span style='font-family:"Courier New"'>[mpiexec@cv-hpcn2] main (./ui/mpich/mpiexec.c:330): process manager error waiting for completion<o:p></o:p></span></p><p class=MsoPlainText><o:p> </o:p></p><p class=MsoPlainText> Thanks for any advice. I’ve built many different versions of MPICH many times and this one has me scratching me head. I’m rebuilding everything identically on a pair of little VMs now to see if I can repeat the behavior outside of these systems. <o:p></o:p></p><p class=MsoPlainText><o:p> </o:p></p><p class=MsoPlainText> ~Mike C. <o:p></o:p></p><p class=MsoPlainText><o:p> </o:p></p><p class=MsoPlainText>-----Original Message-----<br>From: Pavan Balaji [mailto:balaji@mcs.anl.gov] <br>Sent: Friday, January 11, 2013 9:20 PM<br>To: Michael Colonno<br>Cc: discuss@mpich.org<br>Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce</p><p class=MsoPlainText><o:p> </o:p></p><p class=MsoPlainText><o:p> </o:p></p><p class=MsoPlainText>On 01/11/2013 09:18 PM US Central Time, Michael Colonno wrote:<o:p></o:p></p><p class=MsoPlainText>> The first output below is from cxxcpi; I can also run cpi is it's helpful.<o:p></o:p></p><p class=MsoPlainText><o:p> </o:p></p><p class=MsoPlainText>Ah, sorry, I didn't realize you were in fact running one of the test programs distributed in MPICH.<o:p></o:p></p><p class=MsoPlainText><o:p> </o:p></p><p class=MsoPlainText>Based on the error, some of the nodes are not able to connect to each other based on the hostnames they are publishing. 
-----Original Message-----
From: Pavan Balaji [mailto:balaji@mcs.anl.gov]
Sent: Friday, January 11, 2013 9:20 PM
To: Michael Colonno
Cc: discuss@mpich.org
Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce

On 01/11/2013 09:18 PM US Central Time, Michael Colonno wrote:
> The first output below is from cxxcpi; I can also run cpi if it's helpful.

Ah, sorry, I didn't realize you were in fact running one of the test programs distributed with MPICH.

Based on the error, some of the nodes are not able to connect to each other using the hostnames they are publishing. Did you try the solutions listed in the FAQ:

http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes

 -- Pavan

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji