[mpich-discuss] Support for MIC in mpich2-1.5
Pavan Balaji
balaji at mcs.anl.gov
Sat Dec 1 09:46:32 CST 2012
Sorry for the delay in responding, John. I'm catching up on my email.
Looks like a bug in the scif code. I tried to reproduce it on my
machine, but I couldn't. Can you give a little more information about the setup?
-- Pavan
On 11/30/2012 04:36 PM US Central Time, John Fettig wrote:
> Any thoughts about this?
>
> Regards,
> John
>
>
> On Tue, Nov 13, 2012 at 5:07 PM, John Fettig <john.fettig at gmail.com> wrote:
>
> On Mon, Nov 5, 2012 at 9:37 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
>
> On 11/05/12 13:12, John Fettig wrote:
>
> I believe I have a working build; I'll append my cross file
> to the end of this email in case anybody else wants to try it.
>
>
> Thanks!
>
>
> I have a follow-up question: is there any support for
> launching jobs that use both the MIC and the host CPU?
>
>
> Yes. Once you have set up MPICH on both the host and the MIC,
> you can launch jobs across them.
>
> If you didn't pass any device configure option, it'll use
> TCP/IP, which is very slow. If you configure with
> --with-device=ch3:nemesis:scif, it'll use the SCIF protocol,
> which is much faster.
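>
> Something like the following two builds should work (the install
> prefixes, the cross file name, and the MIC compiler flags below
> are illustrative assumptions, not exact commands):
>
> # host build with the SCIF netmod
> $ ./configure --with-device=ch3:nemesis:scif \
>     --prefix=/opt/mpich2-1.5/host       # prefix is an assumption
> $ make && make install
>
> # MIC cross-build: "icc -mmic" targets the coprocessor;
> # cross_values.txt is a placeholder for your cross file
> $ ./configure --with-device=ch3:nemesis:scif \
>     --host=x86_64-k1om-linux --with-cross=cross_values.txt \
>     CC="icc -mmic" CXX="icpc -mmic" \
>     --prefix=/opt/mpich2-1.5/mic
> $ make && make install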
>
>
> I compiled examples/hellow.c for both the MIC and the host CPU
> (a sketch of the build commands is below), and copied the MIC
> binary to the card. This seems to work:
>
> $ mpiexec -hosts 172.31.1.1:1,172.31.1.254:1 -n 1 ./hellow.mic : -n 1 ./hellow
> Hello world from process 1 of 2
> Hello world from process 0 of 2
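>
> (For reference, roughly how I built the two binaries; the
> wrapper path for the MIC build is a placeholder, not my real
> install path:)
>
> $ mpicc -o hellow examples/hellow.c      # host binary
> $ /opt/mpich2-mic/bin/mpicc -o hellow.mic \
>     examples/hellow.c                    # MIC binary, built with
>                                          # the cross-built wrapper
>                                          # (path is a placeholder)
> $ scp hellow.mic mic0:                   # copy to the card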
>
> However, if I try to run more processes it crashes:
>
> $ mpiexec -hosts 172.31.1.1:3,172.31.1.254:3 -n 3 ./hellow.mic : -n 3 ./hellow
> Hello world from process 4 of 6
> Hello world from process 0 of 6
> Hello world from process 3 of 6
> Hello world from process 1 of 6
> 0: 3: 00000033: 00000042: readv err 0
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the
> device was waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(175)........:
> state_commrdy_handler(138)........:
> MPID_nem_scif_recv_handler(115)...: Communication error with rank 3
> MPID_nem_scif_recv_handler(35)....: scif_scif_read failed
> (scif_scif_read failed with error 'Success')
> 1: 3: 00000033: 00000042: readv err 0
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the
> device was waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(175)........:
> state_commrdy_handler(138)........:
> MPID_nem_scif_recv_handler(115)...: Communication error with rank 3
> MPID_nem_scif_recv_handler(35)....: scif_scif_read failed
> (scif_scif_read failed with error 'Success')
> Hello world from process 5 of 6
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the
> device was waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(184)........: poll of socket fds failed
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the
> device was waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(184)........: poll of socket fds failed
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 1
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0 at mic0.local] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
> [proxy:0:0 at mic0.local] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at mic0.local] main (./pm/pmiserv/pmip.c:210): demux engine
> error waiting for event
> [mpiexec at host] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
> terminated badly; aborting
> [mpiexec at host] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> waiting for completion
> [mpiexec at host] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting
> for completion
> [mpiexec at host] main (./ui/mpich/mpiexec.c:325): process manager
> error waiting for completion
>
> Any ideas?
>
> John
>
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji