[mpich-discuss] Support for MIC in mpich2-1.5

Pavan Balaji balaji at mcs.anl.gov
Sat Dec 1 09:46:32 CST 2012


Sorry for the delay in responding, John.  I'm catching up on my email.

Looks like a bug in the scif code.  I tried to reproduce it on my
machine, but I couldn't.  Can you give a little more information about your setup?

 -- Pavan

On 11/30/2012 04:36 PM US Central Time, John Fettig wrote:
> Any thoughts about this?
> 
> Regards,
> John
> 
> 
> On Tue, Nov 13, 2012 at 5:07 PM, John Fettig <john.fettig at gmail.com> wrote:
> 
>     On Mon, Nov 5, 2012 at 9:37 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> 
> 
>         On 11/05/12 13:12, John Fettig wrote:
> 
>             I believe I have a working build; I'll append my cross file to
>             the end of this email in case anybody else wants to try it.
> 
> 
>         Thanks!
> 
> 
>             I have a follow-up question: is there any support for launching
>             jobs that use both the MIC and the host CPU?
> 
> 
>         Yes.  Once you have set up MPICH on both the host and the MIC, you
>         can launch jobs across them.
> 
>         If you didn't pass a device configure option, it'll use TCP/IP,
>         which is very slow.  If you configure with
>         --with-device=ch3:nemesis:scif, it'll use the SCIF protocol, which
>         is much faster.
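>
>         For example, a minimal configure invocation might look like the
>         following (the install prefix is just a placeholder, and the MIC
>         build additionally needs the cross-compilation file discussed
>         earlier in the thread):
>
>             # build mpich2-1.5 with the nemesis SCIF netmod enabled
>             ./configure --with-device=ch3:nemesis:scif \
>                         --prefix=/opt/mpich2-1.5-scif
>             make && make install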
> 
> 
>     I compiled examples/hellow.c for both the MIC and the host CPU, and
>     copied the MIC binary to the card.  This seems to work:
> 
>     $ mpiexec -hosts 172.31.1.1:1,172.31.1.254:1 -n 1 ./hellow.mic : -n 1 ./hellow
>     Hello world from process 1 of 2
>     Hello world from process 0 of 2
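>
>     (For context, hellow.c is essentially a minimal MPI hello world; a
>     sketch of the relevant logic, not the exact file shipped with MPICH:)
>
>         #include <stdio.h>
>         #include "mpi.h"
>
>         int main(int argc, char *argv[])
>         {
>             int rank, size;
>             MPI_Init(&argc, &argv);                /* start up MPI */
>             MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
>             MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of ranks */
>             printf("Hello world from process %d of %d\n", rank, size);
>             MPI_Finalize();
>             return 0;
>         }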
> 
>     However, if I try to run more processes it crashes:
> 
>     $ mpiexec -hosts 172.31.1.1:3,172.31.1.254:3 -n 3 ./hellow.mic : -n 3 ./hellow
>     Hello world from process 4 of 6
>     Hello world from process 0 of 6
>     Hello world from process 3 of 6
>     Hello world from process 1 of 6
>      0:  3: 00000033: 00000042: readv err 0
>     Fatal error in MPI_Finalize: Other MPI error, error stack:
>     MPI_Finalize(293).................: MPI_Finalize failed
>     MPI_Finalize(213).................:
>     MPID_Finalize(117)................:
>     MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the
>     device was waiting for all open connections to close
>     MPIDI_CH3I_Progress(367)..........:
>     MPID_nem_mpich2_blocking_recv(904):
>     state_commrdy_handler(175)........:
>     state_commrdy_handler(138)........:
>     MPID_nem_scif_recv_handler(115)...: Communication error with rank 3
>     MPID_nem_scif_recv_handler(35)....: scif_scif_read failed
>     (scif_scif_read failed with error 'Success')
>      1:  3: 00000033: 00000042: readv err 0
>     Fatal error in MPI_Finalize: Other MPI error, error stack:
>     MPI_Finalize(293).................: MPI_Finalize failed
>     MPI_Finalize(213).................:
>     MPID_Finalize(117)................:
>     MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the
>     device was waiting for all open connections to close
>     MPIDI_CH3I_Progress(367)..........:
>     MPID_nem_mpich2_blocking_recv(904):
>     state_commrdy_handler(175)........:
>     state_commrdy_handler(138)........:
>     MPID_nem_scif_recv_handler(115)...: Communication error with rank 3
>     MPID_nem_scif_recv_handler(35)....: scif_scif_read failed
>     (scif_scif_read failed with error 'Success')
>     Hello world from process 5 of 6
>     Fatal error in MPI_Finalize: Other MPI error, error stack:
>     MPI_Finalize(293).................: MPI_Finalize failed
>     MPI_Finalize(213).................:
>     MPID_Finalize(117)................:
>     MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the
>     device was waiting for all open connections to close
>     MPIDI_CH3I_Progress(367)..........:
>     MPID_nem_mpich2_blocking_recv(904):
>     state_commrdy_handler(184)........: poll of socket fds failed
>     Fatal error in MPI_Finalize: Other MPI error, error stack:
>     MPI_Finalize(293).................: MPI_Finalize failed
>     MPI_Finalize(213).................:
>     MPID_Finalize(117)................:
>     MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the
>     device was waiting for all open connections to close
>     MPIDI_CH3I_Progress(367)..........:
>     MPID_nem_mpich2_blocking_recv(904):
>     state_commrdy_handler(184)........: poll of socket fds failed
> 
>     ===================================================================================
>     =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>     =   EXIT CODE: 1
>     =   CLEANING UP REMAINING PROCESSES
>     =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>     ===================================================================================
>     [proxy:0:0 at mic0.local] HYD_pmcd_pmip_control_cmd_cb
>     (./pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
>     [proxy:0:0 at mic0.local] HYDT_dmxu_poll_wait_for_event
>     (./tools/demux/demux_poll.c:77): callback returned error status
>     [proxy:0:0 at mic0.local] main (./pm/pmiserv/pmip.c:210): demux engine
>     error waiting for event
>     [mpiexec at host] HYDT_bscu_wait_for_completion
>     (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>     terminated badly; aborting
>     [mpiexec at host] HYDT_bsci_wait_for_completion
>     (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
>     waiting for completion
>     [mpiexec at host] HYD_pmci_wait_for_completion
>     (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting
>     for completion
>     [mpiexec at host] main (./ui/mpich/mpiexec.c:325): process manager
>     error waiting for completion
> 
>     Any ideas?
> 
>     John
> 
> 

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji


