[mpich-discuss] Support for MIC in mpich2-1.5

John Fettig john.fettig at gmail.com
Tue Nov 13 16:07:38 CST 2012


On Mon, Nov 5, 2012 at 9:37 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

>
> On 11/05/12 13:12, John Fettig wrote:
>
>> I believe I have a working build; I'll append my cross file to the end
>> of this email if anybody else wants to try it.
>>
>
> Thanks!
>
>
>> I have a followup question:  is there any support for launching jobs
>> that use both the MIC and the host CPU?
>>
>
> Yes.  Once you have set up MPICH on both the host and MIC, you can launch
> jobs across them.
>
> If you didn't pass any configure option, it'll use TCP/IP, which is very
> slow.  If you configure with --with-device=ch3:nemesis:scif, it'll use the
> SCIF protocol, which is much faster.
>

I compiled examples/hellow.c for both the MIC and the host CPU, and copied
the MIC binary to the card.  This seems to work:

$ mpiexec -hosts 172.31.1.1:1,172.31.1.254:1 -n 1 ./hellow.mic : -n 1 ./hellow
Hello world from process 1 of 2
Hello world from process 0 of 2
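
For anyone reproducing this, here is a minimal sketch of how that command
line is put together. The variable names are hypothetical (they are not part
of MPICH), and which address is the card and which is the host is an
assumption based on the order of the executables:

```shell
#!/bin/sh
# Sketch only: variable names are illustrative, not part of MPICH.
MIC_ADDR=172.31.1.1      # assumed to be the card (listed first, runs ./hellow.mic)
HOST_ADDR=172.31.1.254   # assumed to be the host CPU (runs ./hellow)
NP=1                     # processes per side; 1 is the case reported as working

# One "address:slots" pair per machine in -hosts, then one "-n N exe"
# segment per binary, with the segments separated by " : ".
cmd="mpiexec -hosts ${MIC_ADDR}:${NP},${HOST_ADDR}:${NP} -n ${NP} ./hellow.mic : -n ${NP} ./hellow"

# Print the command rather than running it, so it can be checked first.
echo "$cmd"
```

Setting NP=3 gives the six-process command that fails below.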

However, if I try to run more processes, it crashes in MPI_Finalize:

$ mpiexec -hosts 172.31.1.1:3,172.31.1.254:3 -n 3 ./hellow.mic : -n 3 ./hellow
Hello world from process 4 of 6
Hello world from process 0 of 6
Hello world from process 3 of 6
Hello world from process 1 of 6
 0:  3: 00000033: 00000042: readv err 0
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(293).................: MPI_Finalize failed
MPI_Finalize(213).................:
MPID_Finalize(117)................:
MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was waiting for all open connections to close
MPIDI_CH3I_Progress(367)..........:
MPID_nem_mpich2_blocking_recv(904):
state_commrdy_handler(175)........:
state_commrdy_handler(138)........:
MPID_nem_scif_recv_handler(115)...: Communication error with rank 3
MPID_nem_scif_recv_handler(35)....: scif_scif_read failed (scif_scif_read failed with error 'Success')
 1:  3: 00000033: 00000042: readv err 0
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(293).................: MPI_Finalize failed
MPI_Finalize(213).................:
MPID_Finalize(117)................:
MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was waiting for all open connections to close
MPIDI_CH3I_Progress(367)..........:
MPID_nem_mpich2_blocking_recv(904):
state_commrdy_handler(175)........:
state_commrdy_handler(138)........:
MPID_nem_scif_recv_handler(115)...: Communication error with rank 3
MPID_nem_scif_recv_handler(35)....: scif_scif_read failed (scif_scif_read failed with error 'Success')
Hello world from process 5 of 6
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(293).................: MPI_Finalize failed
MPI_Finalize(213).................:
MPID_Finalize(117)................:
MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was waiting for all open connections to close
MPIDI_CH3I_Progress(367)..........:
MPID_nem_mpich2_blocking_recv(904):
state_commrdy_handler(184)........: poll of socket fds failed
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(293).................: MPI_Finalize failed
MPI_Finalize(213).................:
MPID_Finalize(117)................:
MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was waiting for all open connections to close
MPIDI_CH3I_Progress(367)..........:
MPID_nem_mpich2_blocking_recv(904):
state_commrdy_handler(184)........: poll of socket fds failed

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at mic0.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
[proxy:0:0 at mic0.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at mic0.local] main (./pm/pmiserv/pmip.c:210): demux engine error waiting for event
[mpiexec at host] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at host] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at host] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
[mpiexec at host] main (./ui/mpich/mpiexec.c:325): process manager error waiting for completion

Any ideas?

John