[mpich-discuss] Support for MIC in mpich2-1.5

John Fettig john.fettig at gmail.com
Fri Nov 30 16:36:17 CST 2012


Any thoughts about this?

Regards,
John


On Tue, Nov 13, 2012 at 5:07 PM, John Fettig <john.fettig at gmail.com> wrote:

> On Mon, Nov 5, 2012 at 9:37 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
>>
>> On 11/05/12 13:12, John Fettig wrote:
>>
>>> I believe I have a working build; I'll append my cross file to the end
>>> of this email if anybody else wants to try it.
>>>
>>
>> Thanks!
>>
>>
>>> I have a follow-up question: is there any support for launching jobs
>>> that use both the MIC and the host CPU?
>>>
>>
>> Yes.  Once you have set up MPICH on both the host and the MIC, you can
>> launch jobs across them.
>>
>> If you didn't pass a device option to configure, it'll use TCP/IP, which
>> is very slow.  If you configure with --with-device=ch3:nemesis:scif, it'll
>> use the SCIF protocol, which is much faster.
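>>
>> For example, a SCIF-enabled build can be configured along these lines
>> (the install prefix here is only illustrative, and the MIC side still
>> needs its own cross-compiled build on top of this):
>>
>> $ ./configure --with-device=ch3:nemesis:scif --prefix=/opt/mpich-scif
>> $ make && make install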
>>
>
> I compiled examples/hellow.c for both the MIC and the host CPU, and copied
> the MIC binary to the card.  This seems to work:
>
> $ mpiexec -hosts 172.31.1.1:1,172.31.1.254:1 -n 1 ./hellow.mic : -n 1 ./hellow
> Hello world from process 1 of 2
> Hello world from process 0 of 2
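>
> For reference, hellow.c is essentially the canonical MPI hello world,
> along these lines:
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     int rank, size;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     printf("Hello world from process %d of %d\n", rank, size);
>     MPI_Finalize();
>     return 0;
> }
>
> and I built and staged the two binaries roughly like this (the MIC-side
> install path is just a stand-in for wherever the cross-built MPICH lives):
>
> $ mpicc examples/hellow.c -o hellow                            # host build
> $ /opt/mpich-mic/bin/mpicc examples/hellow.c -o hellow.mic     # MIC cross build
> $ scp hellow.mic mic0:                                         # copy to the card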
>
> However, if I try to run more processes, it crashes:
>
> $ mpiexec -hosts 172.31.1.1:3,172.31.1.254:3 -n 3 ./hellow.mic : -n 3 ./hellow
> Hello world from process 4 of 6
> Hello world from process 0 of 6
> Hello world from process 3 of 6
> Hello world from process 1 of 6
>  0:  3: 00000033: 00000042: readv err 0
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(175)........:
> state_commrdy_handler(138)........:
> MPID_nem_scif_recv_handler(115)...: Communication error with rank 3
> MPID_nem_scif_recv_handler(35)....: scif_scif_read failed (scif_scif_read failed with error 'Success')
>  1:  3: 00000033: 00000042: readv err 0
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(175)........:
> state_commrdy_handler(138)........:
> MPID_nem_scif_recv_handler(115)...: Communication error with rank 3
> MPID_nem_scif_recv_handler(35)....: scif_scif_read failed (scif_scif_read failed with error 'Success')
> Hello world from process 5 of 6
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(184)........: poll of socket fds failed
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(184)........: poll of socket fds failed
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0 at mic0.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
> [proxy:0:0 at mic0.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at mic0.local] main (./pm/pmiserv/pmip.c:210): demux engine error waiting for event
> [mpiexec at host] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec at host] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec at host] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
> [mpiexec at host] main (./ui/mpich/mpiexec.c:325): process manager error waiting for completion
>
> Any ideas?
>
> John
>