[mpich-discuss] MPICH (3.0.3) w/ SCIF
Eric A. Borisch
eborisch at ieee.org
Tue Apr 9 14:39:27 CDT 2013
I'm trying to get MPICH (3.0.3) and SCIF working (on an Intel Xeon Phi card).
I'm using the tests from osu_benchmarks (from the MVAPICH2 tarball) as a set of
sanity checks, and I'm running into some unexpected errors.
One example: running osu_mbw_mr sometimes works and then fails on the next
try. The printouts from two successive runs, as well as the hosts file, are
below. I especially like the "scif_scif_read failed with error 'Success'"
message. :)
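(My guess, purely for illustration: that message looks like what happens when an
error path formats strerror(errno) after a read-style call that signaled failure
by returning 0 (peer went away), so errno was never set. A minimal sketch of that
pattern, not the actual MPICH code:

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    errno = 0;
    int n = 0;   /* pretend a readv()/scif read returned 0: peer closed, errno untouched */
    if (n <= 0)
        fprintf(stderr, "read failed with error '%s'\n", strerror(errno));
    return 0;
}

This prints "read failed with error 'Success'".)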
Any thoughts? Or is this something to take up with Intel? The compiler is the
latest icc (13.1.1); latest MPSS (2-2.1.5889-14); CentOS 6.4.
Thanks,
Eric
This particular test should set up four pairs of processes, each pair with one
process on the host and one on the Phi, and have the two sides communicate
(rank 0<->4, ...). Things seem to be more stable with one or two pairs of
processes, but that's not really the desired use case...
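For reference, the communication pattern I'm expecting looks roughly like the
sketch below (my own minimal approximation for illustration, not the actual
osu_mbw_mr source; message size, window, and iteration counts are arbitrary):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE 4096   /* bytes per message (illustrative) */
#define WINDOW   64     /* messages in flight per iteration */
#define ITERS    100    /* timed iterations */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* With 2*N ranks, rank i (host) pairs with rank i+N (Phi). */
    int pairs = size / 2;
    int peer  = (rank < pairs) ? rank + pairs : rank - pairs;

    char *buf = malloc((size_t)WINDOW * MSG_SIZE);
    memset(buf, rank, (size_t)WINDOW * MSG_SIZE);
    MPI_Request req[WINDOW];
    char ack = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int it = 0; it < ITERS; it++) {
        if (rank < pairs) {
            /* Host side: push a window of messages, then wait for an ack. */
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf + (size_t)w * MSG_SIZE, MSG_SIZE, MPI_CHAR,
                          peer, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            /* Phi side: drain the window, then send the ack. */
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf + (size_t)w * MSG_SIZE, MSG_SIZE, MPI_CHAR,
                          peer, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("rank 0 pair: ~%.2f MB/s\n",
               (double)ITERS * WINDOW * MSG_SIZE / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

The two runs below were launched back to back with the same command and host
file; the first completes, the second dies in the barrier.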
[eborisch at rt5 osu_benchmarks]$ mpiexec -map-by rr -n 4 native/osu_mbw_mr :
-n 4 mic/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test
# [ pairs: 4 ] [ window size: 64 ]
# Size MB/s Messages/s
1 0.23 230138.66
2 0.46 229579.06
4 0.92 231201.97
8 1.85 231515.16
16 3.63 226781.39
32 6.92 216285.18
64 12.65 197678.16
128 25.21 196946.20
256 50.46 197106.54
512 86.11 168184.28
1024 132.69 129577.13
2048 180.60 88183.67
4096 179.81 43898.89
8192 358.07 43710.21
16384 696.33 42500.74
32768 1364.41 41638.46
65536 2737.42 41769.74
131072 4657.86 35536.68
262144 6160.59 23500.77
524288 6584.39 12558.73
1048576 6690.91 6380.95
2097152 6782.58 3234.18
4194304 6789.25 1618.68
(Note: seems ~reasonable for pushing data one direction over 16x PCIe 2.0.)
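(For scale: x16 PCIe 2.0 is nominally 16 lanes x 500 MB/s = 8 GB/s per
direction, so ~6.79 GB/s is roughly 85% of the theoretical peak before
protocol overhead.)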
[eborisch at rt5 osu_benchmarks]$ mpiexec -map-by rr -n 4 native/osu_mbw_mr :
-n 4 mic/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test
# [ pairs: 4 ] [ window size: 64 ]
# Size MB/s Messages/s
0: 5: 00000051: 00000060: readv err 0
0: 5: 00000052: 00000060: readv err 0
0: 5: 00000053: 00000060: readv err 0
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(426)................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(283)...........:
MPIR_Barrier_or_coll_fn(121).....:
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read
failed with error 'Success')
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read
failed with error 'Success')
MPIR_Barrier_impl(294)...........:
MPIR_Barrier_or_coll_fn(121).....:
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read
failed with error 'Success')
MPIR_Barrier_impl(308)...........:
MPIR_Bcast_impl(1369)............:
MPIR_Bcast_intra(1199)...........:
MPIR_Bcast_binomial(220).........: Failure during collective
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1 at mic0.local] HYDU_sock_write (./utils/sock/sock.c:291): write
error (Broken pipe)
[proxy:0:1 at mic0.local] stdoe_cb (./pm/pmiserv/pmip_cb.c:63): sock write
error
[proxy:0:1 at mic0.local] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1 at mic0.local] main (./pm/pmiserv/pmip.c:206): demux engine error
waiting for event
[mpiexec at rt5] HYDT_bscu_wait_for_completion
(./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at rt5] HYDT_bsci_wait_for_completion
(./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec at rt5] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
completion
[mpiexec at rt5] main (./ui/mpich/mpiexec.c:331): process manager error
waiting for completion
Here's the host file (unchanged between runs):
host:4
mic0:4 binding=user:4,8,12,16