[mpich-discuss] MPICH (3.0.3) w/ SCIF

Eric A. Borisch eborisch at ieee.org
Tue Apr 9 14:39:27 CDT 2013


I'm trying to get MPICH (3.0.3) and SCIF working (on an Intel Phi card).

I'm using the tests from osu_benchmarks (from the mvapich2 tarball) as a set
of sanity checks, and I'm running into some unexpected errors.

One example: running osu_mbw_mr works sometimes, and then fails on the next
try. The printout from two successive runs, as well as the hosts file, is
below. I especially like the "scif_scif_read failed with error 'Success'"
message. :)
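
(My guess at where that 'Success' wording comes from, purely as an
illustration and not the actual scif netmod code: a read that returns 0
because the peer went away does not set errno, so formatting the failure
with strerror(errno) prints "Success". A minimal POSIX sketch that
reproduces the pattern:

/* Illustration only: why an error message can end in "error 'Success'".
 * read() returning 0 (EOF / peer closed) leaves errno at 0, so a later
 * strerror(errno) yields "Success". Not MPICH's actual error path. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    errno = 0;
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);  /* 0 on EOF */
    if (n <= 0)
        fprintf(stderr, "read failed with error '%s' (ret=%zd)\n",
                strerror(errno), n);                  /* -> "Success" */
    return 0;
}

Run it with stdin closed, e.g. ./a.out < /dev/null, and it prints
"read failed with error 'Success' (ret=0)".)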

Any thoughts? Or is this something to take up with Intel? Compiler is the
latest icc (13.1.1); latest MPSS (2-2.1.5889-14); CentOS 6.4.

Thanks,
  Eric

This particular test should be setting up four pairs of processes, each pair
with one process on the host and one on the Phi, communicating between the
two (rank 0 <-> 4, ...). Things seem to be more stable with one or two pairs
of processes, but that's not really the desired use case...
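
(For reference, a rough sketch of that pairing/window pattern -- my own
simplification with made-up message and window sizes, not the actual
osu_mbw_mr source: each of the first `pairs` ranks on the host streams a
window of non-blocking sends to rank+pairs on the card, which posts matching
receives and answers with a one-byte ack.

/* Sketch of the pair/window communication pattern described above.
 * Not the osu_mbw_mr source; sizes are arbitrary illustration values. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define WINDOW 64
#define MSG    (64 * 1024)

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int pairs = size / 2;                   /* e.g. 8 ranks -> 4 pairs    */
    char *buf = malloc((size_t)WINDOW * MSG);
    memset(buf, 0, (size_t)WINDOW * MSG);
    MPI_Request req[WINDOW];
    char ack = 0;

    if (rank < pairs) {                     /* host-side rank: sender     */
        int peer = rank + pairs;            /* 0 <-> 4, 1 <-> 5, ...      */
        for (int w = 0; w < WINDOW; w++)
            MPI_Isend(buf + (size_t)w * MSG, MSG, MPI_CHAR, peer, 0,
                      MPI_COMM_WORLD, &req[w]);
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
        MPI_Recv(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {                                /* MIC-side rank: receiver    */
        int peer = rank - pairs;
        for (int w = 0; w < WINDOW; w++)
            MPI_Irecv(buf + (size_t)w * MSG, MSG, MPI_CHAR, peer, 0,
                      MPI_COMM_WORLD, &req[w]);
        MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
        MPI_Send(&ack, 1, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    free(buf);
    MPI_Finalize();
    return 0;
}
)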

[eborisch at rt5 osu_benchmarks]$ mpiexec -map-by rr -n 4 native/osu_mbw_mr :
-n 4 mic/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test
# [ pairs: 4 ] [ window size: 64 ]
# Size                    MB/s          Messages/s
1                         0.23           230138.66
2                         0.46           229579.06
4                         0.92           231201.97
8                         1.85           231515.16
16                        3.63           226781.39
32                        6.92           216285.18
64                       12.65           197678.16
128                      25.21           196946.20
256                      50.46           197106.54
512                      86.11           168184.28
1024                    132.69           129577.13
2048                    180.60            88183.67
4096                    179.81            43898.89
8192                    358.07            43710.21
16384                   696.33            42500.74
32768                  1364.41            41638.46
65536                  2737.42            41769.74
131072                 4657.86            35536.68
262144                 6160.59            23500.77
524288                 6584.39            12558.73
1048576                6690.91             6380.95
2097152                6782.58             3234.18
4194304                6789.25             1618.68
(Note: seems ~reasonable for pushing data one direction over 16x PCIe 2.0.)
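
(A back-of-envelope check on that note, using my own numbers rather than
anything measured: PCIe 2.0 signals at 5 GT/s per lane with 8b/10b encoding,
so

    16 \times 5\,\mathrm{GT/s} \times \tfrac{8}{10}
      = 64\,\mathrm{Gb/s} = 8\,\mathrm{GB/s\ raw},
    \qquad
    8\,\mathrm{GB/s} \times (1 - {\sim}0.15\ \mathrm{protocol\ overhead})
      \approx 6.8\,\mathrm{GB/s},

which is right in line with the ~6789 MB/s reported above.)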
[eborisch at rt5 osu_benchmarks]$ mpiexec -map-by rr -n 4 native/osu_mbw_mr :
-n 4 mic/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test
# [ pairs: 4 ] [ window size: 64 ]
# Size                    MB/s          Messages/s
 0:  5: 00000051: 00000060: readv err 0
 0:  5: 00000052: 00000060: readv err 0
 0:  5: 00000053: 00000060: readv err 0
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(426)................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(283)...........:
MPIR_Barrier_or_coll_fn(121).....:
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read
failed with error 'Success')
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read
failed with error 'Success')
MPIR_Barrier_impl(294)...........:
MPIR_Barrier_or_coll_fn(121).....:
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read
failed with error 'Success')
MPIR_Barrier_impl(308)...........:
MPIR_Bcast_impl(1369)............:
MPIR_Bcast_intra(1199)...........:
MPIR_Bcast_binomial(220).........: Failure during collective

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1 at mic0.local] HYDU_sock_write (./utils/sock/sock.c:291): write
error (Broken pipe)
[proxy:0:1 at mic0.local] stdoe_cb (./pm/pmiserv/pmip_cb.c:63): sock write
error
[proxy:0:1 at mic0.local] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1 at mic0.local] main (./pm/pmiserv/pmip.c:206): demux engine error
waiting for event
[mpiexec at rt5] HYDT_bscu_wait_for_completion
(./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at rt5] HYDT_bsci_wait_for_completion
(./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec at rt5] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
completion
[mpiexec at rt5] main (./ui/mpich/mpiexec.c:331): process manager error
waiting for completion

Here's the host file (unchanged between runs):

host:4
mic0:4 binding=user:4,8,12,16
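
(A tiny placement check like the sketch below -- my own addition, not part of
osu_benchmarks -- confirms which host each rank lands on, i.e. the expected
ranks 0-3 on the host and 4-7 on mic0. Build it once with icc for the host
and once with icc -mmic for the card, then launch it with the same mpiexec
line as above.

/* Placement check: print which processor each MPI rank landed on. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &len);
    printf("rank %d on %s\n", rank, name);
    MPI_Finalize();
    return 0;
}
)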