<div dir="ltr">I'm trying to get MPICH (3.0.3) and SCIF working (on an Intel PHI card.)<div><br></div><div>I'm using the tests from osu_benchmarks(from mvapich2 tarball) as a set of sanity checks, and I'm running into some unexpected errors.</div>
<div><br></div><div>One example: running osu_mbw_mr works sometimes, and then fail on the next try. The printout from two successive runs as well as the hosts file are below. I especially like the "scif_scif_read failed with error 'Success'" message. :) </div>
<div><br></div><div>Any thoughts? Or is this something to take up with Intel? Compiler is latest (13.1.1) icc; latest MPSS (2-2.1.5889-14); Centos 6.4.<br></div><div><br></div><div>Thanks,</div><div> Eric</div><div><br></div>
<div><div>This particular test should be setting up four pairs of processes, each with one element on the host, and one on the PHI, and try to communicate between the two (node 0<->4, ...) Things seem to be more stable with one or two pairs of processes, but that's not really the desired use case...</div>
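For reference, here's a minimal sketch of the pairing pattern I mean; it is not the actual osu_mbw_mr source, and PAIRS, MSG_SIZE, and the single send/recv exchange are just illustrative assumptions:

/* Minimal sketch of the pairing pattern described above -- NOT the actual
 * osu_mbw_mr source. Assumes 2*PAIRS ranks launched as above (ranks 0..3 on
 * the host, 4..7 on the MIC); rank r pairs with rank r + PAIRS. PAIRS and
 * MSG_SIZE are illustrative values only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAIRS    4
#define MSG_SIZE (1 << 20)   /* 1 MiB per message, just for illustration */

int main(int argc, char **argv)
{
    int rank, size;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != 2 * PAIRS) {
        if (rank == 0)
            fprintf(stderr, "run with %d ranks\n", 2 * PAIRS);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    buf = malloc(MSG_SIZE);
    memset(buf, rank, MSG_SIZE);

    if (rank < PAIRS) {
        /* host side: push data to partner rank + PAIRS on the card */
        MPI_Send(buf, MSG_SIZE, MPI_CHAR, rank + PAIRS, 0, MPI_COMM_WORLD);
    } else {
        /* card side: receive from partner rank - PAIRS on the host */
        MPI_Recv(buf, MSG_SIZE, MPI_CHAR, rank - PAIRS, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    /* The failing run below dies in an MPI_Barrier, per the error stack. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("all %d pairs completed\n", PAIRS);

    free(buf);
    MPI_Finalize();
    return 0;
}

Launched the same way as the benchmark (mpiexec -map-by rr -n 4 host_binary : -n 4 mic_binary), it exercises the same host <-> card SCIF path.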
[eborisch@rt5 osu_benchmarks]$ mpiexec -map-by rr -n 4 native/osu_mbw_mr : -n 4 mic/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test
# [ pairs: 4 ] [ window size: 64 ]
# Size        MB/s    Messages/s
1             0.23     230138.66
2             0.46     229579.06
4             0.92     231201.97
8             1.85     231515.16
16            3.63     226781.39
32            6.92     216285.18
64           12.65     197678.16
128          25.21     196946.20
256          50.46     197106.54
512          86.11     168184.28
1024        132.69     129577.13
2048        180.60      88183.67
4096        179.81      43898.89
8192        358.07      43710.21
16384       696.33      42500.74
32768      1364.41      41638.46
65536      2737.42      41769.74
131072     4657.86      35536.68
262144     6160.59      23500.77
524288     6584.39      12558.73
1048576    6690.91       6380.95
2097152    6782.58       3234.18
4194304    6789.25       1618.68   (Note: seems ~reasonable for pushing data one direction over 16x PCIe 2.0)
[eborisch@rt5 osu_benchmarks]$ mpiexec -map-by rr -n 4 native/osu_mbw_mr : -n 4 mic/osu_mbw_mr
# OSU MPI Multiple Bandwidth / Message Rate Test
# [ pairs: 4 ] [ window size: 64 ]
# Size        MB/s    Messages/s
 0: 5: 00000051: 00000060: readv err 0
 0: 5: 00000052: 00000060: readv err 0
 0: 5: 00000053: 00000060: readv err 0
Fatal error in PMPI_Barrier: Other MPI error, error stack:
PMPI_Barrier(426)................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(283)...........:
MPIR_Barrier_or_coll_fn(121).....:
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read failed with error 'Success')
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read failed with error 'Success')
MPIR_Barrier_impl(294)...........:
MPIR_Barrier_or_coll_fn(121).....:
MPIR_Barrier_intra(83)...........:
MPIC_Sendrecv(209)...............:
MPIC_Wait(563)...................:
MPIDI_CH3I_Progress(367).........:
MPID_nem_mpich_blocking_recv(894):
state_commrdy_handler(175).......:
state_commrdy_handler(138).......:
MPID_nem_scif_recv_handler(115)..: Communication error with rank 5
MPID_nem_scif_recv_handler(35)...: scif_scif_read failed (scif_scif_read failed with error 'Success')
MPIR_Barrier_impl(308)...........:
MPIR_Bcast_impl(1369)............:
MPIR_Bcast_intra(1199)...........:
MPIR_Bcast_binomial(220).........: Failure during collective

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@mic0.local] HYDU_sock_write (./utils/sock/sock.c:291): write error (Broken pipe)
[proxy:0:1@mic0.local] stdoe_cb (./pm/pmiserv/pmip_cb.c:63): sock write error
[proxy:0:1@mic0.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@mic0.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@rt5] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@rt5] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@rt5] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@rt5] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

Here's the host file (unchanged between runs):

host:4
mic0:4 binding=user:4,8,12,16