[mpich-discuss] Missing Nemesis ib

Halim Amer aamer at anl.gov
Thu Sep 21 09:42:35 CDT 2017


You need to launch two instances of the binary, one as a server and the 
other as a client. Please read the MXM user guide: 
http://www.mellanox.com/related-docs/prod_acceleration_software/Mellanox_MXM_User_Manual_v2.1.pdf.
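
As a rough illustration of the server/client pattern (a generic TCP sketch in Python, not mxm_perftest's actual code or options): the first instance binds a port and waits, and the second instance connects to it. If several instances on one host all start as servers on the same port, every one after the first fails with "Address already in use", which would match the bind() errors quoted below. The function names and the loopback address here are hypothetical and chosen only for this example.

```python
import errno
import socket


def start_server(port=0):
    """Bind and listen on 127.0.0.1:port; port 0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
    except OSError:
        srv.close()
        raise
    return srv


def second_server_fails(port):
    """Return True if another listener on the same port hits EADDRINUSE."""
    try:
        start_server(port).close()
    except OSError as exc:
        return exc.errno == errno.EADDRINUSE
    return False


def run_client(port):
    """Connect to the waiting server, which is what the second instance should do."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as cli:
        cli.sendall(b"ping")


if __name__ == "__main__":
    server = start_server()              # first instance: "Waiting for connection..."
    port = server.getsockname()[1]
    print("second server fails:", second_server_fails(port))
    run_client(port)                     # second instance: acts as the client
    conn, _ = server.accept()
    print("server received:", conn.recv(4))
    conn.close()
    server.close()
```

The point of the sketch is only that one listener plus one connecting peer is the working combination; starting many identical listeners on one host is not.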

As I said before, I suggest you forward your InfiniBand issues to your 
admin or to Mellanox support. There is little we can do on the MPICH 
team, since this is not an MPICH problem. Get back to us if MPICH still 
fails once you have a *working* MXM stack.

Halim
www.mcs.anl.gov/~aamer

On 9/21/17 3:04 AM, Jason Collins wrote:
> I tried running the test another way to get more information:
> 
> # mpirun -n 10 ./mxm_perftest
> --------------------------------------------------------------------------
> Failed to register memory region (MR):
>
> Hostname: compute1
> Address:  1d14000
> Length:   20480
> Error:    No space left on device
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly.  This may
> indicate a problem on this system.
>
> You job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
>
> Hostname: compute1
> --------------------------------------------------------------------------
> [1505980772.899531] [compute1:59182:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3599.84
> Waiting for connection...
> [1505980772.900947] [compute1:59183:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.902329] [compute1:59184:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.903490] [compute1:59185:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.904984] [compute1:59186:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.906288] [compute1:59187:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.907957] [compute1:59188:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.909023] [compute1:59189:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3599.84
> bind() failed: Address already in use
> bind() failed: Address already in use
> bind() failed: Address already in use
> bind() failed: Address already in use
> bind() failed: Address already in use
> bind() failed: Address already in use
> bind() failed: Address already in use
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> bind() failed: Address already in use
> [1505980772.910503] [compute1:59190:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3599.84
> bind() failed: Address already in use
> [1505980772.911893] [compute1:59191:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3599.84
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
>
>    Process name: [[55900,1],1]
>    Exit code:    255
> 
> 
> On Thu, Sep 21, 2017 at 7:58, Jason Collins
> (<jasoncollinsw at gmail.com>) wrote:
> 
>     I ran the test and the result was the following:
> 
>     # ./mxm_perftest
> 
>     [1505976675.346380] [compute1:55801:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3600.52
> 
>     Waiting for connection...
> 
> 
>     It does nothing else; it just keeps waiting for a connection.
> 
>     On Wed, Sep 20, 2017 at 17:12, Halim Amer (<aamer at anl.gov>) wrote:
> 
>         It seems you have a mismatch in the OFED stack. Try installing the
>         Mellanox OFED stack if you are currently using the bundled OFED
>         stack.
> 
>         Make sure MXM works before trying MPICH. Use mxm/bin/mxm_perftest
>         from your MXM installation to check that things work properly. If
>         it doesn't work, contact your admin or Mellanox, because it is not
>         an MPICH problem.
> 
>         Halim
>         www.mcs.anl.gov/~aamer <http://www.mcs.anl.gov/~aamer>
> 
>         On 9/19/17 7:14 AM, Jason Collins wrote:
>          > Thank you very much.
>          >
>          > I have compiled with "ch3:nemesis:mxm". The compilation was successful.
>          >
>          > Now I have a new problem. I ran the "./cpi" test and got the
>          > following error.
>          >
>          > # mpiexec -f hosts -n 4 ./cpi
>          > [1505822776.546898] [compute1:16212:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3459.84
>          > [1505822776.546898] [compute1:16213:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3459.84
>          > [1505822776.546951] [compute1:16216:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3459.84
>          > [1505822776.547039] [compute1:16214:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3459.84
>          > [1505822776.561357] [compute1:16214:0] ib_dev.c:533 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
>          > [1505822776.561371] [compute1:16214:0] ib_dev.c:544 MXM ERROR ibv_query_device() returned 38: Function not implemented
>          > [1505822776.561386] [compute1:16218:0] ib_dev.c:533 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
>          > [1505822776.561396] [compute1:16218:0] ib_dev.c:544 MXM ERROR ibv_query_device() returned 38: Function not implemented
>          > [1505822776.561426] [compute1:16225:0] ib_dev.c:533 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
>          > [1505822776.561442] [compute1:16225:0] ib_dev.c:544 MXM ERROR ibv_query_device() returned 38: Function not implemented
>          > Fatal error in MPI_Init: Other MPI error, error stack:
>          > MPIR_Init_thread(474).........:
>          > MPID_Init(190)................: channel initialization failed
>          > MPIDI_CH3_Init(89)............:
>          > MPID_nem_init(320)............:
>          > MPID_nem_mxm_init(158)........:
>          > MPID_nem_mxm_get_ordering(464): mxm_init failed (Input/output error)
>          >
>          >
>          > On Fri, Sep 15, 2017 at 16:01, Halim Amer (<aamer at anl.gov>) wrote:
>          >
>          >     The "nemesis:ib" netmod does not exist anymore. Try "ch3:nemesis:mxm"
>          >     with a dependency on Mellanox's MXM library (can be obtained from the
>          >     HPCX package at www.mellanox.com/products/hpcx) or "ch3:nemesis:ofi"
>          >     with a dependency on libfabric (which would be built to support the IB
>          >     or MXM providers; see https://ofiwg.github.io/libfabric/).
>          >
>          >     Halim
>          >     www.mcs.anl.gov/~aamer
>          >
>          >     On 9/15/17 4:20 AM, Jason Collins wrote:
>          >      > Hello everyone.
>          >      >
>          >      > Recently, I downloaded MPICH 3.2.
>          >      >
>          >      > I want to configure it with support for InfiniBand. I ran the
>          >      > following command:
>          >      >
>          >      > # ./configure --prefix=/my/path --with-device=ch3:nemesis:ib
>          >      >
>          >      > And I get the following error:
>          >      >
>          >      > configure: error: Network module ib is unknown
>          >      > "./src/mpid/ch3/channels/nemesis/netmod/ib"
>          >      >
>          >      > When I check the path, I can confirm that the "netmod" folder
>          >      > does not contain an "ib" folder. How can this be solved?
>          >      >
>          >      > Many thanks.
>          >      >
>          >      >
>          >      >
>          >      > _______________________________________________
>          >      > discuss mailing list discuss at mpich.org
>          >      > To manage subscription options or unsubscribe:
>          >      > https://lists.mpich.org/mailman/listinfo/discuss
>          >      >
> 

