[mpich-discuss] Missing Nemesis ib
Halim Amer
aamer at anl.gov
Thu Sep 21 09:42:35 CDT 2017
You need to launch two instances of the binary, one as a server and the
other as a client. Please read the MXM user guide:
http://www.mellanox.com/related-docs/prod_acceleration_software/Mellanox_MXM_User_Manual_v2.1.pdf.
As I said before, I suggest you forward your InfiniBand issues to your
admin or to Mellanox support. There is little we can do on the
MPICH team, since this is not an MPICH problem. Get back to
us if MPICH is still failing once you have a *working* MXM stack.
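
A likely reading of the "bind() failed: Address already in use" lines in
the quoted mpirun output: every copy started without arguments takes the
server role and tries to listen on the same port, so only the first one
succeeds. The sketch below is not mxm_perftest code; it only reproduces
that failure mode with plain TCP sockets in Python. The helper name
try_bind and the OS-chosen loopback port are illustrative assumptions.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to listen on (host, port).

    Returns (socket, None) on success or (None, errno_code) on failure.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        s.listen(1)
        return s, None
    except OSError as exc:
        s.close()
        return None, exc.errno


# First "server": port 0 asks the OS for any free port; remember which one.
first, err = try_bind(0)
port = first.getsockname()[1]

# Second "server" on the same port: rejected while the first is listening.
second, err2 = try_bind(port)
print("second bind failed with:", errno.errorcode.get(err2))

first.close()
```

This is why a single server instance plus a separate client instance is
needed, rather than many identical copies launched through mpirun.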
Halim
www.mcs.anl.gov/~aamer
On 9/21/17 3:04 AM, Jason Collins wrote:
> I tried to run the test in this other way to get more information:
>
> # mpirun -n 10 ./mxm_perftest
>
> --------------------------------------------------------------------------
> Failed to register memory region (MR):
>
> Hostname: compute1
> Address: 1d14000
> Length: 20480
> Error: No space left on device
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Open MPI has detected that there are UD-capable Verbs devices on your
> system, but none of them were able to be setup properly. This may
> indicate a problem on this system.
>
> You job will continue, but Open MPI will ignore the "ud" oob component
> in this run.
>
> Hostname: compute1
> --------------------------------------------------------------------------
> [1505980772.899531] [compute1:59182:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3599.84
> Waiting for connection...
> [1505980772.900947] [compute1:59183:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.902329] [compute1:59184:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.903490] [compute1:59185:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.904984] [compute1:59186:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.906288] [compute1:59187:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.907957] [compute1:59188:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3599.84
> [1505980772.909023] [compute1:59189:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3599.84
> bind() failed: Address already in use
> bind() failed: Address already in use
> bind() failed: Address already in use
> bind() failed: Address already in use
> bind() failed: Address already in use
> bind() failed: Address already in use
> bind() failed: Address already in use
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> bind() failed: Address already in use
> [1505980772.910503] [compute1:59190:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3599.84
> bind() failed: Address already in use
> [1505980772.911893] [compute1:59191:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3599.84
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
>
> Process name: [[55900,1],1]
> Exit code: 255
>
>
> On Thu, Sep 21, 2017 at 7:58, Jason Collins
> (<jasoncollinsw at gmail.com>) wrote:
>
> I ran the test and the result was the following:
>
> # ./mxm_perftest
>
> [1505976675.346380] [compute1:55801:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3600.52
>
> Waiting for connection...
>
>
> It does nothing else; it just keeps waiting for a connection.
>
> On Wed, Sep 20, 2017 at 17:12, Halim Amer (<aamer at anl.gov>) wrote:
>
> It seems you have a mismatch in the OFED stack. If you are currently
> using the bundled OFED stack, try installing the Mellanox OFED stack.
>
> Make sure MXM works before trying MPICH. Use mxm/bin/mxm_perftest
> from your MXM installation to test that things work properly. If it
> doesn't work, then contact your admin or Mellanox, because it is not an
> MPICH problem.
>
> Halim
> www.mcs.anl.gov/~aamer
>
> On 9/19/17 7:14 AM, Jason Collins wrote:
> > Thank you very much.
> >
> > I have compiled with "CH3:nemesis:mxm". The compilation was successful.
> >
> > Now I have a new problem. I ran the "./cpi" test and got the
> > following error.
> >
> > # mpiexec -f hosts -n 4 ./cpi
> > [1505822776.546898] [compute1:16212:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3459.84
> > [1505822776.546898] [compute1:16213:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3459.84
> > [1505822776.546951] [compute1:16216:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3459.84
> > [1505822776.547039] [compute1:16214:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 3459.84
> > [1505822776.561357] [compute1:16214:0] ib_dev.c:533 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
> > [1505822776.561371] [compute1:16214:0] ib_dev.c:544 MXM ERROR ibv_query_device() returned 38: Function not implemented
> > [1505822776.561386] [compute1:16218:0] ib_dev.c:533 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
> > [1505822776.561396] [compute1:16218:0] ib_dev.c:544 MXM ERROR ibv_query_device() returned 38: Function not implemented
> > [1505822776.561426] [compute1:16225:0] ib_dev.c:533 MXM WARN failed call to ibv_exp_use_priv_env(): Function not implemented
> > [1505822776.561442] [compute1:16225:0] ib_dev.c:544 MXM ERROR ibv_query_device() returned 38: Function not implemented
> > Fatal error in MPI_Init: Other MPI error, error stack:
> > MPIR_Init_thread(474).........:
> > MPID_Init(190)................: channel initialization failed
> > MPIDI_CH3_Init(89)............:
> > MPID_nem_init(320)............:
> > MPID_nem_mxm_init(158)........:
> > MPID_nem_mxm_get_ordering(464): mxm_init failed (Input/output error)
> >
> >
> > On Fri, Sep 15, 2017 at 16:01, Halim Amer (<aamer at anl.gov>) wrote:
> >
> > The "nemesis:ib" netmod does not exist anymore. Try "ch3:nemesis:mxm",
> > which depends on Mellanox's MXM library (it can be obtained from the
> > HPCX package at www.mellanox.com/products/hpcx), or "ch3:nemesis:ofi",
> > which depends on libfabric (which would be built to support the IB
> > or MXM providers; see https://ofiwg.github.io/libfabric/).
> >
> > Halim
> > www.mcs.anl.gov/~aamer
> >
> > On 9/15/17 4:20 AM, Jason Collins wrote:
> > > Hello everyone.
> > >
> > > Recently, I downloaded MPICH 3.2.
> > >
> > > I want to configure it with InfiniBand support. I ran the following
> > > command:
> > >
> > > # ./configure --prefix=/my/path --with-device=ch3:nemesis:ib
> > >
> > > And I get the following error:
> > >
> > > configure: error: Network module ib is unknown
> > > "./src/mpid/ch3/channels/nemesis/netmod/ib"
> > >
> > > When I check the path, I can confirm that the "netmod" folder does
> > > not contain an "ib" folder. How can this be solved?
> > >
> > > Many thanks.
> > >
> > >
> > >
> > > _______________________________________________
> > > discuss mailing list discuss at mpich.org
> > > To manage subscription options or unsubscribe:
> > > https://lists.mpich.org/mailman/listinfo/discuss
> > >