[mpich-discuss] Assertion in MX netmod

Kuleshov Aleksey rndfax at yandex.ru
Mon Nov 24 12:23:29 CST 2014


Thank you, Antonio, for this information. This is very sad, because:
1) releases 3.1.2 and 3.1.3 (which are stable releases!) ship MPICH with a broken netmod?
2) the newmad netmod uses the same routine as mx (the one that triggers the assertion), but newmad is still present in 3.2a2 => MPICH still carries the broken code, unless something was fixed in the subroutines?
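
As far as I can understand from the source, the assertion guards the iovec count produced when a message's datatype is flattened for the network layer. Schematically (my own simplification for illustration, not the actual mx_send.c/mx_poll.c code):

    #include <assert.h>
    #include <stddef.h>
    #include <sys/uio.h>

    /* The netmod flattens the MPI datatype into an iovec list before
     * handing it to MX, and asserts that the flattening produced at
     * least one entry.  "n_iov == 0" means the flattening step came
     * back empty, which should never happen for a real message. */
    static void post_message(const struct iovec *iov, size_t n_iov)
    {
        assert(n_iov > 0);  /* the check that fires in mx_send.c/mx_poll.c */
        /* ... pass (iov, n_iov) on to the MX send/receive path ... */
    }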

24.11.2014, 18:09, "Antonio J. Peña" <apenya at mcs.anl.gov>:
> Dear Kuleshov,
>
> In order to reallocate resources to more recent networking APIs, we
> dropped support for the mx netmod, which has in fact been completely
> removed in our most recent 3.2 releases. So, unfortunately, we are
> not able to assist you with this issue.
>
> Best regards,
>    Antonio
>
> On 11/22/2014 01:52 PM, Kuleshov Aleksey wrote:
>>  And the same problem with a different approach:
>>  I downloaded mpi2test.tar.gz from http://www.mcs.anl.gov/research/projects/mpi/mpi-test/tsuite.html, built it,
>>  and tried to run the pingping test:
>>>  MPITEST_VERBOSE=1 MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 2 /tests/pingping
>>  [stdout]
>>  Get new datatypes: send = MPI_INT, recv = MPI_INT
>>  Get new datatypes: send = MPI_INT, recv = MPI_INT
>>  Sending count = 1 of sendtype MPI_INT of total size 4 bytes
>>  Sending count = 1 of sendtype MPI_INT of total size 4 bytes
>>  Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE
>>  Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE
>>  Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes
>>  Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes
>>  Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT
>>  Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes
>>  Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT
>>  Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes
>>  Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT
>>  Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT
>>  Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes
>>  Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes
>>  Get new datatypes: send = int-vector, recv = MPI_INT
>>  Sending count = 1 of sendtype int-vector of total size 4 bytes
>>  Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_send.c at line 435: n_iov > 0
>>  internal ABORT - process 0
>>  [/stdout]
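>>
>>  The failure is triggered by the first non-contiguous datatype in the
>>  test ("int-vector"), so it can probably be reproduced without the
>>  test suite at all. A minimal sketch (the vector geometry below is my
>>  guess, not copied from the test's source):
>>
>>      /* Send one element of a non-contiguous "int-vector" type
>>       * between two ranks, mirroring the failing pingping step. */
>>      #include <mpi.h>
>>
>>      int main(int argc, char **argv)
>>      {
>>          int rank, buf[4] = {0, 0, 0, 0};
>>          MPI_Datatype vec;
>>
>>          MPI_Init(&argc, &argv);
>>          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>          /* 2 blocks of 1 int, stride 2 => a hole between blocks */
>>          MPI_Type_vector(2, 1, 2, MPI_INT, &vec);
>>          MPI_Type_commit(&vec);
>>
>>          if (rank == 0)
>>              MPI_Send(buf, 1, vec, 1, 0, MPI_COMM_WORLD);
>>          else if (rank == 1)
>>              MPI_Recv(buf, 1, vec, 0, 0, MPI_COMM_WORLD,
>>                       MPI_STATUS_IGNORE);
>>
>>          MPI_Type_free(&vec);
>>          MPI_Finalize();
>>          return 0;
>>      }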
>>
>>  22.11.2014, 18:39, "Kuleshov Aleksey" <rndfax at yandex.ru>:
>>>  Hello! Can you please help me with a problem?
>>>
>>>  I'm working on a custom myriexpress library, and I'm using the MX netmod in MPICH v3.1.2.
>>>  For testing purposes I built the OSU Micro-Benchmarks v3.8.
>>>
>>>  To run the osu_alltoall test on 7 nodes, I execute:
>>>>    MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 7 /osu_alltoall
>>>  It passed successfully (I also tried it on 2, 3, 4, 5, and 6 nodes - everything is fine).
>>>
>>>  But now I want to run it on 8 nodes:
>>>>    MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 8 /osu_alltoall
>>>  [stdout]
>>>  # OSU MPI All-to-All Personalized Exchange Latency Test v3.8
>>>  # Size       Avg Latency(us)
>>>  Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>  Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>  Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>  Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>  Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>  Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>  Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>  Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>  internal ABORT - process 4
>>>  internal ABORT - process 7
>>>  internal ABORT - process 2
>>>  internal ABORT - process 6
>>>  internal ABORT - process 3
>>>  internal ABORT - process 0
>>>  internal ABORT - process 5
>>>  internal ABORT - process 1
>>>  [/stdout]
>>>
>>>  So, what do these assertions mean?
>>>  Is something wrong with the MX netmod?
>>>  Or with the myriexpress library?
>>>  Or with the osu_alltoall test itself?
>>>
>>>  BTW, osu_alltoall on 8 nodes passed successfully with the TCP netmod.
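>>>
>>>  To rule out the benchmark itself, I can also try a bare MPI_Alltoall
>>>  smoke test along these lines (my own sketch, not part of OSU):
>>>
>>>      #include <mpi.h>
>>>      #include <stdio.h>
>>>      #include <stdlib.h>
>>>
>>>      int main(int argc, char **argv)
>>>      {
>>>          int rank, size, i;
>>>          MPI_Init(&argc, &argv);
>>>          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>          MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>
>>>          /* each rank sends one distinct int to every other rank */
>>>          int *sendbuf = malloc(size * sizeof(int));
>>>          int *recvbuf = malloc(size * sizeof(int));
>>>          for (i = 0; i < size; i++)
>>>              sendbuf[i] = rank * size + i;
>>>
>>>          MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT,
>>>                       MPI_COMM_WORLD);
>>>
>>>          /* recvbuf[i] must hold what rank i addressed to us */
>>>          for (i = 0; i < size; i++)
>>>              if (recvbuf[i] != i * size + rank)
>>>                  fprintf(stderr, "rank %d: bad value from %d\n",
>>>                          rank, i);
>>>
>>>          free(sendbuf); free(recvbuf);
>>>          MPI_Finalize();
>>>          return 0;
>>>      }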
>
> --
> Antonio J. Peña
> Postdoctoral Appointee
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 9700 South Cass Avenue, Bldg. 240, Of. 3148
> Argonne, IL 60439-4847
> apenya at mcs.anl.gov
> www.mcs.anl.gov/~apenya
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

