[mpich-discuss] Assertion in MX netmod

"Antonio J. Peña" apenya at mcs.anl.gov
Mon Nov 24 14:43:43 CST 2014


On 11/24/2014 12:23 PM, Kuleshov Aleksey wrote:
> Thank you, Antonio, for this information. This is very sad, because:
> 1) In releases 3.1.2 and 3.1.3 (which are stable releases!) MPICH has a broken netmod?
Our policy is to keep unsupported code in our releases for a while on a 
best-effort basis, at least while it seems to be working and does not 
get in the way of our development. The reality is that we no longer 
have any hardware or specific funding to keep supporting this netmod.
> 2) The newmad netmod has the same routine as mx (the one that triggers the assertion), but newmad is still in 3.2a2 => MPICH still has broken code unless something was fixed in the subroutines?
Are you saying that you can reproduce the issue with the newmad netmod? 
If not, similar code paths do not necessarily mean that both netmods hit 
the same bug. We run extensive automated tests across multiple netmods, 
architectures, compilers, and build configurations. Unless the problem 
can be reproduced outside of MX, we can only conclude that the bug is 
specific to that netmod. If you confirm that you can reproduce the same 
bug with the newmad netmod, we will contact the external contributor who 
used to maintain it.
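
A quick way to check could be a sweep over process counts, along the 
lines of the sketch below. It rests on assumptions that are not confirmed 
in this thread: that your MPICH build also includes the newmad netmod and 
that "newmad" is accepted as a value of MPICH_NEMESIS_NETMOD. The hostfile 
and binary paths are the ones from your commands, and sweep_netmod is 
just an illustrative helper name.

```shell
#!/bin/sh
# Sketch: run one test binary under a given nemesis netmod for a range of
# process counts and report the first count that fails.
# MPIEXEC and HOSTFILE can be overridden; the defaults mirror the commands
# quoted in this thread.
MPIEXEC="${MPIEXEC:-mpiexec}"
HOSTFILE="${HOSTFILE:-/tmp/m}"

# Usage: sweep_netmod NETMOD BINARY MIN MAX
sweep_netmod() {
    netmod=$1
    binary=$2
    n=$3
    max=$4
    while [ "$n" -le "$max" ]; do
        # A non-zero exit status (e.g. after "internal ABORT") counts as a failure.
        if ! MPICH_NEMESIS_NETMOD="$netmod" "$MPIEXEC" -f "$HOSTFILE" -n "$n" "$binary" >/dev/null 2>&1; then
            echo "$netmod: first failure at -n $n"
            return 1
        fi
        n=$((n + 1))
    done
    echo "$netmod: no failure for -n $3..$max"
    return 0
}
```

For example, "sweep_netmod newmad /osu_alltoall 2 8" followed by 
"sweep_netmod mx /osu_alltoall 2 8" would show whether both netmods start 
failing at the same process count.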

   Antonio
>
> 24.11.2014, 18:09, "Antonio J. Peña" <apenya at mcs.anl.gov>:
>> Dear Kuleshov,
>>
>> In order to free up resources for more recent networking APIs, we
>> dropped support for the mx netmod, which in fact has been completely
>> removed in our most recent 3.2 releases. So, unfortunately, we are not
>> able to assist you with this issue.
>>
>> Best regards,
>>     Antonio
>>
>> On 11/22/2014 01:52 PM, Kuleshov Aleksey wrote:
>>>   And the same problem with a different approach:
>>>   I downloaded mpi2test.tar.gz from http://www.mcs.anl.gov/research/projects/mpi/mpi-test/tsuite.html, built it, and tried
>>>   to run the pingping test:
>>>>   MPITEST_VERBOSE=1 MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 2 /tests/pingping
>>>   [stdout]
>>>   Get new datatypes: send = MPI_INT, recv = MPI_INT
>>>   Get new datatypes: send = MPI_INT, recv = MPI_INT
>>>   Sending count = 1 of sendtype MPI_INT of total size 4 bytes
>>>   Sending count = 1 of sendtype MPI_INT of total size 4 bytes
>>>   Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE
>>>   Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE
>>>   Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes
>>>   Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes
>>>   Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT
>>>   Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes
>>>   Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT
>>>   Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes
>>>   Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT
>>>   Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT
>>>   Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes
>>>   Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes
>>>   Get new datatypes: send = int-vector, recv = MPI_INT
>>>   Sending count = 1 of sendtype int-vector of total size 4 bytes
>>>   Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_send.c at line 435: n_iov > 0
>>>   internal ABORT - process 0
>>>   [/stdout]
>>>
>>>   22.11.2014, 18:39, "Kuleshov Aleksey" <rndfax at yandex.ru>:
>>>>   Hello! Could you please help me with a problem?
>>>>
>>>>   I'm working on a custom myriexpress library and I'm using the MX netmod in MPICH v3.1.2.
>>>>   For testing purposes I built OSU Micro Benchmarks v3.8.
>>>>
>>>>   To run it on 7 nodes I execute the osu_alltoall test as follows:
>>>>>     MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 7 /osu_alltoall
>>>>   It passed successfully (I also tried it on 2, 3, 4, 5 and 6 nodes - everything is alright).
>>>>
>>>>   But now I want to run it on 8 nodes:
>>>>>     MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 8 /osu_alltoall
>>>>   [stdout]
>>>>   # OSU MPI All-to-All Personalized Exchange Latency Test v3.8
>>>>   # Size       Avg Latency(us)
>>>>   Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>>   Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>>   Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>>   Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>>   Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>>   Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>>   Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>>   Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0
>>>>   internal ABORT - process 4
>>>>   internal ABORT - process 7
>>>>   internal ABORT - process 2
>>>>   internal ABORT - process 6
>>>>   internal ABORT - process 3
>>>>   internal ABORT - process 0
>>>>   internal ABORT - process 5
>>>>   internal ABORT - process 1
>>>>   [/stdout]
>>>>
>>>>   So, what do these assertions mean?
>>>>   Is something wrong with the MX netmod?
>>>>   Or with the myriexpress library?
>>>>   Or with the osu_alltoall test itself?
>>>>
>>>>   BTW, osu_alltoall on 8 nodes passed successfully with the TCP netmod.
>>>>   _______________________________________________
>>>>   discuss mailing list     discuss at mpich.org
>>>>   To manage subscription options or unsubscribe:
>>>>   https://lists.mpich.org/mailman/listinfo/discuss
>> --
>> Antonio J. Peña
>> Postdoctoral Appointee
>> Mathematics and Computer Science Division
>> Argonne National Laboratory
>> 9700 South Cass Avenue, Bldg. 240, Of. 3148
>> Argonne, IL 60439-4847
>> apenya at mcs.anl.gov
>> www.mcs.anl.gov/~apenya
>>


