[mpich-discuss] Test of MPICH 3.1.3 on BlueGene/Q

Jeff Hammond jeff.science at gmail.com
Tue Jan 19 17:38:37 CST 2016


Because Blue Gene doesn't have fork() or any other OS mechanism for
spawning processes after job start, it has never had a nontrivial
implementation of MPI_Comm_spawn and thus has never passed the MPICH test
suite.  By nontrivial, I mean one that does something other than fail in a
compliant way because world_size = universe_size.  (I proposed that as a
trivial way to achieve MPI-2.2 compliance, but it may never have been
implemented.)

For Blue Gene/Q acceptance testing, every MPICH test passed except those
explicitly excluded.  (The tests came from a version of the test suite
circa 2012; I do not recall offhand which one.)  The exclusions were
anything related to dynamic processes, connect-accept, etc., and language
bindings (certainly Fortran; I don't know what we said about C++, but I
don't think that is relevant here).  Fortran was excluded because it has
nothing to do with the guts of MPI, the network, etc.  It is strictly a
test of the Fortran compiler and the Fortran bindings.  So if some MPICH
Fortran test is failing, it is either a compiler issue or a problem with
the MPICH Fortran bindings.
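
To make that triage concrete, here is a minimal Python sketch that sorts
failing lines from a report like the one quoted below into Fortran-binding
tests and everything else.  It is not part of the MPICH test suite; the
names FORTRAN_PREFIXES, FAILURE, and triage are made up for illustration,
and treating ./f77 and ./f90 as "the Fortran tests" is my assumption based
on the directories that appear in the report.

```python
import re

# Assumption: tests under these directories exercise the Fortran bindings.
FORTRAN_PREFIXES = ("./f77/", "./f90/")

# Matches failure lines such as "not ok 538 - ./f77/rma/wingetf 5",
# with or without leading mail-quote markers.
FAILURE = re.compile(r"not ok (\d+) - (\S+) (\d+)\s*$")


def triage(report_lines):
    """Return (fortran, other) lists of (test number, path, num procs)."""
    fortran, other = [], []
    for line in report_lines:
        match = FAILURE.search(line)
        if match is None:
            continue  # not a failure line
        entry = (int(match.group(1)), match.group(2), int(match.group(3)))
        if entry[1].startswith(FORTRAN_PREFIXES):
            fortran.append(entry)
        else:
            other.append(entry)
    return fortran, other


# The nine failure lines copied from the quoted report.
SAMPLE = [
    ">> not ok 283 - ./init/timeout 2",
    ">> not ok 324 - ./pt2pt/mprobe 2",
    ">> not ok 538 - ./f77/rma/wingetf 5",
    ">> not ok 640 - ./f90/rma/wingetf90 5",
    ">> not ok 668 - ./errors/pt2pt/truncmsg1 2",
    ">> not ok 670 - ./errors/pt2pt/errinstatts 2",
    ">> not ok 671 - ./errors/pt2pt/errinstatta 2",
    ">> not ok 672 - ./errors/pt2pt/errinstatws 2",
    ">> not ok 673 - ./errors/pt2pt/errinstatwa 2",
]

if __name__ == "__main__":
    fortran, other = triage(SAMPLE)
    print("Fortran-binding failures:", [path for _, path, _ in fortran])
    print("Other failures:", [path for _, path, _ in other])
```

Anything that lands in the first list points at the compiler or the
bindings; the second list is where the MPI library itself is the suspect.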

I hope this helps.

Jeff

On Tue, Jan 19, 2016 at 8:59 AM, Rob Latham <robl at mcs.anl.gov> wrote:

>
>
> On 01/17/2016 08:16 PM, Dominic Chien wrote:
>
>> Hi,
>>
>> I have built MPICH 3.1.3 on Blue Gene/Q with the following
>> configuration:
>> ../configure --host=powerpc64-bgq-linux --with-device=pamid:BGQ
>> --with-file-system=gpfs:BGQ
>> --with-bgq-install-dir=/bgsys/drivers/V1R2M0/ppc64 --disable-wrapper-rpath
>> --enable-fast=nochkmsg,notiming,O3 --with-assert-level=0
>> --disable-error-messages --disable-debuginfo --enable-thread-cs=per-object
>> --with-atomic-primitives --enable-handle-allocation=tls
>> --enable-refcount=lock-free --disable-predefined-refcount
>> --with-cross-file=src/mpid/pamid/cross/bgq8
>> --prefix=/scratch/home/chiensh/apps/mpich/3.1.3-opt/ --disable-spawn
>>
>> The resulting MPICH passed most of the tests (679), but 9 failed (see
>> below), and I am not sure whether these errors are critical.  Can anyone
>> comment on this?
>>
>>
> I don't know if MPICH ever passed 100% of the MPICH tests on Blue Gene
> (maybe back in the 1.5.1 days, but we had fewer tests then, too).
>
> These 9 errors all look like something an application might run into:
> probing messages, truncated messages, RMA via the Fortran interface,
> examining the status object.
>
> If your application does any of those things I'd pay particular attention
> to the results.  It's entirely possible that your application won't touch
> the parts of MPICH that are not fully up to spec on Blue Gene.
>
> So, I would say these errors are concerning, but not critical.  Press on
> and let us know how things go with your application!
>
> ==rob
>
>
>
>> Many Thanks!
>>
>> Regards,
>> Dominic Chien
>>
>> =========================================================
>> not ok 283 - ./init/timeout 2
>>    ---
>>    Directory: ./init
>>    File: timeout
>>    Num-procs: 2
>>    Date: "Wed Jan 13 13:55:50 2016"
>>    ...
>> ## Test output (expected 'No Errors'):
>> ## srun returned a zero status but the program returned a nonzero status
>> =========================================================
>> =========================================================
>> not ok 324 - ./pt2pt/mprobe 2
>>    ---
>>    Directory: ./pt2pt
>>    File: mprobe
>>    Num-procs: 2
>>    Date: "Wed Jan 13 14:07:47 2016"
>>    ...
>> ## Test output (expected 'No Errors'):
>> ## 2016-01-13 14:07:47.846 (WARN ) [0xfff8d988bb0] 78050:ibm.runjob.client.Job: terminated by signal 11
>> ## 2016-01-13 14:07:47.846 (WARN ) [0xfff8d988bb0] 78050:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 1
>> =========================================================
>> =========================================================
>> not ok 538 - ./f77/rma/wingetf 5
>>    ---
>>    Directory: ./f77/rma
>>    File: wingetf
>>    Num-procs: 5
>>    Date: "Wed Jan 13 16:28:10 2016"
>>    ...
>> ## Test output (expected 'No Errors'):
>> ##  0  buf( 1 , 11 ) =  751  expected  251
>> ##  0  buf( 2 , 11 ) =  752  expected  252
>> ##  0  buf( 3 , 11 ) =  753  expected  253
>> ##  0  buf( 4 , 11 ) =  754  expected  254
>> ##  0  buf( 5 , 11 ) =  755  expected  255
>> ##  0  buf( 6 , 11 ) =  756  expected  256
>> ##  0  buf( 7 , 11 ) =  757  expected  257
>> ##  0  buf( 8 , 11 ) =  758  expected  258
>> ##  0  buf( 9 , 11 ) =  759  expected  259
>> ##  0  buf( 10 , 11 ) =  760  expected  260
>> ##   Found  25  errors
>> =========================================================
>> =========================================================
>> not ok 640 - ./f90/rma/wingetf90 5
>>    ---
>>    Directory: ./f90/rma
>>    File: wingetf90
>>    Num-procs: 5
>>    Date: "Wed Jan 13 18:10:17 2016"
>>    ...
>> ## Test output (expected 'No Errors'):
>> ##  4  buf( 1 ,0) =  0  expected 976
>> ##  4  buf( 2 ,0) =  24525328  expected 977
>> ##  4  buf( 3 ,0) =  31  expected 978
>> ##  4  buf( 4 ,0) =  -1073759872  expected 979
>> ##  4  buf( 5 ,0) =  1107296292  expected 980
>> ##  4  buf( 6 ,0) =  -1073758504  expected 981
>> ##  4  buf( 7 ,0) =  0  expected 982
>> ##  4  buf( 8 ,0) =  22184620  expected 983
>> ##  4  buf( 9 ,0) =  0  expected 984
>> ##  4  buf( 10 ,0) =  25808064  expected 985
>> ##   Found  50  errors
>> =========================================================
>> =========================================================
>> not ok 668 - ./errors/pt2pt/truncmsg1 2
>>    ---
>>    Directory: ./errors/pt2pt
>>    File: truncmsg1
>>    Num-procs: 2
>>    Date: "Wed Jan 13 18:16:19 2016"
>>    ...
>> ## Test output (expected 'No Errors'):
>> ## MPI_Recv (short) returned MPI_SUCCESS instead of truncated message
>> ## MPI_Recv (irecv-short) returned MPI_SUCCESS instead of truncated message
>> ## MPI_Recv (medium) returned MPI_SUCCESS instead of truncated message
>> ##  Found 3 errors
>> =========================================================
>> =========================================================
>> not ok 670 - ./errors/pt2pt/errinstatts 2
>>    ---
>>    Directory: ./errors/pt2pt
>>    File: errinstatts
>>    Num-procs: 2
>>    Date: "Wed Jan 13 18:16:45 2016"
>>    ...
>> ## Test output (expected 'No Errors'):
>> ## Did not get ERR_IN_STATUS in Testsome (outcount = 2, should equal 2); class returned was 0
>> ##  Found 1 errors
>> =========================================================
>> =========================================================
>> not ok 671 - ./errors/pt2pt/errinstatta 2
>>    ---
>>    Directory: ./errors/pt2pt
>>    File: errinstatta
>>    Num-procs: 2
>>    Date: "Wed Jan 13 18:16:58 2016"
>>    ...
>> ## Test output (expected 'No Errors'):
>> ## Did not get ERR_IN_STATUS in Testall
>> ##  Found 1 errors
>> =========================================================
>> =========================================================
>> not ok 672 - ./errors/pt2pt/errinstatws 2
>>    ---
>>    Directory: ./errors/pt2pt
>>    File: errinstatws
>>    Num-procs: 2
>>    Date: "Wed Jan 13 18:17:11 2016"
>>    ...
>> ## Test output (expected 'No Errors'):
>> ## Did not get ERR_IN_STATUS in Waitsome.  Got 0.
>> ##  Found 1 errors
>> =========================================================
>> =========================================================
>> not ok 673 - ./errors/pt2pt/errinstatwa 2
>>    ---
>>    Directory: ./errors/pt2pt
>>    File: errinstatwa
>>    Num-procs: 2
>>    Date: "Wed Jan 13 18:17:23 2016"
>>    ...
>> ## Test output (expected 'No Errors'):
>> ## Did not get ERR_IN_STATUS in Waitall
>> ##  Found 1 errors
>> =========================================================
>>
>>
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>>



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/