[mpich-discuss] MPICH 3.2 on BlueGene/Q

Dominic Chien chiensh.acrc at gmail.com
Mon Jan 11 02:45:00 CST 2016


Hi All,

Here is an update:

MPICH 3.1.3 is the last version that passes the nonblocking test, even without setting PAMID_THREAD_MULTIPLE. However, setting PAMID_ASYNC_PROGRESS=1 causes an error (Abort(1) on node 7: 'locking' async progress not applicable...).

[chiensh at cumulus coll.bak]$ which mpif90
~/apps/mpich/3.1.3/bin/mpif90
[chiensh at cumulus coll.bak]$ make nonblocking
  CC       nonblocking.o
  CCLD     nonblocking
[chiensh at cumulus coll.bak]$ srun -n 2 ./nonblocking
 No errors
[chiensh at cumulus coll.bak]$ srun -n 4 ./nonblocking
 No errors
[chiensh at cumulus coll.bak]$ srun -n 16 ./nonblocking
 No errors
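
For reference, the pattern the suite's nonblocking test exercises is roughly the minimal sketch below: post a nonblocking collective, complete it with MPI_Wait, and verify the result. (This is only an illustration, not the actual test source.)
==================================================================
/* Minimal sketch of a nonblocking-collective check (illustration only):
 * sum the ranks with MPI_Iallreduce and complete the request with MPI_Wait. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, in, out, errs = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    in = rank;
    MPI_Iallreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    /* expected: 0 + 1 + ... + (size-1) */
    if (out != size * (size - 1) / 2)
        errs++;

    if (rank == 0) {
        if (errs)
            printf(" Found %d errors\n", errs);
        else
            printf(" No errors\n");
    }

    MPI_Finalize();
    return 0;
}
==================================================================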

Thanks all!

Regards,
Dominic


On 11 Jan, 2016, at 2:29 pm, Dominic Chien <chiensh.acrc at gmail.com> wrote:

> Thank you Jeff and Halim,
> 
> Halim, I have tried 3.1.4, but it does not return 0 when the program finishes; the job is reported as terminating abnormally, e.g. for a hello-world program:
> ==================================================================
>   program hello
>   include 'mpif.h'
>   integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
> 
>   call MPI_INIT(ierror)
>   call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>   print*, 'node', rank, ': Hello world'
>   call MPI_FINALIZE(ierror)
>   end
> ==================================================================
> 
> Using MPICH 3.1.rc4
> ==================================================================
> [chiensh at cumulus test]$ which mpif90
> ~/apps/mpich/3.1.rc4/bin/mpif90
> [chiensh at cumulus test]$ srun -n 2 ./a.out
> node 1 : Hello world
> node 0 : Hello world
> [chiensh at cumulus test]$
> ==================================================================
> Using MPICH 3.1.4
> ==================================================================
> [chiensh at cumulus test]$ which mpif90
> ~/apps/mpich/3.1.4/bin/mpif90
> [chiensh at cumulus test]$ srun -n 2 ./a.out
> node 1 : Hello world
> node 0 : Hello world
> 2016-01-11 14:24:25.968 (WARN ) [0xfff7ef48b10] 75532:ibm.runjob.client.Job: terminated by signal 11
> 2016-01-11 14:24:25.968 (WARN ) [0xfff7ef48b10] 75532:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 1
> ==================================================================
> 
> 
> Jeff, after I set PAMID_THREAD_MULTIPLE=1 and PAMID_ASYNC_PROGRESS=1, there seems to be some "improvement": the nonblocking test sometimes runs with up to 4 processes, but sometimes it just "deadlocks" (see the two runs below):
> ==========================================================
> [chiensh at cumulus coll.bak]$ srun --nodes=4 --ntasks-per-node=1 nonblocking
> MPIDI_Process.*
>  verbose               : 2
>  statistics            : 1
>  contexts              : 32
>  async_progress        : 1
>  context_post          : 1
>  pt2pt.limits
>    application
>      eager
>        remote, local   : 4097, 4097
>      short
>        remote, local   : 113, 113
>    internal
>      eager
>        remote, local   : 4097, 4097
>      short
>        remote, local   : 113, 113
>  rma_pending           : 1000
>  shmem_pt2pt           : 1
>  disable_internal_eager_scale : 524288
>  optimized.collectives : 0
>  optimized.select_colls: 2
>  optimized.subcomms    : 1
>  optimized.memory      : 0
>  optimized.num_requests: 1
>  mpir_nbc              : 1
>  numTasks              : 4
> mpi thread level        : 'MPI_THREAD_SINGLE'
> MPIU_THREAD_GRANULARITY : 'per object'
> ASSERT_LEVEL            : 0
> MPICH_LIBDIR           : not defined
> The following MPICH_* environment variables were specified:
> The following PAMID_* environment variables were specified:
>  PAMID_STATISTICS=1
>  PAMID_ASYNC_PROGRESS=1
>  PAMID_THREAD_MULTIPLE=1
>  PAMID_VERBOSE=2
> The following PAMI_* environment variables were specified:
> The following COMMAGENT_* environment variables were specified:
> The following MUSPI_* environment variables were specified:
> The following BG_* environment variables were specified:
> No errors
> ==========================================================
> [chiensh at cumulus coll.bak]$ srun --nodes=4 --ntasks-per-node=1 nonblocking
> MPIDI_Process.*
>  verbose               : 2
>  statistics            : 1
>  contexts              : 32
>  async_progress        : 1
>  context_post          : 1
>  pt2pt.limits
>    application
>      eager
>        remote, local   : 4097, 4097
>      short
>        remote, local   : 113, 113
>    internal
>      eager
>        remote, local   : 4097, 4097
>      short
>        remote, local   : 113, 113
>  rma_pending           : 1000
>  shmem_pt2pt           : 1
>  disable_internal_eager_scale : 524288
>  optimized.collectives : 0
>  optimized.select_colls: 2
>  optimized.subcomms    : 1
>  optimized.memory      : 0
>  optimized.num_requests: 1
>  mpir_nbc              : 1
>  numTasks              : 4
> mpi thread level        : 'MPI_THREAD_SINGLE'
> MPIU_THREAD_GRANULARITY : 'per object'
> ASSERT_LEVEL            : 0
> MPICH_LIBDIR           : not defined
> The following MPICH_* environment variables were specified:
> The following PAMID_* environment variables were specified:
>  PAMID_STATISTICS=1
>  PAMID_ASYNC_PROGRESS=1
>  PAMID_THREAD_MULTIPLE=1
>  PAMID_VERBOSE=2
> The following PAMI_* environment variables were specified:
> The following COMMAGENT_* environment variables were specified:
> The following MUSPI_* environment variables were specified:
> The following BG_* environment variables were specified:
> (never returns from here)
> ==========================================================
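>
> Incidentally, the verbose output above still reports the mpi thread level as 'MPI_THREAD_SINGLE' even with PAMID_THREAD_MULTIPLE=1. A minimal way to see which thread level the library actually grants is the sketch below (my own illustration using the C bindings; I am assuming here that the 'locking' async progress mode wants MPI_THREAD_MULTIPLE):
> ==========================================================
> /* Sketch (illustration only): request MPI_THREAD_MULTIPLE and report
>  * the level the library provides. */
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     int provided, rank;
>
>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     if (rank == 0)
>         printf("requested level %d (MPI_THREAD_MULTIPLE), provided %d\n",
>                MPI_THREAD_MULTIPLE, provided);
>
>     MPI_Finalize();
>     return 0;
> }
> ==========================================================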
> 
> Thanks!
> 
> Regards,
> Dominic
> 
> 
> On 11 Jan, 2016, at 12:08 pm, Halim Amer <aamer at anl.gov> wrote:
> 
>> Dominic,
>> 
>> A number of fixes went into PAMID after v3.1rc4. You could try a release from the 3.1 series (i.e., anything from 3.1 through 3.1.4).
>> 
>> Regards,
>> --Halim
>> 
>> www.mcs.anl.gov/~aamer
> On 11 Jan, 2016, at 11:30 am, Jeff Hammond <jeff.science at gmail.com> wrote:
>> I recall that MPI-3 RMA on BGQ deadlocks if you set PAMID_THREAD_MULTIPLE (please see the ALCF MPI docs to verify the exact name), which is required for async progress.
>> 
>> The ARMCI-MPI test suite is one good way to validate that MPI-3 RMA is working.
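>>
>> For context, the pattern involved is passive-target RMA, roughly as in the minimal sketch below (my own illustration of the kind of operation exercised, not an actual ARMCI-MPI test):
>>
>> /* Illustration only: each rank puts its rank number into its right
>>  * neighbor's window with passive-target MPI-3 RMA, which relies on
>>  * progress being made at the target. */
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int rank, size, *buf, val, target;
>>     MPI_Win win;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>     MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
>>                      MPI_COMM_WORLD, &buf, &win);
>>
>>     target = (rank + 1) % size;
>>     MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
>>     MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
>>     MPI_Win_unlock(target, win);      /* put is complete at the target */
>>
>>     MPI_Barrier(MPI_COMM_WORLD);      /* every rank has received its put */
>>
>>     /* lock our own window so the remote update is visible locally */
>>     MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
>>     val = buf[0];
>>     MPI_Win_unlock(rank, win);
>>
>>     if (val != (rank + size - 1) % size)
>>         printf("rank %d: unexpected value %d\n", rank, val);
>>     else if (rank == 0)
>>         printf(" No errors\n");
>>
>>     MPI_Win_free(&win);
>>     MPI_Finalize();
>>     return 0;
>> }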
>> 
>> Jeff
> 

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

