[mpich-discuss] MPICH 3.2 on BlueGene/Q
Dominic Chien
chiensh.acrc at gmail.com
Mon Jan 11 00:29:59 CST 2016
Thank you, Jeff and Halim.
Halim, I have tried 3.1.4, but it does not return 0 (it exits with an error) when the program finishes, e.g. for a hello-world program:
==================================================================
program hello
  include 'mpif.h'
  integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  print*, 'node', rank, ': Hello world'
  call MPI_FINALIZE(ierror)
end
==================================================================
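For completeness, a variant with explicit ierror checks (my own sketch, not the program I actually ran) would make it easier to see whether the MPI calls themselves report a failure:
==================================================================
program hello_checked
  implicit none
  include 'mpif.h'
  integer rank, nprocs, ierror

  call MPI_INIT(ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  print*, 'node', rank, ': Hello world'
  call MPI_FINALIZE(ierror)
  ! With the default MPI_ERRORS_ARE_FATAL handler a failing call would
  ! abort before reaching this line, so seeing "finalize ok" would
  ! suggest the signal 11 is raised after the user code has completed.
  if (ierror .eq. MPI_SUCCESS) print*, 'node', rank, ': finalize ok'
end
==================================================================
In both sessions below the hello-world output itself is identical; the difference only shows up in the runjob messages at termination.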
Using MPICH 3.1rc4:
==================================================================
[chiensh@cumulus test]$ which mpif90
~/apps/mpich/3.1.rc4/bin/mpif90
[chiensh@cumulus test]$ srun -n 2 ./a.out
node 1 : Hello world
node 0 : Hello world
[chiensh@cumulus test]$
==================================================================
Using MPICH 3.1.4:
==================================================================
[chiensh@cumulus test]$ which mpif90
~/apps/mpich/3.1.4/bin/mpif90
[chiensh@cumulus test]$ srun -n 2 ./a.out
node 1 : Hello world
node 0 : Hello world
2016-01-11 14:24:25.968 (WARN ) [0xfff7ef48b10] 75532:ibm.runjob.client.Job: terminated by signal 11
2016-01-11 14:24:25.968 (WARN ) [0xfff7ef48b10] 75532:ibm.runjob.client.Job: abnormal termination by signal 11 from rank 1
==================================================================
Jeff, after I set PAMID_THREAD_MULTIPLE=1 and PAMID_ASYNC_PROGRESS=1, there seems to be some improvement: the nonblocking test sometimes runs with up to 4 processes, but sometimes it simply deadlocks (see the two runs below; a sketch of the kind of nonblocking collective involved follows the output).
==========================================================
[chiensh@cumulus coll.bak]$ srun --nodes=4 --ntasks-per-node=1 nonblocking
MPIDI_Process.*
verbose : 2
statistics : 1
contexts : 32
async_progress : 1
context_post : 1
pt2pt.limits
application
eager
remote, local : 4097, 4097
short
remote, local : 113, 113
internal
eager
remote, local : 4097, 4097
short
remote, local : 113, 113
rma_pending : 1000
shmem_pt2pt : 1
disable_internal_eager_scale : 524288
optimized.collectives : 0
optimized.select_colls: 2
optimized.subcomms : 1
optimized.memory : 0
optimized.num_requests: 1
mpir_nbc : 1
numTasks : 4
mpi thread level : 'MPI_THREAD_SINGLE'
MPIU_THREAD_GRANULARITY : 'per object'
ASSERT_LEVEL : 0
MPICH_LIBDIR : not defined
The following MPICH_* environment variables were specified:
The following PAMID_* environment variables were specified:
PAMID_STATISTICS=1
PAMID_ASYNC_PROGRESS=1
PAMID_THREAD_MULTIPLE=1
PAMID_VERBOSE=2
The following PAMI_* environment variables were specified:
The following COMMAGENT_* environment variables were specified:
The following MUSPI_* environment variables were specified:
The following BG_* environment variables were specified:
No errors
==========================================================
[chiensh@cumulus coll.bak]$ srun --nodes=4 --ntasks-per-node=1 nonblocking
[... identical MPIDI_Process.* settings and environment variable listing as in the first run above ...]
(the job never returns from this point; it just hangs)
==========================================================
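For reference, the kind of nonblocking-collective pattern that test exercises is roughly the following (a minimal sketch of my own, not the actual test source):
==========================================================
program nb_sketch
  implicit none
  include 'mpif.h'
  integer rank, nprocs, ierror, request
  integer status(MPI_STATUS_SIZE)
  integer sendval, recvval

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)

  ! Start a nonblocking reduction, then wait for it to complete.
  sendval = rank
  call MPI_IALLREDUCE(sendval, recvval, 1, MPI_INTEGER, MPI_SUM, &
                      MPI_COMM_WORLD, request, ierror)
  call MPI_WAIT(request, status, ierror)

  if (rank .eq. 0) print*, 'sum of ranks =', recvval
  call MPI_FINALIZE(ierror)
end
==========================================================
The second run above hangs after the PAMID_VERBOSE banner but never reaches the final "No errors" line.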
Thanks!
Regards,
Dominic
On 11 Jan, 2016, at 12:08 pm, Halim Amer <aamer at anl.gov> wrote:
> Dominic,
>
> There have been a bunch of fixes to PAMID since v3.1rc4. You could try a release from the 3.1 series (i.e. 3.1 through 3.1.4).
>
> Regards,
> --Halim
>
> www.mcs.anl.gov/~aamer
On 11 Jan, 2016, at 11:30 am, Jeff Hammond <jeff.science at gmail.com> wrote:
> I recall MPI-3 RMA on BGQ deadlocks if you set PAMID_THREAD_MULTIPLE (please see ALCF MPI docs to verify exact name), which is required for async progress.
>
> ARMCI-MPI test suite is one good way to validate MPI-3 RMA is working.
>
> Jeff
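For anyone who wants a quick check before pulling in ARMCI-MPI: a minimal MPI-3 RMA smoke test, written here as a Fortran sketch of my own using a simple fence epoch (it is not part of the ARMCI-MPI suite), could look like this:
==========================================================
program rma_sketch
  implicit none
  include 'mpif.h'
  integer rank, nprocs, ierror, win, neighbour
  integer localval, remoteval
  integer (kind=MPI_ADDRESS_KIND) winsize, disp

  call MPI_INIT(ierror)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)

  ! Expose one default integer per rank (assumes a 4-byte integer).
  localval = rank
  winsize = 4
  call MPI_WIN_CREATE(localval, winsize, 4, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, win, ierror)

  ! Each rank reads the value exposed by its right-hand neighbour
  ! inside a fence epoch.
  neighbour = mod(rank + 1, nprocs)
  disp = 0
  call MPI_WIN_FENCE(0, win, ierror)
  call MPI_GET(remoteval, 1, MPI_INTEGER, neighbour, disp, &
               1, MPI_INTEGER, win, ierror)
  call MPI_WIN_FENCE(0, win, ierror)

  print*, 'rank', rank, 'read', remoteval, 'from rank', neighbour

  call MPI_WIN_FREE(win, ierror)
  call MPI_FINALIZE(ierror)
end
==========================================================
If Jeff's recollection is right, running this with PAMID_THREAD_MULTIPLE=1 set would be the interesting case.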
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss