[mpich-discuss] Deadlock when using MPICH 3.1.1 and per-object critical sections on BG/Q

Halim halim.amer at gmail.com
Thu Jul 17 14:54:50 CDT 2014


Hi Michael,

Thanks. I created a ticket (#2132) on trac.mpich.org to track the problem.

Regards,
--Halim

On 2014-06-26 17:41, Michael Blocksome wrote:
> Halim,
>
> This problem sounds similar to another issue we are debugging related to
> cancel and multiple endpoints in per-object locking mode.  I'll try a few
> things and post status.
>
> Thanks,
>
> Michael Blocksome
> Parallel Environment MPI Middleware Team Lead, TCEM
> POWER, x86, and Blue Gene HPC Messaging
> blocksom at us.ibm.com
>
>
>
>
> From:   Halim <halim.amer at gmail.com>
> To:     discuss at mpich.org
> Date:   06/26/2014 02:31 PM
> Subject:        [mpich-discuss] Deadlock when using MPICH 3.1.1 and
> per-object critical sections on BG/Q
> Sent by:        discuss-bounces at mpich.org
>
>
>
> Hi,
>
> I have a specific issue that arises with MPICH (I use 3.1.1 built with
> gcc) + MPI_THREAD_MULTIPLE + per-object critical sections on BG/Q.
>
> A deadlock happens in the attached hybrid MPI+OpenMP example code with 2
> processes and more than one thread per process.
>
> Debugging shows that one process is stuck in MPI_Allreduce while the
> other is blocked in MPI_Finalize.
>
> A similar communication pattern occurs in my application, but there
> both processes are stuck in MPI_Allreduce.
>
> Note that the problem disappears when either MPI_Allreduce or the
> request cancellation sequence (cancel+wait+test_cancelled) is removed.
> Both operations can be dropped from this test without affecting the
> correctness of its result, but in my application both are necessary.
>
> In addition, using the global critical section (the default) results in
> correct execution.
>
> My configure line is as follows:
>
> ./configure --prefix=/home/aamer/usr --host=powerpc64-bgq-linux
> --with-device=pamid --with-file-system=gpfs:BGQ
> --with-file-system=bg+bglockless --with-atomic-primitives
> --enable-handle-allocation=tls --enable-refcount=lock-free
> --disable-predefined-refcount --disable-error-checking --without-timing
> --without-mpit-pvars --enable-fast=O3,ndebug --enable-thread-cs=per-object
>
> I would appreciate any advice on how to solve this issue.
>
> Regards,
> --Halim
> [attachment "allred_cancel.c" deleted by Michael Blocksome/Rochester/IBM]
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
