[mpich-discuss] Crash on MPI_Rput

Zhou, Hui zhouh at anl.gov
Fri Nov 4 16:09:51 CDT 2022


Hi Bruce,

Is the test suite available for us to checkout and test?

--
Hui
________________________________
From: Palmer, Bruce J via discuss <discuss at mpich.org>
Sent: Friday, November 4, 2022 4:03 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Subject: Re: [mpich-discuss] Crash on MPI_Rput


I kind of dropped this for a while, but I’d like to pick it back up. I did some more testing using different versions of MPICH and got the following results for the RMA runtime:



MPICH-3.1.4 configured with

./configure --prefix=/people/d3g293/mpich/mpich-3.1.4/install --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++



2/80 tests fail in GA test suite



MPICH-4.0.2 configured with

unset F90

./configure --prefix=/people/d3g293/mpich/mpich-4.0.2/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++



25/80 tests fail in GA test suite



Running with MPICH-3.3.2 seems to lead to around 8 failures, but my notes on this aren’t that good.



If I run with OpenMPI 4.1.4, everything passes. Any idea why I’m seeing this? I haven’t really done much to this runtime in the last few years.



Bruce



From: Palmer, Bruce J via discuss <discuss at mpich.org>
Date: Wednesday, September 28, 2022 at 12:30 PM
To: 'Thakur, Rajeev' <thakur at anl.gov>, discuss at mpich.org <discuss at mpich.org>, Zhou, Hui <zhouh at anl.gov>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Subject: Re: [mpich-discuss] Crash on MPI_Rput




I think the MPI-RMA runtime was mostly (maybe completely) working with 3.2-3.4. It may have even been working earlier with 4.0. I think there is a pretty good chance that the problem is a system configuration problem at our end and I was hoping that you might have some insight into what it might be based on the errors I’m seeing. I can try running with a few earlier versions of mpich and see if any of them work better.



Bruce



From: Thakur, Rajeev <thakur at anl.gov>
Sent: Wednesday, September 28, 2022 12:24 PM
To: discuss at mpich.org; Zhou, Hui <zhouh at anl.gov>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Subject: Re: [mpich-discuss] Crash on MPI_Rput



Was it working with an earlier version of MPICH? If so, which one?



Rajeev



From: "Palmer, Bruce J via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Wednesday, September 28, 2022 at 2:20 PM
To: "Zhou, Hui" <zhouh at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Cc: "Palmer, Bruce J" <Bruce.Palmer at pnnl.gov>
Subject: Re: [mpich-discuss] Crash on MPI_Rput



I upgraded to mpich-4.0.2 (latest stable release) and get pretty much the same result. This failure is reproducible: I get the same error on multiple runs, so it doesn’t look like an unexpected process failure.



One other detail that I forgot to mention earlier is that I’m running this test with 4 processes distributed over 2 nodes. If I run 4 processes on 1 node, the code runs without error.



Bruce



From: Zhou, Hui <zhouh at anl.gov>
Date: Tuesday, September 27, 2022 at 2:55 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Subject: Re: Crash on MPI_Rput

Hi Bruce,



> srun: error: node003: task 1: Exited with exit code 7



Looks like one of the processes crashed unexpectedly.



--
Hui Zhou





From: Palmer, Bruce J via discuss <discuss at mpich.org>
Date: Tuesday, September 27, 2022 at 3:32 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Subject: [mpich-discuss] Crash on MPI_Rput

Hi,



I’m testing the MPI-RMA runtime in Global Arrays and I’m getting a lot more crashes than I’ve seen in the past. The MPI-RMA runtime code is fairly stable and hasn’t been modified much recently, and all the tests used to pass using one of the more recent MPICH releases. However, I’m getting significant crashes at this point. One of them occurs in a program designed to test non-blocking communication. It creates an MPI window using MPI_Alloc_mem followed by MPI_Win_create, and then calls MPI_Win_lock_all on the window. The code currently crashes when it gets to an MPI_Rput call. I’m trying to see if there is something different in the environment that might be causing this.



I’m currently up to MPICH-4.0b1 configured with



./configure --prefix=/people/d3g293/mpich/mpich-4.0b1/install --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++

#./configure --prefix=/people/d3g293/mpich/mpich-3.4.1/install-newell-nocuda --with-device=ch4:ofi:sockets --with-libfabric=embedded --enable-threads=multiple --with-slurm CC=gcc CXX=g++



I’ve tried other recent vintages of MPICH, but I get similar results. The error I’m seeing when the program crashes is



[proxy:0:1 at node003.local] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:899): assert (!closed) failed

[proxy:0:1 at node003.local] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status

srun: error: node003: task 1: Exited with exit code 7

[proxy:0:1 at node003.local] main (pm/pmiserv/pmip.c:169): demux engine error waiting for event

[mpiexec at node002.local] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:74): one of the processes terminated badly; aborting

[mpiexec at node002.local] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion

[mpiexec at node002.local] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:179): launcher returned error waiting for completion

[mpiexec at node002.local] main (ui/mpich/mpiexec.c:325): process manager error waiting for completion
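
For triaging output like the above across many test runs, the srun line is the one that names the node, task, and exit code of the process that died; the Hydra proxy and mpiexec lines only report the fallout. Below is a minimal sketch in Python of pulling that information out of captured output. The helper name `find_failed_tasks` and the regular expression are illustrative assumptions based only on the single line format shown in this log. Slurm may print other forms (for example, ranges of tasks) that this sketch does not handle.

```python
import re

# Assumed format, taken from the one srun line in the log above:
#   "srun: error: node003: task 1: Exited with exit code 7"
SRUN_FAILURE = re.compile(
    r"^srun: error: (?P<node>\S+): task (?P<task>\d+): "
    r"Exited with exit code (?P<code>\d+)"
)


def find_failed_tasks(log_text):
    """Return a list of (node, task, exit_code) tuples found in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = SRUN_FAILURE.match(line.strip())
        if match:
            failures.append(
                (match["node"], int(match["task"]), int(match["code"]))
            )
    return failures


# A shortened copy of the output quoted in this message.
SAMPLE_LOG = """\
[proxy:0:1 at node003.local] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:899): assert (!closed) failed
srun: error: node003: task 1: Exited with exit code 7
[mpiexec at node002.local] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:74): one of the processes terminated badly; aborting
"""

if __name__ == "__main__":
    for node, task, code in find_failed_tasks(SAMPLE_LOG):
        print(f"task {task} on {node} exited with code {code}")
```

Run on the sample, this reports task 1 on node003 with exit code 7. Collecting these tuples over the whole test suite would show whether the failures are always on the remote node, which matters here since the same test passes when all 4 processes run on one node.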



Any suggestions about what might be going wrong here? It could be a problem with the machine configuration, since this code seemed to be running fine a while ago and has not been modified since then. I’ll try building the latest stable release and see if that fixes anything, but as I mentioned none of the recent releases seems to work.



Bruce Palmer

Computer Scientist

Pacific Northwest National Laboratory

(509) 375-3899

