From Eric.Chamberland at giref.ulaval.ca Thu Oct 2 08:33:20 2025
From: Eric.Chamberland at giref.ulaval.ca (Eric Chamberland)
Date: Thu, 2 Oct 2025 09:33:20 -0400
Subject: [mpich-discuss] Recommended configure options for MPICH 4.3.x with Valgrind or address-sanitizer
In-Reply-To:
References: <6336c8eb-729d-49a5-b446-55f0a12a287c@giref.ulaval.ca>
Message-ID: <27831a5c-d3ee-436f-ab7a-7bbb77117913@giref.ulaval.ca>

Hi,

I have been building MPICH with the following configure options for a
long time, mainly to keep my code "Valgrind-clean":

===
./configure \
  --enable-g=dbg,meminit \
  --with-device=ch3:sock \
  --enable-romio
===

This setup worked reasonably well in the past, but recently I've been
seeing occasional errors with address-sanitizer or Valgrind (with 4.3.0
on a single node) such as:

===
Fatal error in internal_Allreduce_c: Unknown error class, error stack:
internal_Allreduce_c(347)...................: MPI_Allreduce_c(sendbuf=0x7ffdeb0b8e90, recvbuf=0x7ffdeb0b8e98, count=1, dtype=0x4c00083a, MPI_SUM, comm=0x84000003) failed
MPIR_Allreduce_impl(4826)...................:
MPIR_Allreduce_allcomm_auto(4732)...........:
MPIR_Allreduce_intra_recursive_doubling(115):
MPIC_Sendrecv(266)..........................:
MPIC_Wait(90)...............................:
MPIR_Wait(751)..............................:
MPIR_Wait_state(708)........................:
MPIDI_CH3i_Progress_wait(187)...............: an error occurred while handling an event returned by MPIDI_CH3I_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(385)..:
MPIDI_CH3I_Socki_handle_read(3647)..........: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
===

Is CH3 considered legacy? I would also like to ask:

1. What are the recommended configure options in 2025 for building
   MPICH in a way that works well with Valgrind?
2. Is it preferable now to move to CH4 (e.g. ch4:ofi or ch4:shm) when
   debugging with Valgrind?
3.
   Are there any other options (besides --enable-g=dbg,meminit) that
   you would suggest for catching memory errors while keeping Valgrind
   reports as clean as possible?
4. Is https://github.com/pmodels/mpich/blob/main/doc/wiki/design/Support_for_Debugging_Memory_Allocation.md
   up to date?

Any guidance on the "best practice" configuration for this use case
would be greatly appreciated. The PETSc developers have some
debug-related options
(https://gitlab.com/petsc/petsc/-/blob/main/config/BuildSystem/config/packages/MPICH.py#L94)
but still use CH3 by default. However, Satish uses the configuration
described below, at least for the Valgrind CI.

Thanks a lot,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval

On 2025-10-02 08:13, Balay, Satish wrote:
> We currently use:
>
> balay at petsc-gpu-02:~$ /nfs/gce/projects/petsc/soft/u22.04/mpich-4.3.0-p2-ucx/bin/mpichversion
> MPICH Version:      4.3.0
> MPICH Release date: Mon Feb  3 09:09:47 AM CST 2025
> MPICH ABI:          17:0:5
> MPICH Device:       ch4:ucx
> MPICH configure:    --prefix=/nfs/gce/projects/petsc/soft/u22.04/mpich-4.3.0-p2-ucx
>   --enable-shared --with-device=ch4:ucx --with-pm=hydra --enable-fast=no
>   --enable-error-messages=all --enable-g=meminit --disable-java
>   --without-hwloc --disable-opencl --without-cuda --without-hip
> MPICH CC:           gcc     -O0
> MPICH CXX:          g++   -O0
> MPICH F77:          gfortran   -O0
> MPICH FC:           gfortran   -O0
> MPICH features:     threadcomm
>
> With:
>
>     # MPICH OFI/UCX/valgrind
>     export FI_PROVIDER=^psm3
>     export UCX_SYSV_HUGETLB_MODE=n
>     export UCX_LOG_LEVEL=error
>
> Satish
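For reference, a configure invocation combining the CH4 device with the memory-debugging flags mentioned in this thread might look like the sketch below. Every flag appears elsewhere in the thread (--enable-g=dbg,meminit and --enable-romio from Eric's build, --enable-fast=no and --enable-error-messages=all from Satish's, ch4:ofi from question 2); the install prefix is a placeholder, and flag availability should be checked against `./configure --help` for the MPICH release in use:

```shell
# Sketch of a CH4-based MPICH build aimed at clean Valgrind runs.
# All flags are taken from elsewhere in this thread; the prefix path
# is a placeholder, not a recommendation.
./configure \
  --prefix=$HOME/opt/mpich-debug \
  --with-device=ch4:ofi \
  --enable-g=dbg,meminit \
  --enable-fast=no \
  --enable-error-messages=all \
  --enable-romio
```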