[mpich-discuss] mpi hello-world error

Niyaz Murshed Niyaz.Murshed at arm.com
Mon Jun 17 01:07:15 CDT 2024


What is the best way to understand this log?


[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
Abort(680650255) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(665).......:
MPIDI_OFI_init_vcis(851).........:
check_num_nics(900)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(301)...............:
MPID_Isend(63)...................:
MPIDI_isend(35)..................:
(unknown)(): Other MPI error
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS

[proxy:0 at cesw-amp-gbt-2s-m12830-01] Sending upstream hdr.cmd = CMD_STDERR
Abort(680650255) on node 0: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(665).......:
MPIDI_OFI_init_vcis(851).........:
check_num_nics(900)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(301)...............:
MPID_Isend(63)...................:
MPIDI_isend(35)..................:
(unknown)(): Other MPI error
[proxy:0 at cesw-amp-gbt-2s-m12830-01] Sending upstream hdr.cmd = CMD_EXIT_STATUS
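In case it helps to narrow this down, this is roughly what I plan to run next to get more detail out of libfabric. The -genv option and FI_LOG_LEVEL are my assumptions about how to raise the provider's log level, not something taken from the log above:

# on each host, confirm what the tcp provider actually reports (addresses, domains)
fi_info -p tcp

# rerun the same job with libfabric debug logging enabled for both ranks
export FI_PROVIDER=tcp
mpirun -genv FI_LOG_LEVEL debug -verbose -n 2 -hosts <host1>,<host2> ./a.out   # same -hosts list as before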



From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Saturday, June 15, 2024 at 10:53 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: Re: [mpich-discuss] mpi hello-world error
I am also seeing this error sometimes.


root at dpr740:/mpich/examples# export FI_PROVIDER=tcp
root at dpr740:/mpich/examples# mpirun  -verbose -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out
host: 10.118.91.158
host: 10.118.91.159
[mpiexec at dpr740] Timeout set to -1 (-1 means infinite)

==================================================================================================
mpiexec options:
----------------
  Base path: /opt/mpich/bin/
  Launcher: (null)
  Debug level: 1
  Enable X: -1

  Global environment:
  -------------------
    PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig
    HOSTNAME=dpr740
    HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233
    PWD=/mpich/examples
    HOME=/root
    FI_PROVIDER=tcp
    LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
    LESSCLOSE=/usr/bin/lesspipe %s %s
    TERM=xterm
    LESSOPEN=| /usr/bin/lesspipe %s
    SHLVL=1
    LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin
    OLDPWD=/
    _=/opt/mpich/bin/mpirun

  Hydra internal environment:
  ---------------------------
    GFORTRAN_UNBUFFERED_PRECONNECTED=y


    Proxy information:
    *********************
      [1] proxy: 10.118.91.158 (1 cores)
      Exec list: ./a.out (1 processes);

      [2] proxy: 10.118.91.159 (1 cores)
      Exec list: ./a.out (1 processes);


==================================================================================================


Proxy launch args: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id

Arguments being passed to proxy 0:
--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.158 --global-core-map 0,1,2 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_1151_0_1450155337_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 15 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'FI_PROVIDER=tcp' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' 'OLDPWD=/' '_=/opt/mpich/bin/mpirun' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out

Arguments being passed to proxy 1:
--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.159 --global-core-map 0,1,2 --pmi-id-map 0,1 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_1151_0_1450155337_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 15 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'FI_PROVIDER=tcp' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' 'OLDPWD=/' '_=/opt/mpich/bin/mpirun' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out

[mpiexec at dpr740] Launch arguments: /usr/bin/ssh -x -p 2233 10.118.91.158 "/opt/mpich/bin/hydra_pmi_proxy" --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0
[mpiexec at dpr740] Launch arguments: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 1
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=init pmi_version=1 pmi_subversion=1
[proxy:1 at dpr740] Sending PMI command:
    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=get_maxes
[proxy:1 at dpr740] Sending PMI command:
    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=get_appnum
[proxy:1 at dpr740] Sending PMI command:
    cmd=appnum rc=0 appnum=0
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=get_my_kvsname
[proxy:1 at dpr740] Sending PMI command:
    cmd=my_kvsname rc=0 kvsname=kvs_1151_0_1450155337_dpr740
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_process_mapping
[proxy:1 at dpr740] Sending PMI command:
    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_hwloc_xmlfile
[proxy:1 at dpr740] Sending PMI command:
    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_YPoAhr found=TRUE
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:1 at dpr740] Sending upstream internal PMI command:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
    cmd=get_result rc=1
[proxy:1 at dpr740] we don't understand the response get_result; forwarding downstream
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=barrier_in
[proxy:1 at dpr740] Sending upstream internal PMI command:
    cmd=barrier_in
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=init pmi_version=1 pmi_subversion=1
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=get_maxes
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=get_appnum
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=appnum rc=0 appnum=0
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=get_my_kvsname
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=my_kvsname rc=0 kvsname=kvs_1151_0_1450155337_dpr740
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_process_mapping
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_hwloc_xmlfile
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_68iqm3 found=TRUE
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds

[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
    cmd=get_result rc=1
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at ampere-altra-2-1] we don't understand the response get_result; forwarding downstream
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in

[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
    cmd=barrier_out
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
    cmd=barrier_out
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=barrier_in
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
    cmd=barrier_in
[proxy:0 at ampere-altra-2-1] [proxy:1 at dpr740] Sending PMI command:
    cmd=barrier_out
Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=barrier_out
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=put kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0 value=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] cached command: -allgather-shm-1-0=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=put_result rc=0
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=barrier_in
[proxy:0 at ampere-altra-2-1] flushing 1 put command(s) out
[proxy:0 at ampere-altra-2-1] forwarding command upstream:
cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
    cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
    cmd=barrier_in
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=put kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] cached command: -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending PMI command:
    cmd=put_result rc=0
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=barrier_in
[proxy:1 at dpr740] flushing 1 put command(s) out
[proxy:1 at dpr740] forwarding command upstream:
cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending upstream internal PMI command:
    cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at dpr740] Sending upstream internal PMI command:
    cmd=barrier_in
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in

[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
    cmd=keyval_cache -allgather-shm-1-0=0200937DC0A80101[8] -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
    cmd=keyval_cache -allgather-shm-1-0=0200937DC0A80101[8] -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
    cmd=barrier_out
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
    cmd=barrier_out
[proxy:1 at dpr740] Sending PMI command:
    cmd=barrier_out
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0
[proxy:1 at dpr740] Sending PMI command:
    cmd=get_result rc=0 value=0200937DC0A80101[8] found=TRUE
[proxy:1 at dpr740] got pmi command from downstream 0-0:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1
[proxy:1 at dpr740] Sending PMI command:
    cmd=get_result rc=0 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=barrier_out
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0
Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=get_result rc=0 value=0200937DC0A80101[8] found=TRUE
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1
[proxy:0 at ampere-altra-2-1] Sending PMI command:
    cmd=get_result rc=0 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE
Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR
/opt/mpich/lib/libmpi.so.0(+0x58c005) [0x7f967920c005]
/opt/mpich/lib/libmpi.so.0(+0x491858) [0x7f9679111858]
/opt/mpich/lib/libmpi.so.0(+0x55428c) [0x7f96791d428c]
/opt/mpich/lib/libmpi.so.0(+0x53402d) [0x7f96791b402d]
/opt/mpich/lib/libmpi.so.0(+0x4dc71f) [0x7f967915c71f]
/opt/mpich/lib/libmpi.so.0(+0x4df09a) [0x7f967915f09a]
/opt/mpich/lib/libmpi.so.0(+0x3deab6) [0x7f967905eab6]
/opt/mpich/lib/libmpi.so.0(+0x3e0732) [0x7f9679060732]
/opt/mpich/lib/libmpi.so.0(+0x3dd075) [0x7f967905d075]
/opt/mpich/lib/libmpi.so.0(+0x418215) [0x7f9679098215]
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR
/opt/mpich/lib/libmpi.so.0(+0x4188fa) [0x7f96790988fa]
/opt/mpich/lib/libmpi.so.0(MPI_Init+0x34) [0x7f9678d57594]
./a.out(+0x121a) [0x55b07f1cc21a]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f9678a7cd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f9678a7ce40]
./a.out(+0x1125) [0x55b07f1cc125]
Abort(1) on node 1: Internal error
/opt/mpich/lib/libmpi.so.0(+0x59a0fc) [0xffff91d0a0fc]
/opt/mpich/lib/libmpi.so.0(+0x4a6b58) [0xffff91c16b58]
/opt/mpich/lib/libmpi.so.0(+0x564740) [0xffff91cd4740]
/opt/mpich/lib/libmpi.so.0(+0x546c14) [0xffff91cb6c14]
/opt/mpich/lib/libmpi.so.0(+0x4f70cc) [0xffff91c670cc]
/opt/mpich/lib/libmpi.so.0(+0x4f9850) [0xffff91c69850]
/opt/mpich/lib/libmpi.so.0(+0x3ffd2c) [0xffff91b6fd2c]
/opt/mpich/lib/libmpi.so.0(+0x4017ec) [0xffff91b717ec]
/opt/mpich/lib/libmpi.so.0(+0x3fe384) [0xffff91b6e384]
/opt/mpich/lib/libmpi.so.0(+0x436a64) [0xffff91ba6a64]
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
/opt/mpich/lib/libmpi.so.0(+0x43700c) [0xffff91ba700c]
/opt/mpich/lib/libmpi.so.0(MPI_Init+0x44) [0xffff9189eeb4]
./a.out(+0x9c4) [0xaaaab5c709c4]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0xffff915e73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xffff915e74cc]
./a.out(+0x8b0) [0xaaaab5c708b0]
Abort(1) on node 0: Internal error
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_EXIT_STATUS
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS
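One thing I notice in the address values exchanged above is that one side publishes what looks like a link-local FE80 IPv6 address, which makes me wonder whether the tcp provider is binding to a different interface than the 10.118.91.x one I pass to -hosts. My next idea is to try pinning both Hydra and the provider to that interface; eth0 is just a placeholder for the real NIC name, and FI_TCP_IFACE is my assumption (fi_info -e should show whether this libfabric build honours it):

# pin Hydra's interface selection and the tcp provider to one NIC (names below are guesses)
export FI_PROVIDER=tcp
export FI_TCP_IFACE=eth0      # assumption: verify the variable name with `fi_info -e`
mpirun -iface eth0 -verbose -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out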


From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Saturday, June 15, 2024 at 12:10 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: [mpich-discuss] mpi hello-world error
Hello,

I am trying to run the example hellow.c between 2 servers.
I can run it on each host individually and it works fine.

10.118.91.158 is the machine I am running on.
10.118.91.159 is the remote machine.


root at dpr740:/mpich/examples# mpirun  -n 2 -hosts 10.118.91.158  ./a.out

Hello world from process 0 of 2

Hello world from process 1 of 2



root at dpr740:/mpich/examples# mpirun  -n 2 -hosts 10.118.91.159  ./a.out

Hello world from process 1 of 2

Hello world from process 0 of 2



However, when I try to run it across both hosts, I get the error below.

realloc(): invalid pointer



Is this a known issue? Any suggestions?
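Before digging further into mpirun itself, I also plan to sanity-check that the two machines can reach each other over the tcp provider directly. fabtests appears to be installed under /opt here, so something along these lines (exact options may need adjusting):

# on 10.118.91.159 (server side)
fi_pingpong -p tcp

# on 10.118.91.158 (client side)
fi_pingpong -p tcp 10.118.91.159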





root at dpr740:/mpich/examples# mpirun -verbose  -n 2 -hosts 10.118.91.159,10.118.91.158  ./a.out

host: 10.118.91.159

host: 10.118.91.158

[mpiexec at dpr740] Timeout set to -1 (-1 means infinite)



==================================================================================================

mpiexec options:

----------------

  Base path: /opt/mpich/bin/

  Launcher: (null)

  Debug level: 1

  Enable X: -1



  Global environment:

  -------------------

    PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig

    HOSTNAME=dpr740

    HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233

    PWD=/mpich/examples

    HOME=/root

    LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:

    LESSCLOSE=/usr/bin/lesspipe %s %s

    TERM=xterm

    LESSOPEN=| /usr/bin/lesspipe %s

    SHLVL=1

    LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib

    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin

    _=/opt/mpich/bin/mpirun

    OLDPWD=/



  Hydra internal environment:

  ---------------------------

    GFORTRAN_UNBUFFERED_PRECONNECTED=y





    Proxy information:

    *********************

      [1] proxy: 10.118.91.159 (1 cores)

      Exec list: ./a.out (1 processes);



      [2] proxy: 10.118.91.158 (1 cores)

      Exec list: ./a.out (1 processes);





==================================================================================================





Proxy launch args: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id



Arguments being passed to proxy 0:

--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.159 --global-core-map 0,1,2 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_844_0_801938186_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 14 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' '_=/opt/mpich/bin/mpirun' 'OLDPWD=/' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out



Arguments being passed to proxy 1:

--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.158 --global-core-map 0,1,2 --pmi-id-map 0,1 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_844_0_801938186_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 14 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' '_=/opt/mpich/bin/mpirun' 'OLDPWD=/' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out



[mpiexec at dpr740] Launch arguments: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0

[mpiexec at dpr740] Launch arguments: /usr/bin/ssh -x -p 2233 10.118.91.158 "/opt/mpich/bin/hydra_pmi_proxy" --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 1

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PID_LIST

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=init pmi_version=1 pmi_subversion=1

[proxy:0 at dpr740] Sending PMI command:

    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get_maxes

[proxy:0 at dpr740] Sending PMI command:

    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get_appnum

[proxy:0 at dpr740] Sending PMI command:

    cmd=appnum rc=0 appnum=0

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get_my_kvsname

[proxy:0 at dpr740] Sending PMI command:

    cmd=my_kvsname rc=0 kvsname=kvs_844_0_801938186_dpr740

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_process_mapping

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_hwloc_xmlfile

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_CeNRJN found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=get_result rc=1

[proxy:0 at dpr740] we don't understand the response get_result; forwarding downstream

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PID_LIST

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=init pmi_version=1 pmi_subversion=1

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_maxes

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_appnum

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=appnum rc=0 appnum=0

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_my_kvsname

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=my_kvsname rc=0 kvsname=kvs_844_0_801938186_dpr740

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_process_mapping

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_hwloc_xmlfile

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_xv8EIG found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds



[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=get_result rc=1

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] we don't understand the response get_result; forwarding downstream

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=barrier_out

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=barrier_out

[proxy:0 at dpr740] Sending PMI command:

    cmd=barrier_out

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=barrier_out

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=put kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] [proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=put kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1 value=0200A8BFC0A80101[8]

cached command: -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] Sending PMI command:

    cmd=put_result rc=0

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:0 at dpr740] flushing 1 put command(s) out

[proxy:0 at dpr740] forwarding command upstream:

cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] cached command: -allgather-shm-1-1=0200A8BFC0A80101[8]

[proxy:1 at ampere-altra-2-1] Sending PMI command:

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]



    cmd=put_result rc=0

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]



[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] flushing 1 put command(s) out

[proxy:1 at ampere-altra-2-1] forwarding command upstream:

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=keyval_cache -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] -allgather-shm-1-1=0200A8BFC0A80101[8]

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=keyval_cache -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] -allgather-shm-1-1=0200A8BFC0A80101[8]

[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=barrier_out

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=barrier_out

[proxy:0 at dpr740] Sending PMI command:

    cmd=barrier_out

cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=0200A8BFC0A80101[8] found=TRUE

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=barrier_out

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=0200A8BFC0A80101[8] found=TRUE

realloc(): invalid pointer

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR



===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 2404 RUNNING AT 10.118.91.158

=   EXIT CODE: 134

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS

[proxy:0 at dpr740] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:484): assert (!closed) failed

[proxy:0 at dpr740] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status

[proxy:0 at dpr740] main (proxy/pmip.c:122): demux engine error waiting for event

[mpiexec at dpr740] HYDT_bscu_wait_for_completion (lib/tools/bootstrap/utils/bscu_wait.c:109): one of the processes terminated badly; aborting

[mpiexec at dpr740] HYDT_bsci_wait_for_completion (lib/tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion

[mpiexec at dpr740] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:189): launcher returned error waiting for completion

[mpiexec at dpr740] main (mpiexec/mpiexec.c:260): process manager error waiting for completion




Regards,
Niyaz