[mpich-discuss] mpi hello-world error

Zhou, Hui zhouh at anl.gov
Mon Jun 17 09:41:50 CDT 2024


Niyaz,

I am quite lost on the errors you encountered. The three errors seem all over the place. Are the two hosts on the same local network?

--
Hui
________________________________
From: Niyaz Murshed via discuss <discuss at mpich.org>
Sent: Monday, June 17, 2024 1:07 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>; nd <nd at arm.com>
Subject: Re: [mpich-discuss] mpi hello-world error

What is the best way to understand this log?





[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR

Abort(680650255) on node 1: Fatal error in internal_Init: Other MPI error, error stack:

internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed

MPII_Init_thread(268)............:

MPIR_init_comm_world(34).........:

MPIR_Comm_commit(823)............:

MPID_Comm_commit_post_hook(222)..:

MPIDI_world_post_init(665).......:

MPIDI_OFI_init_vcis(851).........:

check_num_nics(900)..............:

MPIR_Allreduce_allcomm_auto(4726):

MPIC_Sendrecv(301)...............:

MPID_Isend(63)...................:

MPIDI_isend(35)..................:

(unknown)(): Other MPI error

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS



[proxy:0 at cesw-amp-gbt-2s-m12830-01] Sending upstream hdr.cmd = CMD_STDERR

Abort(680650255) on node 0: Fatal error in internal_Init: Other MPI error, error stack:

internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed

MPII_Init_thread(268)............:

MPIR_init_comm_world(34).........:

MPIR_Comm_commit(823)............:

MPID_Comm_commit_post_hook(222)..:

MPIDI_world_post_init(665).......:

MPIDI_OFI_init_vcis(851).........:

check_num_nics(900)..............:

MPIR_Allreduce_allcomm_auto(4726):

MPIC_Sendrecv(301)...............:

MPID_Isend(63)...................:

MPIDI_isend(35)..................:

(unknown)(): Other MPI error

[proxy:0 at cesw-amp-gbt-2s-m12830-01] Sending upstream hdr.cmd = CMD_EXIT_STATUS
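One way to make an error stack like the one above easier to read is to reduce it to the chain of function names and source line numbers. The following is only a minimal sketch: the regular expression and the helper name parse_error_stack are assumptions made for illustration and are not part of MPICH, and the sample text is a shortened copy of the stack quoted above. The frames appear to be printed from the outermost call (MPI_Init) down to the innermost one, so the last parsed frame is the likely place to start looking.

```python
import re

# Matches lines such as "internal_Init(70)......: message" or "MPID_Isend(63)...:".
# The pattern is an assumption derived from the log above, not an MPICH format spec.
STACK_LINE = re.compile(r"^(?P<func>[A-Za-z_]\w*)\((?P<line>\d+)\)\.*:\s*(?P<msg>.*)$")


def parse_error_stack(text):
    """Return (function, source_line, message) tuples in the order printed."""
    frames = []
    for raw in text.splitlines():
        match = STACK_LINE.match(raw.strip())
        if match:
            frames.append((match["func"], int(match["line"]), match["msg"]))
    return frames


# Shortened copy of the error stack from the log above.
SAMPLE = """\
Abort(680650255) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268)............:
check_num_nics(900)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPID_Isend(63)...................:
MPIDI_isend(35)..................:
(unknown)(): Other MPI error
"""

if __name__ == "__main__":
    for func, line, msg in parse_error_stack(SAMPLE):
        print(f"{func}:{line} {msg}".rstrip())
```

Read this way, the stack suggests that the failure happens while sending the first message of an allreduce issued from check_num_nics during MPI_Init, that is, on the first attempt to talk to the other host.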







From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Saturday, June 15, 2024 at 10:53 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: Re: [mpich-discuss] mpi hello-world error

I am also seeing this error sometimes:





root at dpr740:/mpich/examples# export FI_PROVIDER=tcp

root at dpr740:/mpich/examples# mpirun  -verbose -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out

host: 10.118.91.158

host: 10.118.91.159

[mpiexec at dpr740] Timeout set to -1 (-1 means infinite)



==================================================================================================

mpiexec options:

----------------

  Base path: /opt/mpich/bin/

  Launcher: (null)

  Debug level: 1

  Enable X: -1



  Global environment:

  -------------------

    PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig

    HOSTNAME=dpr740

    HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233

    PWD=/mpich/examples

    HOME=/root

    FI_PROVIDER=tcp

    LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:

    LESSCLOSE=/usr/bin/lesspipe %s %s

    TERM=xterm

    LESSOPEN=| /usr/bin/lesspipe %s

    SHLVL=1

    LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib

    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin

    OLDPWD=/

    _=/opt/mpich/bin/mpirun



  Hydra internal environment:

  ---------------------------

    GFORTRAN_UNBUFFERED_PRECONNECTED=y





    Proxy information:

    *********************

      [1] proxy: 10.118.91.158 (1 cores)

      Exec list: ./a.out (1 processes);



      [2] proxy: 10.118.91.159 (1 cores)

      Exec list: ./a.out (1 processes);





==================================================================================================





Proxy launch args: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id



Arguments being passed to proxy 0:

--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.158 --global-core-map 0,1,2 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_1151_0_1450155337_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 15 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'FI_PROVIDER=tcp' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 
'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' 'OLDPWD=/' '_=/opt/mpich/bin/mpirun' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out



Arguments being passed to proxy 1:

--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.159 --global-core-map 0,1,2 --pmi-id-map 0,1 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_1151_0_1450155337_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 15 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'FI_PROVIDER=tcp' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 
'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' 'OLDPWD=/' '_=/opt/mpich/bin/mpirun' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out



[mpiexec at dpr740] Launch arguments: /usr/bin/ssh -x -p 2233 10.118.91.158 "/opt/mpich/bin/hydra_pmi_proxy" --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0

[mpiexec at dpr740] Launch arguments: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 1

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PID_LIST

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=init pmi_version=1 pmi_subversion=1

[proxy:1 at dpr740] Sending PMI command:

    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get_maxes

[proxy:1 at dpr740] Sending PMI command:

    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get_appnum

[proxy:1 at dpr740] Sending PMI command:

    cmd=appnum rc=0 appnum=0

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get_my_kvsname

[proxy:1 at dpr740] Sending PMI command:

    cmd=my_kvsname rc=0 kvsname=kvs_1151_0_1450155337_dpr740

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_process_mapping

[proxy:1 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_hwloc_xmlfile

[proxy:1 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_YPoAhr found=TRUE

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:1 at dpr740] Sending upstream internal PMI command:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds



[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=get_result rc=1

[proxy:1 at dpr740] we don't understand the response get_result; forwarding downstream

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:1 at dpr740] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PID_LIST

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=init pmi_version=1 pmi_subversion=1

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_maxes

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_appnum

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=appnum rc=0 appnum=0

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_my_kvsname

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=my_kvsname rc=0 kvsname=kvs_1151_0_1450155337_dpr740

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_process_mapping

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_hwloc_xmlfile

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_68iqm3 found=TRUE

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=get_result rc=1

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:0 at ampere-altra-2-1] we don't understand the response get_result; forwarding downstream

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=barrier_out

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=barrier_out

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:0 at ampere-altra-2-1] [proxy:1 at dpr740] Sending PMI command:

    cmd=barrier_out

Sending upstream hdr.cmd = CMD_PMI

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=barrier_out

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=put kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0 value=0200937DC0A80101[8]

[proxy:0 at ampere-altra-2-1] cached command: -allgather-shm-1-0=0200937DC0A80101[8]

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=put_result rc=0

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]



[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:0 at ampere-altra-2-1] flushing 1 put command(s) out

[proxy:0 at ampere-altra-2-1] forwarding command upstream:

cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]

[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=put kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:1 at dpr740] cached command: -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:1 at dpr740] Sending PMI command:

    cmd=put_result rc=0

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:1 at dpr740] flushing 1 put command(s) out

[proxy:1 at dpr740] forwarding command upstream:

cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:1 at dpr740] Sending upstream internal PMI command:

    cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at dpr740] Sending upstream internal PMI command:

    cmd=barrier_in

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]



[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=keyval_cache -allgather-shm-1-0=0200937DC0A80101[8] -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=keyval_cache -allgather-shm-1-0=0200937DC0A80101[8] -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=barrier_out

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=barrier_out

[proxy:1 at dpr740] Sending PMI command:

    cmd=barrier_out

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0

[proxy:1 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=0200937DC0A80101[8] found=TRUE

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1

[proxy:1 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=barrier_out

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0

Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=0200937DC0A80101[8] found=TRUE

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE

Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR

/opt/mpich/lib/libmpi.so.0(+0x58c005) [0x7f967920c005]

/opt/mpich/lib/libmpi.so.0(+0x491858) [0x7f9679111858]

/opt/mpich/lib/libmpi.so.0(+0x55428c) [0x7f96791d428c]

/opt/mpich/lib/libmpi.so.0(+0x53402d) [0x7f96791b402d]

/opt/mpich/lib/libmpi.so.0(+0x4dc71f) [0x7f967915c71f]

/opt/mpich/lib/libmpi.so.0(+0x4df09a) [0x7f967915f09a]

/opt/mpich/lib/libmpi.so.0(+0x3deab6) [0x7f967905eab6]

/opt/mpich/lib/libmpi.so.0(+0x3e0732) [0x7f9679060732]

/opt/mpich/lib/libmpi.so.0(+0x3dd075) [0x7f967905d075]

/opt/mpich/lib/libmpi.so.0(+0x418215) [0x7f9679098215]

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR

/opt/mpich/lib/libmpi.so.0(+0x4188fa) [0x7f96790988fa]

/opt/mpich/lib/libmpi.so.0(MPI_Init+0x34) [0x7f9678d57594]

./a.out(+0x121a) [0x55b07f1cc21a]

/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f9678a7cd90]

/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f9678a7ce40]

./a.out(+0x1125) [0x55b07f1cc125]

Abort(1) on node 1: Internal error

/opt/mpich/lib/libmpi.so.0(+0x59a0fc) [0xffff91d0a0fc]

/opt/mpich/lib/libmpi.so.0(+0x4a6b58) [0xffff91c16b58]

/opt/mpich/lib/libmpi.so.0(+0x564740) [0xffff91cd4740]

/opt/mpich/lib/libmpi.so.0(+0x546c14) [0xffff91cb6c14]

/opt/mpich/lib/libmpi.so.0(+0x4f70cc) [0xffff91c670cc]

/opt/mpich/lib/libmpi.so.0(+0x4f9850) [0xffff91c69850]

/opt/mpich/lib/libmpi.so.0(+0x3ffd2c) [0xffff91b6fd2c]

/opt/mpich/lib/libmpi.so.0(+0x4017ec) [0xffff91b717ec]

/opt/mpich/lib/libmpi.so.0(+0x3fe384) [0xffff91b6e384]

/opt/mpich/lib/libmpi.so.0(+0x436a64) [0xffff91ba6a64]

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR

/opt/mpich/lib/libmpi.so.0(+0x43700c) [0xffff91ba700c]

/opt/mpich/lib/libmpi.so.0(MPI_Init+0x44) [0xffff9189eeb4]

./a.out(+0x9c4) [0xaaaab5c709c4]

/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0xffff915e73fc]

/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xffff915e74cc]

./a.out(+0x8b0) [0xaaaab5c708b0]

Abort(1) on node 0: Internal error

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_EXIT_STATUS

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS
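The -allgather-shm-1-0 and -allgather-shm-1-1 values exchanged in the log above look like the endpoint addresses that each process publishes before the assertion in init_addrxchg.c fires. The sketch below decodes them under two assumptions that the log itself does not confirm: that "[N]" stands for N zero bytes, and that the expanded bytes are a raw Linux sockaddr_in or sockaddr_in6 written by a little-endian host. The function names expand and decode_sockaddr are hypothetical helpers, not MPICH or libfabric APIs.

```python
import ipaddress
import re
import struct


def expand(encoded):
    """Expand hex text, ASSUMING "[N]" means a run of N zero bytes."""
    out = bytearray()
    for hex_run, zero_count in re.findall(r"([0-9A-Fa-f]+)|\[(\d+)\]", encoded):
        if hex_run:
            out += bytes.fromhex(hex_run)
        else:
            out += bytes(int(zero_count))
    return bytes(out)


def decode_sockaddr(encoded):
    """Interpret the expanded bytes as a Linux sockaddr (assumed layout)."""
    raw = expand(encoded)
    family = struct.unpack_from("<H", raw, 0)[0]  # host order, little-endian assumed
    port = struct.unpack_from(">H", raw, 2)[0]    # ports are in network byte order
    if family == 2 and len(raw) >= 8:             # AF_INET on Linux
        addr = ipaddress.IPv4Address(raw[4:8])
        return {"family": "AF_INET", "port": port, "addr": str(addr)}
    if family == 10 and len(raw) >= 28:           # AF_INET6 on Linux
        addr = ipaddress.IPv6Address(raw[8:24])
        scope_id = struct.unpack_from("<I", raw, 24)[0]
        return {"family": "AF_INET6", "port": port, "addr": str(addr),
                "scope_id": scope_id}
    raise ValueError(f"unrecognized address family {family}")


if __name__ == "__main__":
    # Values copied from the keyval_cache line in the log above.
    print(decode_sockaddr("0200937DC0A80101[8]"))
    print(decode_sockaddr("0A00B381[4]FE80[6]526B4BFFFEFC134208[3]"))
```

If this reading is correct, the process on ampere-altra-2-1 published the IPv4 address 192.168.1.1 and the process on dpr740 published the link-local IPv6 address fe80::526b:4bff:fefc:1342, and neither is one of the 10.118.91.x addresses given to mpirun. That may be worth checking against the interfaces configured on each host.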





From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Saturday, June 15, 2024 at 12:10 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: [mpich-discuss] mpi hello-world error

Hello,



I am trying to run the example hellow.c across 2 servers.

Running it on each server individually works fine.



10.118.91.158 is the machine I am running on.

10.118.91.159 is the remote machine.



root at dpr740:/mpich/examples# mpirun  -n 2 -hosts 10.118.91.158  ./a.out

Hello world from process 0 of 2

Hello world from process 1 of 2



root at dpr740:/mpich/examples# mpirun  -n 2 -hosts 10.118.91.159  ./a.out

Hello world from process 1 of 2

Hello world from process 0 of 2



However, when I try to run it across both servers, I get the error below:

realloc(): invalid pointer



Is this a known issue? Any suggestions?





root at dpr740:/mpich/examples# mpirun -verbose  -n 2 -hosts 10.118.91.159,10.118.91.158  ./a.out

host: 10.118.91.159

host: 10.118.91.158

[mpiexec at dpr740] Timeout set to -1 (-1 means infinite)



==================================================================================================

mpiexec options:

----------------

  Base path: /opt/mpich/bin/

  Launcher: (null)

  Debug level: 1

  Enable X: -1



  Global environment:

  -------------------

    PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig

    HOSTNAME=dpr740

    HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233

    PWD=/mpich/examples

    HOME=/root

    LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:

    LESSCLOSE=/usr/bin/lesspipe %s %s

    TERM=xterm

    LESSOPEN=| /usr/bin/lesspipe %s

    SHLVL=1

    LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib

    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin

    _=/opt/mpich/bin/mpirun

    OLDPWD=/



  Hydra internal environment:

  ---------------------------

    GFORTRAN_UNBUFFERED_PRECONNECTED=y





    Proxy information:

    *********************

      [1] proxy: 10.118.91.159 (1 cores)

      Exec list: ./a.out (1 processes);



      [2] proxy: 10.118.91.158 (1 cores)

      Exec list: ./a.out (1 processes);





==================================================================================================





Proxy launch args: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id



Arguments being passed to proxy 0:

--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.159 --global-core-map 0,1,2 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_844_0_801938186_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 14 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s 
%s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' '_=/opt/mpich/bin/mpirun' 'OLDPWD=/' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out



Arguments being passed to proxy 1:

--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.158 --global-core-map 0,1,2 --pmi-id-map 0,1 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_844_0_801938186_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 14 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s 
%s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' '_=/opt/mpich/bin/mpirun' 'OLDPWD=/' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out



[mpiexec at dpr740] Launch arguments: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0

[mpiexec at dpr740] Launch arguments: /usr/bin/ssh -x -p 2233 10.118.91.158 "/opt/mpich/bin/hydra_pmi_proxy" --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 1

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PID_LIST

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=init pmi_version=1 pmi_subversion=1

[proxy:0 at dpr740] Sending PMI command:

    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get_maxes

[proxy:0 at dpr740] Sending PMI command:

    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get_appnum

[proxy:0 at dpr740] Sending PMI command:

    cmd=appnum rc=0 appnum=0

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get_my_kvsname

[proxy:0 at dpr740] Sending PMI command:

    cmd=my_kvsname rc=0 kvsname=kvs_844_0_801938186_dpr740

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_process_mapping

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_hwloc_xmlfile

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_CeNRJN found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=get_result rc=1

[proxy:0 at dpr740] we don't understand the response get_result; forwarding downstream

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PID_LIST

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=init pmi_version=1 pmi_subversion=1

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_maxes

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_appnum

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=appnum rc=0 appnum=0

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_my_kvsname

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=my_kvsname rc=0 kvsname=kvs_844_0_801938186_dpr740

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_process_mapping

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_hwloc_xmlfile

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_xv8EIG found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds



[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=get_result rc=1

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] we don't understand the response get_result; forwarding downstream

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=barrier_out

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=barrier_out

[proxy:0 at dpr740] Sending PMI command:

    cmd=barrier_out

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=barrier_out

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=put kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] [proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=put kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1 value=0200A8BFC0A80101[8]

cached command: -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] Sending PMI command:

    cmd=put_result rc=0

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:0 at dpr740] flushing 1 put command(s) out

[proxy:0 at dpr740] forwarding command upstream:

cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] cached command: -allgather-shm-1-1=0200A8BFC0A80101[8]

[proxy:1 at ampere-altra-2-1] Sending PMI command:

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]



    cmd=put_result rc=0

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]



[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] flushing 1 put command(s) out

[proxy:1 at ampere-altra-2-1] forwarding command upstream:

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=keyval_cache -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] -allgather-shm-1-1=0200A8BFC0A80101[8]

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=keyval_cache -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] -allgather-shm-1-1=0200A8BFC0A80101[8]

[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=barrier_out

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=barrier_out

[proxy:0 at dpr740] Sending PMI command:

    cmd=barrier_out

cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=0200A8BFC0A80101[8] found=TRUE

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=barrier_out

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=0200A8BFC0A80101[8] found=TRUE

realloc(): invalid pointer

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR



===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 2404 RUNNING AT 10.118.91.158

=   EXIT CODE: 134

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS

[proxy:0 at dpr740] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:484): assert (!closed) failed

[proxy:0 at dpr740] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status

[proxy:0 at dpr740] main (proxy/pmip.c:122): demux engine error waiting for event

[mpiexec at dpr740] HYDT_bscu_wait_for_completion (lib/tools/bootstrap/utils/bscu_wait.c:109): one of the processes terminated badly; aborting

[mpiexec at dpr740] HYDT_bsci_wait_for_completion (lib/tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion

[mpiexec at dpr740] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:189): launcher returned error waiting for completion

[mpiexec at dpr740] main (mpiexec/mpiexec.c:260): process manager error waiting for completion


Regards,

Niyaz