[mpich-discuss] mpi hello-world error
Zhou, Hui
zhouh at anl.gov
Mon Jun 17 09:41:50 CDT 2024
Niyaz,
I am quite lost on the errors you encountered; the three errors seem to be all over the place. Are the two hosts on the same local network?
--
Hui
________________________________
From: Niyaz Murshed via discuss <discuss at mpich.org>
Sent: Monday, June 17, 2024 1:07 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>; nd <nd at arm.com>
Subject: Re: [mpich-discuss] mpi hello-world error
What is the best way to understand this log?
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
Abort(680650255) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(665).......:
MPIDI_OFI_init_vcis(851).........:
check_num_nics(900)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(301)...............:
MPID_Isend(63)...................:
MPIDI_isend(35)..................:
(unknown)(): Other MPI error
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS
[proxy:0 at cesw-amp-gbt-2s-m12830-01] Sending upstream hdr.cmd = CMD_STDERR
Abort(680650255) on node 0: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(665).......:
MPIDI_OFI_init_vcis(851).........:
check_num_nics(900)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(301)...............:
MPID_Isend(63)...................:
MPIDI_isend(35)..................:
(unknown)(): Other MPI error
[proxy:0 at cesw-amp-gbt-2s-m12830-01] Sending upstream hdr.cmd = CMD_EXIT_STATUS
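(Editor's note, not part of the original message: the error stack bottoms out inside the OFI netmod, MPIDI_OFI_init_vcis / check_num_nics, so one thing worth checking is which libfabric providers and interfaces each host actually exposes. The sketch below is a minimal query using the public libfabric fi_getinfo() API; the fi_info command-line tool from fabtests reports the same information. Build it with something like cc probe.c -lfabric and compare the output on both nodes.)

/* probe.c - list the libfabric providers visible on this node.
 * Hypothetical helper, not part of the MPICH examples. */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *info = NULL, *cur;

    /* Ask libfabric for every provider it can offer (no hints). */
    int ret = fi_getinfo(FI_VERSION(1, 6), NULL, NULL, 0, NULL, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo failed: %s\n", fi_strerror(-ret));
        return 1;
    }
    for (cur = info; cur != NULL; cur = cur->next) {
        printf("provider: %-10s fabric: %-20s domain: %s\n",
               cur->fabric_attr->prov_name,
               cur->fabric_attr->name,
               cur->domain_attr->name);
    }
    fi_freeinfo(info);
    return 0;
}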
From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Saturday, June 15, 2024 at 10:53 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: Re: [mpich-discuss] mpi hello-world error
Also seeing this error sometimes.
root at dpr740:/mpich/examples# export FI_PROVIDER=tcp
root at dpr740:/mpich/examples# mpirun -verbose -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out
host: 10.118.91.158
host: 10.118.91.159
[mpiexec at dpr740] Timeout set to -1 (-1 means infinite)
==================================================================================================
mpiexec options:
----------------
Base path: /opt/mpich/bin/
Launcher: (null)
Debug level: 1
Enable X: -1
Global environment:
-------------------
PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig
HOSTNAME=dpr740
HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233
PWD=/mpich/examples
HOME=/root
FI_PROVIDER=tcp
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
LESSCLOSE=/usr/bin/lesspipe %s %s
TERM=xterm
LESSOPEN=| /usr/bin/lesspipe %s
SHLVL=1
LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin
OLDPWD=/
_=/opt/mpich/bin/mpirun
Hydra internal environment:
---------------------------
GFORTRAN_UNBUFFERED_PRECONNECTED=y
Proxy information:
*********************
[1] proxy: 10.118.91.158 (1 cores)
Exec list: ./a.out (1 processes);
[2] proxy: 10.118.91.159 (1 cores)
Exec list: ./a.out (1 processes);
==================================================================================================
Proxy launch args: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id
Arguments being passed to proxy 0:
--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.158 --global-core-map 0,1,2 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_1151_0_1450155337_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 15 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'FI_PROVIDER=tcp' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' 'OLDPWD=/' '_=/opt/mpich/bin/mpirun' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out
Arguments being passed to proxy 1:
--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.159 --global-core-map 0,1,2 --pmi-id-map 0,1 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_1151_0_1450155337_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 15 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'FI_PROVIDER=tcp' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' 'OLDPWD=/' '_=/opt/mpich/bin/mpirun' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out
[mpiexec at dpr740] Launch arguments: /usr/bin/ssh -x -p 2233 10.118.91.158 "/opt/mpich/bin/hydra_pmi_proxy" --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0
[mpiexec at dpr740] Launch arguments: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 1
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=init pmi_version=1 pmi_subversion=1
[proxy:1 at dpr740] Sending PMI command:
cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get_maxes
[proxy:1 at dpr740] Sending PMI command:
cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get_appnum
[proxy:1 at dpr740] Sending PMI command:
cmd=appnum rc=0 appnum=0
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get_my_kvsname
[proxy:1 at dpr740] Sending PMI command:
cmd=my_kvsname rc=0 kvsname=kvs_1151_0_1450155337_dpr740
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_process_mapping
[proxy:1 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_hwloc_xmlfile
[proxy:1 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_YPoAhr found=TRUE
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:1 at dpr740] Sending upstream internal PMI command:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=get_result rc=1
[proxy:1 at dpr740] we don't understand the response get_result; forwarding downstream
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:1 at dpr740] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=init pmi_version=1 pmi_subversion=1
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_maxes
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_appnum
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=appnum rc=0 appnum=0
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_my_kvsname
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=my_kvsname rc=0 kvsname=kvs_1151_0_1450155337_dpr740
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_process_mapping
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_hwloc_xmlfile
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_68iqm3 found=TRUE
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=get_result rc=1
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at ampere-altra-2-1] we don't understand the response get_result; forwarding downstream
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=barrier_out
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=barrier_out
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:0 at ampere-altra-2-1] [proxy:1 at dpr740] Sending PMI command:
cmd=barrier_out
Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=barrier_out
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=put kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0 value=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] cached command: -allgather-shm-1-0=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=put_result rc=0
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:0 at ampere-altra-2-1] flushing 1 put command(s) out
[proxy:0 at ampere-altra-2-1] forwarding command upstream:
cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=put kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] cached command: -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending PMI command:
cmd=put_result rc=0
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:1 at dpr740] flushing 1 put command(s) out
[proxy:1 at dpr740] forwarding command upstream:
cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending upstream internal PMI command:
cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at dpr740] Sending upstream internal PMI command:
cmd=barrier_in
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=keyval_cache -allgather-shm-1-0=0200937DC0A80101[8] -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=keyval_cache -allgather-shm-1-0=0200937DC0A80101[8] -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=barrier_out
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=barrier_out
[proxy:1 at dpr740] Sending PMI command:
cmd=barrier_out
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0
[proxy:1 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=0200937DC0A80101[8] found=TRUE
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1
[proxy:1 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=barrier_out
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0
Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=0200937DC0A80101[8] found=TRUE
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE
Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR
/opt/mpich/lib/libmpi.so.0(+0x58c005) [0x7f967920c005]
/opt/mpich/lib/libmpi.so.0(+0x491858) [0x7f9679111858]
/opt/mpich/lib/libmpi.so.0(+0x55428c) [0x7f96791d428c]
/opt/mpich/lib/libmpi.so.0(+0x53402d) [0x7f96791b402d]
/opt/mpich/lib/libmpi.so.0(+0x4dc71f) [0x7f967915c71f]
/opt/mpich/lib/libmpi.so.0(+0x4df09a) [0x7f967915f09a]
/opt/mpich/lib/libmpi.so.0(+0x3deab6) [0x7f967905eab6]
/opt/mpich/lib/libmpi.so.0(+0x3e0732) [0x7f9679060732]
/opt/mpich/lib/libmpi.so.0(+0x3dd075) [0x7f967905d075]
/opt/mpich/lib/libmpi.so.0(+0x418215) [0x7f9679098215]
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR
/opt/mpich/lib/libmpi.so.0(+0x4188fa) [0x7f96790988fa]
/opt/mpich/lib/libmpi.so.0(MPI_Init+0x34) [0x7f9678d57594]
./a.out(+0x121a) [0x55b07f1cc21a]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f9678a7cd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f9678a7ce40]
./a.out(+0x1125) [0x55b07f1cc125]
Abort(1) on node 1: Internal error
/opt/mpich/lib/libmpi.so.0(+0x59a0fc) [0xffff91d0a0fc]
/opt/mpich/lib/libmpi.so.0(+0x4a6b58) [0xffff91c16b58]
/opt/mpich/lib/libmpi.so.0(+0x564740) [0xffff91cd4740]
/opt/mpich/lib/libmpi.so.0(+0x546c14) [0xffff91cb6c14]
/opt/mpich/lib/libmpi.so.0(+0x4f70cc) [0xffff91c670cc]
/opt/mpich/lib/libmpi.so.0(+0x4f9850) [0xffff91c69850]
/opt/mpich/lib/libmpi.so.0(+0x3ffd2c) [0xffff91b6fd2c]
/opt/mpich/lib/libmpi.so.0(+0x4017ec) [0xffff91b717ec]
/opt/mpich/lib/libmpi.so.0(+0x3fe384) [0xffff91b6e384]
/opt/mpich/lib/libmpi.so.0(+0x436a64) [0xffff91ba6a64]
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
/opt/mpich/lib/libmpi.so.0(+0x43700c) [0xffff91ba700c]
/opt/mpich/lib/libmpi.so.0(MPI_Init+0x44) [0xffff9189eeb4]
./a.out(+0x9c4) [0xaaaab5c709c4]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0xffff915e73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xffff915e74cc]
./a.out(+0x8b0) [0xaaaab5c708b0]
Abort(1) on node 0: Internal error
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_EXIT_STATUS
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS
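(Editor's note, not part of the original message: the backtraces show an x86_64 dpr740 and an aarch64 Ampere Altra, so another quick sanity check is confirming that both hosts run the same MPICH build. MPI_Get_library_version may be called before MPI_Init, so it still works when MPI_Init itself is what fails; a minimal sketch, compiled with mpicc and run directly on each host:)

/* version.c - print the MPI library version string on this node.
 * Hypothetical helper for comparing the MPICH builds on the two hosts. */
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;

    /* Legal to call before MPI_Init, so it works even when MPI_Init fails. */
    MPI_Get_library_version(version, &len);
    printf("%s\n", version);
    return 0;
}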
From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Saturday, June 15, 2024 at 12:10 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: [mpich-discuss] mpi hello-world error
Hello,
I am trying to run the example hellow.c between two servers.
I can run it on each server individually and it works fine.
10.118.91.158 is the machine I am running on.
10.118.91.159 is the remote machine.
root at dpr740:/mpich/examples# mpirun -n 2 -hosts 10.118.91.158 ./a.out
Hello world from process 0 of 2
Hello world from process 1 of 2
root at dpr740:/mpich/examples# mpirun -n 2 -hosts 10.118.91.159 ./a.out
Hello world from process 1 of 2
Hello world from process 0 of 2
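(Editor's note, not part of the original message: for reference, the program being run is essentially the standard MPI hello world. The sketch below is a from-memory approximation; the actual hellow.c shipped in mpich/examples may differ in detail.)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    /* Passing NULL is legal in MPI-2 and later; the real example may pass
       &argc and &argv instead. */
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}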
However, when I try to run it across both servers, I get the error below:
realloc(): invalid pointer
Is this a known issue? Any suggestions?
root at dpr740:/mpich/examples# mpirun -verbose -n 2 -hosts 10.118.91.159,10.118.91.158 ./a.out
host: 10.118.91.159
host: 10.118.91.158
[mpiexec at dpr740] Timeout set to -1 (-1 means infinite)
==================================================================================================
mpiexec options:
----------------
Base path: /opt/mpich/bin/
Launcher: (null)
Debug level: 1
Enable X: -1
Global environment:
-------------------
PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig
HOSTNAME=dpr740
HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233
PWD=/mpich/examples
HOME=/root
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
LESSCLOSE=/usr/bin/lesspipe %s %s
TERM=xterm
LESSOPEN=| /usr/bin/lesspipe %s
SHLVL=1
LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin
_=/opt/mpich/bin/mpirun
OLDPWD=/
Hydra internal environment:
---------------------------
GFORTRAN_UNBUFFERED_PRECONNECTED=y
Proxy information:
*********************
[1] proxy: 10.118.91.159 (1 cores)
Exec list: ./a.out (1 processes);
[2] proxy: 10.118.91.158 (1 cores)
Exec list: ./a.out (1 processes);
==================================================================================================
Proxy launch args: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id
Arguments being passed to proxy 0:
--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.159 --global-core-map 0,1,2 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_844_0_801938186_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 14 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' '_=/opt/mpich/bin/mpirun' 'OLDPWD=/' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out
Arguments being passed to proxy 1:
--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.158 --global-core-map 0,1,2 --pmi-id-map 0,1 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_844_0_801938186_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 14 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' '_=/opt/mpich/bin/mpirun' 'OLDPWD=/' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out
[mpiexec at dpr740] Launch arguments: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0
[mpiexec at dpr740] Launch arguments: /usr/bin/ssh -x -p 2233 10.118.91.158 "/opt/mpich/bin/hydra_pmi_proxy" --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 1
[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=init pmi_version=1 pmi_subversion=1
[proxy:0 at dpr740] Sending PMI command:
cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get_maxes
[proxy:0 at dpr740] Sending PMI command:
cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get_appnum
[proxy:0 at dpr740] Sending PMI command:
cmd=appnum rc=0 appnum=0
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get_my_kvsname
[proxy:0 at dpr740] Sending PMI command:
cmd=my_kvsname rc=0 kvsname=kvs_844_0_801938186_dpr740
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_process_mapping
[proxy:0 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_hwloc_xmlfile
[proxy:0 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_CeNRJN found=TRUE
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:0 at dpr740] Sending upstream internal PMI command:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=get_result rc=1
[proxy:0 at dpr740] we don't understand the response get_result; forwarding downstream
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=init pmi_version=1 pmi_subversion=1
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_maxes
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_appnum
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=appnum rc=0 appnum=0
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_my_kvsname
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=my_kvsname rc=0 kvsname=kvs_844_0_801938186_dpr740
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_process_mapping
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_hwloc_xmlfile
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_xv8EIG found=TRUE
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:0 at dpr740] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=get_result rc=1
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at ampere-altra-2-1] we don't understand the response get_result; forwarding downstream
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=barrier_out
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=barrier_out
[proxy:0 at dpr740] Sending PMI command:
cmd=barrier_out
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=barrier_out
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=put kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:0 at dpr740] [proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=put kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1 value=0200A8BFC0A80101[8]
cached command: -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:0 at dpr740] Sending PMI command:
cmd=put_result rc=0
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:0 at dpr740] flushing 1 put command(s) out
[proxy:0 at dpr740] forwarding command upstream:
cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:0 at dpr740] Sending upstream internal PMI command:
cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at ampere-altra-2-1] cached command: -allgather-shm-1-1=0200A8BFC0A80101[8]
[proxy:1 at ampere-altra-2-1] Sending PMI command:
[proxy:0 at dpr740] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]
cmd=put_result rc=0
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:1 at ampere-altra-2-1] flushing 1 put command(s) out
[proxy:1 at ampere-altra-2-1] forwarding command upstream:
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=keyval_cache -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] -allgather-shm-1-1=0200A8BFC0A80101[8]
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=keyval_cache -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] -allgather-shm-1-1=0200A8BFC0A80101[8]
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=barrier_out
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=barrier_out
[proxy:0 at dpr740] Sending PMI command:
cmd=barrier_out
cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]
[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0
[proxy:0 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1
[proxy:0 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=0200A8BFC0A80101[8] found=TRUE
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=barrier_out
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=0200A8BFC0A80101[8] found=TRUE
realloc(): invalid pointer
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 2404 RUNNING AT 10.118.91.158
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS
[proxy:0 at dpr740] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:484): assert (!closed) failed
[proxy:0 at dpr740] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0 at dpr740] main (proxy/pmip.c:122): demux engine error waiting for event
[mpiexec at dpr740] HYDT_bscu_wait_for_completion (lib/tools/bootstrap/utils/bscu_wait.c:109): one of the processes terminated badly; aborting
[mpiexec at dpr740] HYDT_bsci_wait_for_completion (lib/tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion
[mpiexec at dpr740] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:189): launcher returned error waiting for completion
[mpiexec at dpr740] main (mpiexec/mpiexec.c:260): process manager error waiting for completion
Regards,
Niyaz