[mpich-discuss] mpi hello-world error

Zhou, Hui zhouh at anl.gov
Mon Jun 17 11:08:22 CDT 2024


Could you set the env var MPIR_CVAR_DEBUG_SUMMARY=1 and rerun the test?
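
For example, something along these lines (shown for illustration only, reusing the host list from your earlier run; adjust to your setup):

MPIR_CVAR_DEBUG_SUMMARY=1 mpirun -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out

That should make MPICH print extra initialization details (such as which libfabric provider it selects), which may help narrow down the address-exchange failure.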

Hui
________________________________
From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Monday, June 17, 2024 11:05 AM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Cc: nd <nd at arm.com>
Subject: Re: mpi hello-world error


Yes, one of the hosts.

I have 2 servers.

Hostname1: dpr740/10.118.91.159

Hostname2: ampere-altra-2-1/10.118.91.158



I am running the application on dpr740





Adding both hosts:



root at dpr740:/mpich/examples# mpirun  -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out



Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL

/opt/mpich/lib/libmpi.so.0(+0x59a0fc) [0xffffa063a0fc]

/opt/mpich/lib/libmpi.so.0(+0x4a6b58) [0xffffa0546b58]

/opt/mpich/lib/libmpi.so.0(+0x564740) [0xffffa0604740]

/opt/mpich/lib/libmpi.so.0(+0x546c14) [0xffffa05e6c14]

/opt/mpich/lib/libmpi.so.0(+0x4f70cc) [0xffffa05970cc]

/opt/mpich/lib/libmpi.so.0(+0x4f9850) [0xffffa0599850]

/opt/mpich/lib/libmpi.so.0(+0x3ffd2c) [0xffffa049fd2c]

/opt/mpich/lib/libmpi.so.0(+0x4017ec) [0xffffa04a17ec]

/opt/mpich/lib/libmpi.so.0(+0x3fe384) [0xffffa049e384]

/opt/mpich/lib/libmpi.so.0(+0x436a64) [0xffffa04d6a64]

/opt/mpich/lib/libmpi.so.0(+0x43700c) [0xffffa04d700c]

/opt/mpich/lib/libmpi.so.0(MPI_Init+0x44) [0xffffa01ceeb4]

./a.out(+0x9c4) [0xaaaacd3309c4]

/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0xffff9ff173fc]

/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xffff9ff174cc]

./a.out(+0x8b0) [0xaaaacd3308b0]

Abort(1) on node 0: Internal error



[mpiexec at dpr740] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)

[mpiexec at dpr740] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error

^C[mpiexec at dpr740] Sending Ctrl-C to processes as requested

[mpiexec at dpr740] Press Ctrl-C again to force abort

[mpiexec at dpr740] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)

[mpiexec at dpr740] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error

[mpiexec at dpr740] HYD_pmcd_pmiserv_send_signal (mpiexec/pmiserv_cb.c:218): unable to write data to proxy

[mpiexec at dpr740] ui_cmd_cb (mpiexec/pmiserv_pmci.c:61): unable to send signal downstream

[mpiexec at dpr740] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status

[mpiexec at dpr740] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event

[mpiexec at dpr740] main (mpiexec/mpiexec.c:260): process manager error waiting for completion





If I just add the remote host, it will run successfully.



root at dpr740:/mpich/examples# mpirun  -n 2 -hosts 10.118.91.158 ./a.out

Hello world from process 0 of 2

Hello world from process 1 of 2













From: Zhou, Hui <zhouh at anl.gov>
Date: Monday, June 17, 2024 at 10:33 AM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>
Subject: Re: mpi hello-world error

Alright. Let's focus on the case of two fixed nodes running



   mpirun  -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out

Is the error consistent every time?

Are you running the command from one of the hosts? Out of curiosity, why do the host names look like they come from two different naming systems?
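
As a side check (illustrative only; fi_info ships with libfabric and appears to be in your PATH per the verbose logs), you could compare what the tcp provider reports on each host:

/opt/libfabric/bin/fi_info -p tcp

For the tcp provider the fabric/domain entries generally correspond to the network interfaces, so running this on both dpr740 and ampere-altra-2-1 and comparing the output may show whether the two nodes are picking up different interfaces.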



--

Hui

________________________________

From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Monday, June 17, 2024 10:23 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Zhou, Hui <zhouh at anl.gov>
Subject: Re: mpi hello-world error




Hi Hui,

Apologies for this, I just assumed more logs would give more information.



Yes, both servers are on the same network.

In the first email, I can run the hello-world application from server1 to server2 and vice versa.



It's only when I add both servers to the host list that the error is seen.




________________________________

From: Zhou, Hui via discuss <discuss at mpich.org>
Sent: Monday, June 17, 2024 9:41:50 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Zhou, Hui <zhouh at anl.gov>
Subject: Re: [mpich-discuss] mpi hello-world error



Niyaz,



I am quite lost on the errors you encountered. The three errors seem all over the place.  Are the two hosts on the same local network?



--

Hui

________________________________

From: Niyaz Murshed via discuss <discuss at mpich.org>
Sent: Monday, June 17, 2024 1:07 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>; nd <nd at arm.com>
Subject: Re: [mpich-discuss] mpi hello-world error




What is the best way to understand this log?





[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR

Abort(680650255) on node 1: Fatal error in internal_Init: Other MPI error, error stack:

internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed

MPII_Init_thread(268)............:

MPIR_init_comm_world(34).........:

MPIR_Comm_commit(823)............:

MPID_Comm_commit_post_hook(222)..:

MPIDI_world_post_init(665).......:

MPIDI_OFI_init_vcis(851).........:

check_num_nics(900)..............:

MPIR_Allreduce_allcomm_auto(4726):

MPIC_Sendrecv(301)...............:

MPID_Isend(63)...................:

MPIDI_isend(35)..................:

(unknown)(): Other MPI error

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS



[proxy:0 at cesw-amp-gbt-2s-m12830-01] Sending upstream hdr.cmd = CMD_STDERR

Abort(680650255) on node 0: Fatal error in internal_Init: Other MPI error, error stack:

internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed

MPII_Init_thread(268)............:

MPIR_init_comm_world(34).........:

MPIR_Comm_commit(823)............:

MPID_Comm_commit_post_hook(222)..:

MPIDI_world_post_init(665).......:

MPIDI_OFI_init_vcis(851).........:

check_num_nics(900)..............:

MPIR_Allreduce_allcomm_auto(4726):

MPIC_Sendrecv(301)...............:

MPID_Isend(63)...................:

MPIDI_isend(35)..................:

(unknown)(): Other MPI error

[proxy:0 at cesw-amp-gbt-2s-m12830-01] Sending upstream hdr.cmd = CMD_EXIT_STATUS







From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Saturday, June 15, 2024 at 10:53 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: Re: [mpich-discuss] mpi hello-world error


Also seeing this error sometimes.





root at dpr740:/mpich/examples# export FI_PROVIDER=tcp

root at dpr740:/mpich/examples# mpirun  -verbose -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out

host: 10.118.91.158

host: 10.118.91.159

[mpiexec at dpr740] Timeout set to -1 (-1 means infinite)



==================================================================================================

mpiexec options:

----------------

  Base path: /opt/mpich/bin/

  Launcher: (null)

  Debug level: 1

  Enable X: -1



  Global environment:

  -------------------

    PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig

    HOSTNAME=dpr740

    HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233

    PWD=/mpich/examples

    HOME=/root

    FI_PROVIDER=tcp

    LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:

    LESSCLOSE=/usr/bin/lesspipe %s %s

    TERM=xterm

    LESSOPEN=| /usr/bin/lesspipe %s

    SHLVL=1

    LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib

    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin

    OLDPWD=/

    _=/opt/mpich/bin/mpirun



  Hydra internal environment:

  ---------------------------

    GFORTRAN_UNBUFFERED_PRECONNECTED=y





    Proxy information:

    *********************

      [1] proxy: 10.118.91.158 (1 cores)

      Exec list: ./a.out (1 processes);



      [2] proxy: 10.118.91.159 (1 cores)

      Exec list: ./a.out (1 processes);





==================================================================================================





Proxy launch args: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id



Arguments being passed to proxy 0:

--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.158 --global-core-map 0,1,2 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_1151_0_1450155337_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 15 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'FI_PROVIDER=tcp' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' 'OLDPWD=/' '_=/opt/mpich/bin/mpirun' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out



Arguments being passed to proxy 1:

--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.159 --global-core-map 0,1,2 --pmi-id-map 0,1 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_1151_0_1450155337_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 15 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'FI_PROVIDER=tcp' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' 'OLDPWD=/' '_=/opt/mpich/bin/mpirun' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out



[mpiexec at dpr740] Launch arguments: /usr/bin/ssh -x -p 2233 10.118.91.158 "/opt/mpich/bin/hydra_pmi_proxy" --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0

[mpiexec at dpr740] Launch arguments: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 1

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PID_LIST

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=init pmi_version=1 pmi_subversion=1

[proxy:1 at dpr740] Sending PMI command:

    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get_maxes

[proxy:1 at dpr740] Sending PMI command:

    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get_appnum

[proxy:1 at dpr740] Sending PMI command:

    cmd=appnum rc=0 appnum=0

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get_my_kvsname

[proxy:1 at dpr740] Sending PMI command:

    cmd=my_kvsname rc=0 kvsname=kvs_1151_0_1450155337_dpr740

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_process_mapping

[proxy:1 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_hwloc_xmlfile

[proxy:1 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_YPoAhr found=TRUE

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:1 at dpr740] Sending upstream internal PMI command:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds



[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=get_result rc=1

[proxy:1 at dpr740] we don't understand the response get_result; forwarding downstream

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:1 at dpr740] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PID_LIST

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=init pmi_version=1 pmi_subversion=1

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_maxes

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_appnum

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=appnum rc=0 appnum=0

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_my_kvsname

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=my_kvsname rc=0 kvsname=kvs_1151_0_1450155337_dpr740

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_process_mapping

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_hwloc_xmlfile

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_68iqm3 found=TRUE

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=get_result rc=1

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:0 at ampere-altra-2-1] we don't understand the response get_result; forwarding downstream

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=barrier_out

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=barrier_out

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:0 at ampere-altra-2-1] [proxy:1 at dpr740] Sending PMI command:

    cmd=barrier_out

Sending upstream hdr.cmd = CMD_PMI

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=barrier_out

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=put kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0 value=0200937DC0A80101[8]

[proxy:0 at ampere-altra-2-1] cached command: -allgather-shm-1-0=0200937DC0A80101[8]

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=put_result rc=0

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]



[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:0 at ampere-altra-2-1] flushing 1 put command(s) out

[proxy:0 at ampere-altra-2-1] forwarding command upstream:

cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]

[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=put kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:1 at dpr740] cached command: -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:1 at dpr740] Sending PMI command:

    cmd=put_result rc=0

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:1 at dpr740] flushing 1 put command(s) out

[proxy:1 at dpr740] forwarding command upstream:

cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:1 at dpr740] Sending upstream internal PMI command:

    cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at dpr740] Sending upstream internal PMI command:

    cmd=barrier_in

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]



[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=keyval_cache -allgather-shm-1-0=0200937DC0A80101[8] -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=keyval_cache -allgather-shm-1-0=0200937DC0A80101[8] -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]

[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=barrier_out

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=barrier_out

[proxy:1 at dpr740] Sending PMI command:

    cmd=barrier_out

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0

[proxy:1 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=0200937DC0A80101[8] found=TRUE

[proxy:1 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1

[proxy:1 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=barrier_out

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0

Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=0200937DC0A80101[8] found=TRUE

[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1

[proxy:0 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE

Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR

/opt/mpich/lib/libmpi.so.0(+0x58c005) [0x7f967920c005]

/opt/mpich/lib/libmpi.so.0(+0x491858) [0x7f9679111858]

/opt/mpich/lib/libmpi.so.0(+0x55428c) [0x7f96791d428c]

/opt/mpich/lib/libmpi.so.0(+0x53402d) [0x7f96791b402d]

/opt/mpich/lib/libmpi.so.0(+0x4dc71f) [0x7f967915c71f]

/opt/mpich/lib/libmpi.so.0(+0x4df09a) [0x7f967915f09a]

/opt/mpich/lib/libmpi.so.0(+0x3deab6) [0x7f967905eab6]

/opt/mpich/lib/libmpi.so.0(+0x3e0732) [0x7f9679060732]

/opt/mpich/lib/libmpi.so.0(+0x3dd075) [0x7f967905d075]

/opt/mpich/lib/libmpi.so.0(+0x418215) [0x7f9679098215]

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR

/opt/mpich/lib/libmpi.so.0(+0x4188fa) [0x7f96790988fa]

/opt/mpich/lib/libmpi.so.0(MPI_Init+0x34) [0x7f9678d57594]

./a.out(+0x121a) [0x55b07f1cc21a]

/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f9678a7cd90]

/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f9678a7ce40]

./a.out(+0x1125) [0x55b07f1cc125]

Abort(1) on node 1: Internal error

/opt/mpich/lib/libmpi.so.0(+0x59a0fc) [0xffff91d0a0fc]

/opt/mpich/lib/libmpi.so.0(+0x4a6b58) [0xffff91c16b58]

/opt/mpich/lib/libmpi.so.0(+0x564740) [0xffff91cd4740]

/opt/mpich/lib/libmpi.so.0(+0x546c14) [0xffff91cb6c14]

/opt/mpich/lib/libmpi.so.0(+0x4f70cc) [0xffff91c670cc]

/opt/mpich/lib/libmpi.so.0(+0x4f9850) [0xffff91c69850]

/opt/mpich/lib/libmpi.so.0(+0x3ffd2c) [0xffff91b6fd2c]

/opt/mpich/lib/libmpi.so.0(+0x4017ec) [0xffff91b717ec]

/opt/mpich/lib/libmpi.so.0(+0x3fe384) [0xffff91b6e384]

/opt/mpich/lib/libmpi.so.0(+0x436a64) [0xffff91ba6a64]

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR

/opt/mpich/lib/libmpi.so.0(+0x43700c) [0xffff91ba700c]

/opt/mpich/lib/libmpi.so.0(MPI_Init+0x44) [0xffff9189eeb4]

./a.out(+0x9c4) [0xaaaab5c709c4]

/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0xffff915e73fc]

/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xffff915e74cc]

./a.out(+0x8b0) [0xaaaab5c708b0]

Abort(1) on node 0: Internal error

[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_EXIT_STATUS

[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS





From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Saturday, June 15, 2024 at 12:10 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: [mpich-discuss] mpi hello-world error


Hello,



I am trying to run the example hellow.c between 2 servers.

I can run it on each server individually and it works fine.
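
For reference, the program is essentially the standard MPI hello world; a minimal sketch is below (it may differ slightly from the actual examples/hellow.c, and it was built with something like "mpicc hellow.c", hence ./a.out):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    /* Initialize MPI; the error stacks quoted above show this call failing */
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Prints e.g. "Hello world from process 0 of 2" */
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}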



10.118.91.158  is the machine I am running on.

10.118.91.159 is the remote machine.



root at dpr740:/mpich/examples# mpirun  -n 2 -hosts 10.118.91.158  ./a.out

Hello world from process 0 of 2

Hello world from process 1 of 2



root at dpr740:/mpich/examples# mpirun  -n 2 -hosts 10.118.91.159  ./a.out

Hello world from process 1 of 2

Hello world from process 0 of 2



However, when I try to run them on both, I get the below error.

realloc(): invalid pointer



Is this a known issue? Any suggestions?





root at dpr740:/mpich/examples# mpirun -verbose  -n 2 -hosts 10.118.91.159,10.118.91.158  ./a.out

host: 10.118.91.159

host: 10.118.91.158

[mpiexec at dpr740] Timeout set to -1 (-1 means infinite)



==================================================================================================

mpiexec options:

----------------

  Base path: /opt/mpich/bin/

  Launcher: (null)

  Debug level: 1

  Enable X: -1



  Global environment:

  -------------------

    PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig

    HOSTNAME=dpr740

    HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233

    PWD=/mpich/examples

    HOME=/root

    LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:

    LESSCLOSE=/usr/bin/lesspipe %s %s

    TERM=xterm

    LESSOPEN=| /usr/bin/lesspipe %s

    SHLVL=1

    LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib

    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin

    _=/opt/mpich/bin/mpirun

    OLDPWD=/



  Hydra internal environment:

  ---------------------------

    GFORTRAN_UNBUFFERED_PRECONNECTED=y





    Proxy information:

    *********************

      [1] proxy: 10.118.91.159 (1 cores)

      Exec list: ./a.out (1 processes);



      [2] proxy: 10.118.91.158 (1 cores)

      Exec list: ./a.out (1 processes);





==================================================================================================





Proxy launch args: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id



Arguments being passed to proxy 0:

--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.159 --global-core-map 0,1,2 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_844_0_801938186_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 14 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' '_=/opt/mpich/bin/mpirun' 'OLDPWD=/' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out



Arguments being passed to proxy 1:

--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.158 --global-core-map 0,1,2 --pmi-id-map 0,1 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_844_0_801938186_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 14 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' '_=/opt/mpich/bin/mpirun' 'OLDPWD=/' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out



[mpiexec at dpr740] Launch arguments: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0

[mpiexec at dpr740] Launch arguments: /usr/bin/ssh -x -p 2233 10.118.91.158 "/opt/mpich/bin/hydra_pmi_proxy" --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 1

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PID_LIST

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=init pmi_version=1 pmi_subversion=1

[proxy:0 at dpr740] Sending PMI command:

    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get_maxes

[proxy:0 at dpr740] Sending PMI command:

    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get_appnum

[proxy:0 at dpr740] Sending PMI command:

    cmd=appnum rc=0 appnum=0

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get_my_kvsname

[proxy:0 at dpr740] Sending PMI command:

    cmd=my_kvsname rc=0 kvsname=kvs_844_0_801938186_dpr740

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_process_mapping

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_hwloc_xmlfile

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_CeNRJN found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=get_result rc=1

[proxy:0 at dpr740] we don't understand the response get_result; forwarding downstream

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PID_LIST

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=init pmi_version=1 pmi_subversion=1

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_maxes

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_appnum

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=appnum rc=0 appnum=0

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get_my_kvsname

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=my_kvsname rc=0 kvsname=kvs_844_0_801938186_dpr740

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_process_mapping

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_hwloc_xmlfile

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_xv8EIG found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds



[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=get_result rc=1

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] we don't understand the response get_result; forwarding downstream

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=barrier_out

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=barrier_out

[proxy:0 at dpr740] Sending PMI command:

    cmd=barrier_out

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=barrier_out

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=put kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] [proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=put kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1 value=0200A8BFC0A80101[8]

cached command: -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] Sending PMI command:

    cmd=put_result rc=0

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:0 at dpr740] flushing 1 put command(s) out

[proxy:0 at dpr740] forwarding command upstream:

cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] cached command: -allgather-shm-1-1=0200A8BFC0A80101[8]

[proxy:1 at ampere-altra-2-1] Sending PMI command:

[proxy:0 at dpr740] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]



    cmd=put_result rc=0

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]



[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] flushing 1 put command(s) out

[proxy:1 at ampere-altra-2-1] forwarding command upstream:

[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in



[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=keyval_cache -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] -allgather-shm-1-1=0200A8BFC0A80101[8]

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=keyval_cache -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] -allgather-shm-1-1=0200A8BFC0A80101[8]

[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):

    cmd=barrier_out

[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):

    cmd=barrier_out

[proxy:0 at dpr740] Sending PMI command:

    cmd=barrier_out

cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:

    cmd=barrier_in

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE

[proxy:0 at dpr740] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1

[proxy:0 at dpr740] Sending PMI command:

    cmd=get_result rc=0 value=0200A8BFC0A80101[8] found=TRUE

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=barrier_out

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE

[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:

    cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1

[proxy:1 at ampere-altra-2-1] Sending PMI command:

    cmd=get_result rc=0 value=0200A8BFC0A80101[8] found=TRUE

realloc(): invalid pointer

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR



===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 2404 RUNNING AT 10.118.91.158

=   EXIT CODE: 134

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS

[proxy:0 at dpr740] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:484): assert (!closed) failed

[proxy:0 at dpr740] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status

[proxy:0 at dpr740] main (proxy/pmip.c:122): demux engine error waiting for event

[mpiexec at dpr740] HYDT_bscu_wait_for_completion (lib/tools/bootstrap/utils/bscu_wait.c:109): one of the processes terminated badly; aborting

[mpiexec at dpr740] HYDT_bsci_wait_for_completion (lib/tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion

[mpiexec at dpr740] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:189): launcher returned error waiting for completion

[mpiexec at dpr740] main (mpiexec/mpiexec.c:260): process manager error waiting for completion







Regards,

Niyaz
