[mpich-discuss] mpi hello-world error
Niyaz Murshed
Niyaz.Murshed at arm.com
Mon Jun 17 11:59:34 CDT 2024
I see there is an option called "-iface":
-iface network interface to use
However, it did not help.
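A sketch of the two angles, assuming the tcp/sockets providers are in use; eth0 is only a placeholder interface name (check `ip addr` for the real one), and FI_TCP_IFACE / FI_SOCKETS_IFACE are libfabric provider-level variables rather than MPICH options:

mpirun -iface eth0 -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out
FI_TCP_IFACE=eth0 FI_SOCKETS_IFACE=eth0 mpirun -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out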
From: Zhou, Hui <zhouh at anl.gov>
Date: Monday, June 17, 2024 at 11:48 AM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>
Cc: nd <nd at arm.com>
Subject: Re: mpi hello-world error
It is picking 192.168.1.1 as the local address rather than 10.118.91.158. Try using 192.168.1.x for both hosts, or remove the 192.168.1.x network. I don't think we have a way to select the NIC interface. We'll put that in our plans.
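A sketch of both suggestions as commands; 192.168.1.2 and the interface name eno1 are placeholders, not values taken from this thread:

# run over the 192.168.1.x network, assuming both hosts have an address on it
mpirun -n 2 -hosts 192.168.1.1,192.168.1.2 ./a.out
# or take the 192.168.1.x interface down so only 10.118.91.x remains visible
ip link set eno1 down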
--
Hui
________________________________
From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Monday, June 17, 2024 11:10 AM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Cc: nd <nd at arm.com>
Subject: Re: mpi hello-world error
Please find below:
Interestingly, I don't see the verbs provider in the list.
root at dpr740:/mpich/examples# mpirun -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out
==== Various sizes and limits ====
sizeof(MPIDI_per_vci_t): 192
Required minimum FI_VERSION: 0, current version: 10015
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp;ofi_rxm, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN [16] 192.168.1.1
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN [16] 10.118.91.158
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN [16] 127.0.0.1
provider: udp;ofi_rxd, score = 5, pref = -2, FI_SOCKADDR_IN6 [28] ::1
provider: shm, score = 4, pref = -2, FI_ADDR_STR [13] - fi_shm://694
provider: shm, score = 4, pref = -2, FI_ADDR_STR [13] - fi_shm://694
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: udp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp, score = 0, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: tcp, score = 4, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: sockets, score = 3, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 192.168.1.1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] fe80::42a6:b7ff:fe28:c008
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 10.118.91.158
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] fe80::f66b:8cff:fe55:657c
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN [16] 127.0.0.1
provider: sockets, score = 5, pref = 0, FI_SOCKADDR_IN6 [28] ::1
provider: sm2, score = 3, pref = 0, FI_ADDR_STR [12] - fi_sm2://694
provider: sm2, score = 3, pref = 0, FI_ADDR_STR [12] - fi_sm2://694
Required minimum FI_VERSION: 10005, current version: 10015
==== Capability set configuration ====
libfabric provider: sockets - 192.168.1.0/24
MPIDI_OFI_ENABLE_DATA: 1
MPIDI_OFI_ENABLE_AV_TABLE: 1
MPIDI_OFI_ENABLE_SCALABLE_ENDPOINTS: 1
MPIDI_OFI_ENABLE_SHARED_CONTEXTS: 0
MPIDI_OFI_ENABLE_MR_VIRT_ADDRESS: 0
MPIDI_OFI_ENABLE_MR_ALLOCATED: 0
MPIDI_OFI_ENABLE_MR_REGISTER_NULL: 1
MPIDI_OFI_ENABLE_MR_PROV_KEY: 0
MPIDI_OFI_ENABLE_TAGGED: 1
MPIDI_OFI_ENABLE_AM: 1
MPIDI_OFI_ENABLE_RMA: 1
MPIDI_OFI_ENABLE_ATOMICS: 1
MPIDI_OFI_FETCH_ATOMIC_IOVECS: 1
MPIDI_OFI_ENABLE_DATA_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_CONTROL_AUTO_PROGRESS: 0
MPIDI_OFI_ENABLE_PT2PT_NOPACK: 1
MPIDI_OFI_ENABLE_TRIGGERED: 0
MPIDI_OFI_ENABLE_HMEM: 0
MPIDI_OFI_NUM_AM_BUFFERS: 8
MPIDI_OFI_NUM_OPTIMIZED_MEMORY_REGIONS: 0
MPIDI_OFI_CONTEXT_BITS: 20
MPIDI_OFI_SOURCE_BITS: 0
MPIDI_OFI_TAG_BITS: 31
MPIDI_OFI_VNI_USE_DOMAIN: 1
MAXIMUM SUPPORTED RANKS: 4294967296
MAXIMUM TAG: 2147483648
==== Provider global thresholds ====
max_buffered_send: 255
max_buffered_write: 255
max_msg_size: 9223372036854775807
max_order_raw: -1
max_order_war: -1
max_order_waw: -1
tx_iov_limit: 8
rx_iov_limit: 8
rma_iov_limit: 8
max_mr_key_size: 8
==== Various sizes and limits ====
MPIDI_OFI_AM_MSG_HEADER_SIZE: 24
MPIDI_OFI_MAX_AM_HDR_SIZE: 255
sizeof(MPIDI_OFI_am_request_header_t): 416
sizeof(MPIDI_OFI_per_vci_t): 52480
MPIDI_OFI_AM_HDR_POOL_CELL_SIZE: 1024
MPIDI_OFI_DEFAULT_SHORT_SEND_SIZE: 16384
Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL
/opt/mpich/lib/libmpi.so.0(+0x59a0fc) [0xffffa661a0fc]
/opt/mpich/lib/libmpi.so.0(+0x4a6b58) [0xffffa6526b58]
/opt/mpich/lib/libmpi.so.0(+0x564740) [0xffffa65e4740]
/opt/mpich/lib/libmpi.so.0(+0x546c14) [0xffffa65c6c14]
/opt/mpich/lib/libmpi.so.0(+0x4f70cc) [0xffffa65770cc]
/opt/mpich/lib/libmpi.so.0(+0x4f9850) [0xffffa6579850]
/opt/mpich/lib/libmpi.so.0(+0x3ffd2c) [0xffffa647fd2c]
/opt/mpich/lib/libmpi.so.0(+0x4017ec) [0xffffa64817ec]
/opt/mpich/lib/libmpi.so.0(+0x3fe384) [0xffffa647e384]
/opt/mpich/lib/libmpi.so.0(+0x436a64) [0xffffa64b6a64]
/opt/mpich/lib/libmpi.so.0(+0x43700c) [0xffffa64b700c]
/opt/mpich/lib/libmpi.so.0(MPI_Init+0x44) [0xffffa61aeeb4]
./a.out(+0x9c4) [0xaaaac2a209c4]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0xffffa5ef73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xffffa5ef74cc]
./a.out(+0x8b0) [0xaaaac2a208b0]
Abort(1) on node 0: Internal error
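As a side note on the missing verbs provider: libfabric's fi_info utility (assuming the libfabric/fabtests binaries on PATH in this setup) can list what each provider reports outside of MPICH, e.g.:

fi_info -p verbs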
From: Zhou, Hui <zhouh at anl.gov>
Date: Monday, June 17, 2024 at 11:08 AM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>
Cc: nd <nd at arm.com>
Subject: Re: mpi hello-world error
Could you set the environment variable MPIR_CVAR_DEBUG_SUMMARY=1 and rerun the test?
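For example, keeping the same command and just adding the variable (the inherited environment is forwarded to both ranks, as the proxy arguments later in the thread show):

MPIR_CVAR_DEBUG_SUMMARY=1 mpirun -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out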
Hui
________________________________
From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Monday, June 17, 2024 11:05 AM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>
Cc: nd <nd at arm.com>
Subject: Re: mpi hello-world error
Yes, one of the hosts.
I have 2 servers.
Hostname1: dpr740/10.118.91.159
Hostname2: ampere-altra-2-1/10.118.91.158
I am running the application on dpr740.
Adding both hosts:
root at dpr740:/mpich/examples# mpirun -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out
Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL
/opt/mpich/lib/libmpi.so.0(+0x59a0fc) [0xffffa063a0fc]
/opt/mpich/lib/libmpi.so.0(+0x4a6b58) [0xffffa0546b58]
/opt/mpich/lib/libmpi.so.0(+0x564740) [0xffffa0604740]
/opt/mpich/lib/libmpi.so.0(+0x546c14) [0xffffa05e6c14]
/opt/mpich/lib/libmpi.so.0(+0x4f70cc) [0xffffa05970cc]
/opt/mpich/lib/libmpi.so.0(+0x4f9850) [0xffffa0599850]
/opt/mpich/lib/libmpi.so.0(+0x3ffd2c) [0xffffa049fd2c]
/opt/mpich/lib/libmpi.so.0(+0x4017ec) [0xffffa04a17ec]
/opt/mpich/lib/libmpi.so.0(+0x3fe384) [0xffffa049e384]
/opt/mpich/lib/libmpi.so.0(+0x436a64) [0xffffa04d6a64]
/opt/mpich/lib/libmpi.so.0(+0x43700c) [0xffffa04d700c]
/opt/mpich/lib/libmpi.so.0(MPI_Init+0x44) [0xffffa01ceeb4]
./a.out(+0x9c4) [0xaaaacd3309c4]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0xffff9ff173fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xffff9ff174cc]
./a.out(+0x8b0) [0xaaaacd3308b0]
Abort(1) on node 0: Internal error
[mpiexec at dpr740] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)
[mpiexec at dpr740] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error
^C[mpiexec at dpr740] Sending Ctrl-C to processes as requested
[mpiexec at dpr740] Press Ctrl-C again to force abort
[mpiexec at dpr740] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)
[mpiexec at dpr740] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error
[mpiexec at dpr740] HYD_pmcd_pmiserv_send_signal (mpiexec/pmiserv_cb.c:218): unable to write data to proxy
[mpiexec at dpr740] ui_cmd_cb (mpiexec/pmiserv_pmci.c:61): unable to send signal downstream
[mpiexec at dpr740] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at dpr740] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event
[mpiexec at dpr740] main (mpiexec/mpiexec.c:260): process manager error waiting for completion
If I just add the remote host, it will run successfully.
root at dpr740:/mpich/examples# mpirun -n 2 -hosts 10.118.91.158 ./a.out
Hello world from process 0 of 2
Hello world from process 1 of 2
From: Zhou, Hui <zhouh at anl.gov>
Date: Monday, June 17, 2024 at 10:33 AM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>
Subject: Re: mpi hello-world error
Alright. Let's focus on the case of two fixed nodes running
mpirun -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out
Is the error consistent every time?
Are you running the command from one of the hosts? Out of curiosity, why do the host names look like they come from two different naming systems?
--
Hui
________________________________
From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Monday, June 17, 2024 10:23 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Zhou, Hui <zhouh at anl.gov>
Subject: Re: mpi hello-world error
Hi Hui,
Apologies for this, I just assumed more logs would give more information.
Yes, both servers are on the same network.
In the first email, I can run the hello-world application from server1 to server2 and vice versa.
It's only when I add both servers to the parameters that the error is seen.
________________________________
From: Zhou, Hui via discuss <discuss at mpich.org>
Sent: Monday, June 17, 2024 9:41:50 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Zhou, Hui <zhouh at anl.gov>
Subject: Re: [mpich-discuss] mpi hello-world error
Niyaz,
I am quite lost on the errors you encountered. The three errors seem all over the place. Are the two hosts on the same local network?
--
Hui
________________________________
From: Niyaz Murshed via discuss <discuss at mpich.org>
Sent: Monday, June 17, 2024 1:07 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>; nd <nd at arm.com>
Subject: Re: [mpich-discuss] mpi hello-world error
What is the best way to understand this log?
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
Abort(680650255) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(665).......:
MPIDI_OFI_init_vcis(851).........:
check_num_nics(900)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(301)...............:
MPID_Isend(63)...................:
MPIDI_isend(35)..................:
(unknown)(): Other MPI error
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS
[proxy:0 at cesw-amp-gbt-2s-m12830-01] Sending upstream hdr.cmd = CMD_STDERR
Abort(680650255) on node 0: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70)................: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(665).......:
MPIDI_OFI_init_vcis(851).........:
check_num_nics(900)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(301)...............:
MPID_Isend(63)...................:
MPIDI_isend(35)..................:
(unknown)(): Other MPI error
[proxy:0 at cesw-amp-gbt-2s-m12830-01] Sending upstream hdr.cmd = CMD_EXIT_STATUS
From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Saturday, June 15, 2024 at 10:53 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: Re: [mpich-discuss] mpi hello-world error
Also seeing this error sometimes.
root at dpr740:/mpich/examples# export FI_PROVIDER=tcp
root at dpr740:/mpich/examples# mpirun -verbose -n 2 -hosts 10.118.91.158,10.118.91.159 ./a.out
host: 10.118.91.158
host: 10.118.91.159
[mpiexec at dpr740] Timeout set to -1 (-1 means infinite)
==================================================================================================
mpiexec options:
----------------
Base path: /opt/mpich/bin/
Launcher: (null)
Debug level: 1
Enable X: -1
Global environment:
-------------------
PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig
HOSTNAME=dpr740
HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233
PWD=/mpich/examples
HOME=/root
FI_PROVIDER=tcp
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
LESSCLOSE=/usr/bin/lesspipe %s %s
TERM=xterm
LESSOPEN=| /usr/bin/lesspipe %s
SHLVL=1
LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin
OLDPWD=/
_=/opt/mpich/bin/mpirun
Hydra internal environment:
---------------------------
GFORTRAN_UNBUFFERED_PRECONNECTED=y
Proxy information:
*********************
[1] proxy: 10.118.91.158 (1 cores)
Exec list: ./a.out (1 processes);
[2] proxy: 10.118.91.159 (1 cores)
Exec list: ./a.out (1 processes);
==================================================================================================
Proxy launch args: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id
Arguments being passed to proxy 0:
--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.158 --global-core-map 0,1,2 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_1151_0_1450155337_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 15 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'FI_PROVIDER=tcp' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' 'OLDPWD=/' '_=/opt/mpich/bin/mpirun' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out
Arguments being passed to proxy 1:
--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.159 --global-core-map 0,1,2 --pmi-id-map 0,1 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_1151_0_1450155337_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 15 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'FI_PROVIDER=tcp' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' 'OLDPWD=/' '_=/opt/mpich/bin/mpirun' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out
[mpiexec at dpr740] Launch arguments: /usr/bin/ssh -x -p 2233 10.118.91.158 "/opt/mpich/bin/hydra_pmi_proxy" --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0
[mpiexec at dpr740] Launch arguments: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:35625 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 1
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=init pmi_version=1 pmi_subversion=1
[proxy:1 at dpr740] Sending PMI command:
cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get_maxes
[proxy:1 at dpr740] Sending PMI command:
cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get_appnum
[proxy:1 at dpr740] Sending PMI command:
cmd=appnum rc=0 appnum=0
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get_my_kvsname
[proxy:1 at dpr740] Sending PMI command:
cmd=my_kvsname rc=0 kvsname=kvs_1151_0_1450155337_dpr740
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_process_mapping
[proxy:1 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_hwloc_xmlfile
[proxy:1 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_YPoAhr found=TRUE
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:1 at dpr740] Sending upstream internal PMI command:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=get_result rc=1
[proxy:1 at dpr740] we don't understand the response get_result; forwarding downstream
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:1 at dpr740] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=init pmi_version=1 pmi_subversion=1
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_maxes
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_appnum
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=appnum rc=0 appnum=0
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_my_kvsname
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=my_kvsname rc=0 kvsname=kvs_1151_0_1450155337_dpr740
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_process_mapping
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_hwloc_xmlfile
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_68iqm3 found=TRUE
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=get_result rc=1
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at ampere-altra-2-1] we don't understand the response get_result; forwarding downstream
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=barrier_out
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=barrier_out
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:0 at ampere-altra-2-1] [proxy:1 at dpr740] Sending PMI command:
cmd=barrier_out
Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=barrier_out
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=put kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0 value=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] cached command: -allgather-shm-1-0=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=put_result rc=0
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:0 at ampere-altra-2-1] flushing 1 put command(s) out
[proxy:0 at ampere-altra-2-1] forwarding command upstream:
cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=mput -allgather-shm-1-0=0200937DC0A80101[8]
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=put kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] cached command: -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending PMI command:
cmd=put_result rc=0
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:1 at dpr740] flushing 1 put command(s) out
[proxy:1 at dpr740] forwarding command upstream:
cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending upstream internal PMI command:
cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at dpr740] Sending upstream internal PMI command:
cmd=barrier_in
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=keyval_cache -allgather-shm-1-0=0200937DC0A80101[8] -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=keyval_cache -allgather-shm-1-0=0200937DC0A80101[8] -allgather-shm-1-1=0A00B381[4]FE80[6]526B4BFFFEFC134208[3]
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=barrier_out
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=barrier_out
[proxy:1 at dpr740] Sending PMI command:
cmd=barrier_out
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0
[proxy:1 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=0200937DC0A80101[8] found=TRUE
[proxy:1 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1
[proxy:1 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=barrier_out
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-0
Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=0200937DC0A80101[8] found=TRUE
[proxy:0 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_1151_0_1450155337_dpr740 key=-allgather-shm-1-1
[proxy:0 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=0A00B381[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE
Assertion failed in file src/mpid/ch4/netmod/ofi/init_addrxchg.c at line 151: mapped_table[i] != FI_ADDR_NOTAVAIL
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR
/opt/mpich/lib/libmpi.so.0(+0x58c005) [0x7f967920c005]
/opt/mpich/lib/libmpi.so.0(+0x491858) [0x7f9679111858]
/opt/mpich/lib/libmpi.so.0(+0x55428c) [0x7f96791d428c]
/opt/mpich/lib/libmpi.so.0(+0x53402d) [0x7f96791b402d]
/opt/mpich/lib/libmpi.so.0(+0x4dc71f) [0x7f967915c71f]
/opt/mpich/lib/libmpi.so.0(+0x4df09a) [0x7f967915f09a]
/opt/mpich/lib/libmpi.so.0(+0x3deab6) [0x7f967905eab6]
/opt/mpich/lib/libmpi.so.0(+0x3e0732) [0x7f9679060732]
/opt/mpich/lib/libmpi.so.0(+0x3dd075) [0x7f967905d075]
/opt/mpich/lib/libmpi.so.0(+0x418215) [0x7f9679098215]
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_STDERR
/opt/mpich/lib/libmpi.so.0(+0x4188fa) [0x7f96790988fa]
/opt/mpich/lib/libmpi.so.0(MPI_Init+0x34) [0x7f9678d57594]
./a.out(+0x121a) [0x55b07f1cc21a]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f9678a7cd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f9678a7ce40]
./a.out(+0x1125) [0x55b07f1cc125]
Abort(1) on node 1: Internal error
/opt/mpich/lib/libmpi.so.0(+0x59a0fc) [0xffff91d0a0fc]
/opt/mpich/lib/libmpi.so.0(+0x4a6b58) [0xffff91c16b58]
/opt/mpich/lib/libmpi.so.0(+0x564740) [0xffff91cd4740]
/opt/mpich/lib/libmpi.so.0(+0x546c14) [0xffff91cb6c14]
/opt/mpich/lib/libmpi.so.0(+0x4f70cc) [0xffff91c670cc]
/opt/mpich/lib/libmpi.so.0(+0x4f9850) [0xffff91c69850]
/opt/mpich/lib/libmpi.so.0(+0x3ffd2c) [0xffff91b6fd2c]
/opt/mpich/lib/libmpi.so.0(+0x4017ec) [0xffff91b717ec]
/opt/mpich/lib/libmpi.so.0(+0x3fe384) [0xffff91b6e384]
/opt/mpich/lib/libmpi.so.0(+0x436a64) [0xffff91ba6a64]
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
/opt/mpich/lib/libmpi.so.0(+0x43700c) [0xffff91ba700c]
/opt/mpich/lib/libmpi.so.0(MPI_Init+0x44) [0xffff9189eeb4]
./a.out(+0x9c4) [0xaaaab5c709c4]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0xffff915e73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xffff915e74cc]
./a.out(+0x8b0) [0xaaaab5c708b0]
Abort(1) on node 0: Internal error
[proxy:1 at dpr740] Sending upstream hdr.cmd = CMD_EXIT_STATUS
[proxy:0 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS
From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Saturday, June 15, 2024 at 12:10 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: [mpich-discuss] mpi hello-world error
Hello,
I am trying to run the example hellow.c between 2 servers.
I can run it on each server individually and it works fine.
10.118.91.158 is the machine I am running on.
10.118.91.159 is the remote machine.
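For reference, a minimal program along the lines of examples/hellow.c (a sketch of the stock MPI hello world rather than the exact file; built with mpicc, producing the a.out used below):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size;

    /* Initialize MPI, then query this process's rank and the size of MPI_COMM_WORLD */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello world from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}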
root at dpr740:/mpich/examples# mpirun -n 2 -hosts 10.118.91.158 ./a.out
Hello world from process 0 of 2
Hello world from process 1 of 2
root at dpr740:/mpich/examples# mpirun -n 2 -hosts 10.118.91.159 ./a.out
Hello world from process 1 of 2
Hello world from process 0 of 2
However, when I try to run it across both, I get the error below.
realloc(): invalid pointer
Is this a known issue? Any suggestions?
root at dpr740:/mpich/examples# mpirun -verbose -n 2 -hosts 10.118.91.159,10.118.91.158 ./a.out
host: 10.118.91.159
host: 10.118.91.158
[mpiexec at dpr740] Timeout set to -1 (-1 means infinite)
==================================================================================================
mpiexec options:
----------------
Base path: /opt/mpich/bin/
Launcher: (null)
Debug level: 1
Enable X: -1
Global environment:
-------------------
PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig
HOSTNAME=dpr740
HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233
PWD=/mpich/examples
HOME=/root
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
LESSCLOSE=/usr/bin/lesspipe %s %s
TERM=xterm
LESSOPEN=| /usr/bin/lesspipe %s
SHLVL=1
LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin
_=/opt/mpich/bin/mpirun
OLDPWD=/
Hydra internal environment:
---------------------------
GFORTRAN_UNBUFFERED_PRECONNECTED=y
Proxy information:
*********************
[1] proxy: 10.118.91.159 (1 cores)
Exec list: ./a.out (1 processes);
[2] proxy: 10.118.91.158 (1 cores)
Exec list: ./a.out (1 processes);
==================================================================================================
Proxy launch args: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id
Arguments being passed to proxy 0:
--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.159 --global-core-map 0,1,2 --pmi-id-map 0,0 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_844_0_801938186_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 14 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' '_=/opt/mpich/bin/mpirun' 'OLDPWD=/' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out
Arguments being passed to proxy 1:
--version 4.3.0a1 --iface-ip-env-name MPIR_CVAR_CH3_INTERFACE_HOSTNAME --hostname 10.118.91.158 --global-core-map 0,1,2 --pmi-id-map 0,1 --global-process-count 2 --auto-cleanup 1 --pmi-kvsname kvs_844_0_801938186_dpr740 --pmi-process-mapping (vector,(0,2,1)) --global-inherited-env 14 'PKG_CONFIG_PATH=:/opt/libfabric/lib/pkgconfig:/opt/mpich/lib/pkgconfig' 'HOSTNAME=dpr740' 'HYDRA_LAUNCHER_EXTRA_ARGS=-p 2233' 'PWD=/mpich/examples' 'HOME=/root' 'LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:' 'LESSCLOSE=/usr/bin/lesspipe %s %s' 'TERM=xterm' 'LESSOPEN=| /usr/bin/lesspipe %s' 'SHLVL=1' 'LD_LIBRARY_PATH=:/opt/libfabric/lib:/opt/fabtests/lib:/opt/mpich/lib' 'PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/libfabric/bin/:/opt/fabtest/bin:/opt/mpich/bin' '_=/opt/mpich/bin/mpirun' 'OLDPWD=/' --global-user-env 0 --global-system-env 1 'GFORTRAN_UNBUFFERED_PRECONNECTED=y' --proxy-core-count 1 --exec --exec-appnum 0 --exec-proc-count 1 --exec-local-env 0 --exec-wdir /mpich/examples --exec-args 1 ./a.out
[mpiexec at dpr740] Launch arguments: /opt/mpich/bin/hydra_pmi_proxy --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 0
[mpiexec at dpr740] Launch arguments: /usr/bin/ssh -x -p 2233 10.118.91.158 "/opt/mpich/bin/hydra_pmi_proxy" --control-port 10.118.91.159:33909 --debug --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id 1
[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=init pmi_version=1 pmi_subversion=1
[proxy:0 at dpr740] Sending PMI command:
cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get_maxes
[proxy:0 at dpr740] Sending PMI command:
cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get_appnum
[proxy:0 at dpr740] Sending PMI command:
cmd=appnum rc=0 appnum=0
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get_my_kvsname
[proxy:0 at dpr740] Sending PMI command:
cmd=my_kvsname rc=0 kvsname=kvs_844_0_801938186_dpr740
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_process_mapping
[proxy:0 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_hwloc_xmlfile
[proxy:0 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_CeNRJN found=TRUE
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:0 at dpr740] Sending upstream internal PMI command:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=get_result rc=1
[proxy:0 at dpr740] we don't understand the response get_result; forwarding downstream
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PID_LIST
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=init pmi_version=1 pmi_subversion=1
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_maxes
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=maxes rc=0 kvsname_max=256 keylen_max=64 vallen_max=1024
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_appnum
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=appnum rc=0 appnum=0
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get_my_kvsname
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=my_kvsname rc=0 kvsname=kvs_844_0_801938186_dpr740
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_process_mapping
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=(vector,(0,2,1)) found=TRUE
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_hwloc_xmlfile
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=/tmp/hydra_hwloc_xmlfile_xv8EIG found=TRUE
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:0 at dpr740] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=get_result rc=1
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=PMI_mpi_memory_alloc_kinds
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at ampere-altra-2-1] we don't understand the response get_result; forwarding downstream
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=barrier_out
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=barrier_out
[proxy:0 at dpr740] Sending PMI command:
cmd=barrier_out
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=barrier_out
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=put kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:0 at dpr740] [proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=put kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1 value=0200A8BFC0A80101[8]
cached command: -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:0 at dpr740] Sending PMI command:
cmd=put_result rc=0
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:0 at dpr740] flushing 1 put command(s) out
[proxy:0 at dpr740] forwarding command upstream:
cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:0 at dpr740] Sending upstream internal PMI command:
cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]
[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at ampere-altra-2-1] cached command: -allgather-shm-1-1=0200A8BFC0A80101[8]
[proxy:1 at ampere-altra-2-1] Sending PMI command:
[proxy:0 at dpr740] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:0 at dpr740] Sending upstream hdr.cmd = CMD_PMI
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3]
cmd=put_result rc=0
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=barrier_in
[proxy:1 at ampere-altra-2-1] flushing 1 put command(s) out
[proxy:1 at ampere-altra-2-1] forwarding command upstream:
[mpiexec at dpr740] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=keyval_cache -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] -allgather-shm-1-1=0200A8BFC0A80101[8]
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=keyval_cache -allgather-shm-1-0=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] -allgather-shm-1-1=0200A8BFC0A80101[8]
[mpiexec at dpr740] Sending internal PMI command (proxy:0:0):
cmd=barrier_out
[mpiexec at dpr740] Sending internal PMI command (proxy:0:1):
cmd=barrier_out
[proxy:0 at dpr740] Sending PMI command:
cmd=barrier_out
cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]
[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=mput -allgather-shm-1-1=0200A8BFC0A80101[8]
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:1 at ampere-altra-2-1] Sending upstream internal PMI command:
cmd=barrier_in
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_PMI
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0
[proxy:0 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE
[proxy:0 at dpr740] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1
[proxy:0 at dpr740] Sending PMI command:
cmd=get_result rc=0 value=0200A8BFC0A80101[8] found=TRUE
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=barrier_out
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-0
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=0A00812D[4]FE80[6]526B4BFFFEFC134208[3] found=TRUE
[proxy:1 at ampere-altra-2-1] got pmi command from downstream 0-0:
cmd=get kvsname=kvs_844_0_801938186_dpr740 key=-allgather-shm-1-1
[proxy:1 at ampere-altra-2-1] Sending PMI command:
cmd=get_result rc=0 value=0200A8BFC0A80101[8] found=TRUE
realloc(): invalid pointer
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_STDERR
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 2404 RUNNING AT 10.118.91.158
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:1 at ampere-altra-2-1] Sending upstream hdr.cmd = CMD_EXIT_STATUS
[proxy:0 at dpr740] HYD_pmcd_pmip_control_cmd_cb (proxy/pmip_cb.c:484): assert (!closed) failed
[proxy:0 at dpr740] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0 at dpr740] main (proxy/pmip.c:122): demux engine error waiting for event
[mpiexec at dpr740] HYDT_bscu_wait_for_completion (lib/tools/bootstrap/utils/bscu_wait.c:109): one of the processes terminated badly; aborting
[mpiexec at dpr740] HYDT_bsci_wait_for_completion (lib/tools/bootstrap/src/bsci_wait.c:21): launcher returned error waiting for completion
[mpiexec at dpr740] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:189): launcher returned error waiting for completion
[mpiexec at dpr740] main (mpiexec/mpiexec.c:260): process manager error waiting for completion
Regards,
Niyaz