[mpich-discuss] Custom rank for processes
Zhou, Hui
zhouh at anl.gov
Fri Jul 5 09:18:37 CDT 2024
Niyaz,
I don't know what prevents you from using shared memory. In the meantime, if you want to turn off shared memory, you could try configuring MPICH with --without-ch4-shmmods.
--
Hui
________________________________
From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Friday, July 5, 2024 9:14 AM
To: discuss at mpich.org <discuss at mpich.org>; Zhou, Hui <zhouh at anl.gov>; Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes
Hi Hui,
Any suggestion on this ?
Regards,
Niyaz
From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Tuesday, July 2, 2024 at 10:48 AM
To: Zhou, Hui <zhouh at anl.gov>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: Re: [mpich-discuss] Custom rank for processes
I deleted and tried again.
root at ampere-altra-2-1:/# cd /dev/shm/
root at ampere-altra-2-1:/dev/shm# ls -l
total 24
-rw------- 1 root root 256 Jul 2 15:07 mpich_shar_tmp7LFvuU
-rw------- 1 root root 256 Jul 2 02:01 mpich_shar_tmpDXxqHv
-rw------- 1 root root 256 Jul 2 15:07 mpich_shar_tmpEivSSF
-rw------- 1 root root 192 Jul 2 15:07 mpich_shar_tmpXaqJui
-rw------- 1 root root 192 Jul 2 02:01 mpich_shar_tmpg2Rank
-rw------- 1 root root 192 Jul 2 15:07 mpich_shar_tmpxwlNrm
root at ampere-altra-2-1:/dev/shm# rm -rf *
root at ampere-altra-2-1:/dev/shm# mpirun -np 5 -rankmap 1,1,0,0,0 -hosts 192.168.2.100,192.168.2.200 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world
Abort(678028815) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(638).....:
MPIDU_Init_shm_init(179)......: unable to allocate shared memory
Abort(678028815) on node 3: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(638).....:
MPIDU_Init_shm_init(179)......: unable to allocate shared memory
^C[mpiexec at ampere-altra-2-1] Sending Ctrl-C to processes as requested
[mpiexec at ampere-altra-2-1] Press Ctrl-C again to force abort
[mpiexec at ampere-altra-2-1] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)
[mpiexec at ampere-altra-2-1] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error
[mpiexec at ampere-altra-2-1] HYD_pmcd_pmiserv_send_signal (mpiexec/pmiserv_cb.c:218): unable to write data to proxy
[mpiexec at ampere-altra-2-1] ui_cmd_cb (mpiexec/pmiserv_pmci.c:61): unable to send signal downstream
[mpiexec at ampere-altra-2-1] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at ampere-altra-2-1] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event
[mpiexec at ampere-altra-2-1] main (mpiexec/mpiexec.c:260): process manager error waiting for completion
root at ampere-altra-2-1:/dev/shm# ls -l
total 8
-rw------- 1 root root 256 Jul 2 15:46 mpich_shar_tmp0eFIGw
-rw------- 1 root root 192 Jul 2 15:46 mpich_shar_tmpd4W7pC
From: Zhou, Hui <zhouh at anl.gov>
Date: Tuesday, July 2, 2024 at 10:20 AM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes
Not sure why it can't create shared memory. Do you have /dev/shm? Is it full?
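The two questions above (does /dev/shm exist, and is it full?) can be checked with a short script. This is only a sketch, not part of MPICH: the function name shm_status and the fields it reports are made up for illustration, and it inspects the filesystem only, so it cannot see other causes such as container limits.

```python
import os
import shutil


def shm_status(path="/dev/shm"):
    """Report whether a shared-memory mount exists and how much space it has.

    Hypothetical helper for illustration; it only looks at the filesystem.
    """
    if not os.path.isdir(path):
        return {"exists": False}
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "exists": True,
        "writable": os.access(path, os.W_OK),
        "total_bytes": usage.total,
        "free_bytes": usage.free,
    }


if __name__ == "__main__":
    print(shm_status())
```

A very small total_bytes value, or a free_bytes value close to zero, would be one possible explanation for the "unable to allocate shared memory" error.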
--
Hui
________________________________
From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Tuesday, July 2, 2024 10:09 AM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>; Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes
Tried both ways but gives error.
root at ampere-altra-2-1:/# mpirun -np 5 -rankmap '(vector,1,1,0,0,0)' -hosts 192.168.2.100,192.168.2.200 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world
Abort(678028815) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(638).....:
MPIDU_Init_shm_init(179)......: unable to allocate shared memory
Abort(678028815) on node 3: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(638).....:
MPIDU_Init_shm_init(179)......: unable to allocate shared memory
^C[mpiexec at ampere-altra-2-1] Sending Ctrl-C to processes as requested
[mpiexec at ampere-altra-2-1] Press Ctrl-C again to force abort
[mpiexec at ampere-altra-2-1] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)
[mpiexec at ampere-altra-2-1] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error
[mpiexec at ampere-altra-2-1] HYD_pmcd_pmiserv_send_signal (mpiexec/pmiserv_cb.c:218): unable to write data to proxy
[mpiexec at ampere-altra-2-1] ui_cmd_cb (mpiexec/pmiserv_pmci.c:61): unable to send signal downstream
[mpiexec at ampere-altra-2-1] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at ampere-altra-2-1] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event
[mpiexec at ampere-altra-2-1] main (mpiexec/mpiexec.c:260): process manager error waiting for completion
root at ampere-altra-2-1:/# mpirun -np 5 -rankmap 1,1,0,0,0 -hosts 192.168.2.100,192.168.2.200 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world
Abort(678028815) on node 1: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(638).....:
MPIDU_Init_shm_init(179)......: unable to allocate shared memory
Abort(678028815) on node 3: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(268).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(638).....:
MPIDU_Init_shm_init(179)......: unable to allocate shared memory
^C[mpiexec at ampere-altra-2-1] Sending Ctrl-C to processes as requested
[mpiexec at ampere-altra-2-1] Press Ctrl-C again to force abort
[mpiexec at ampere-altra-2-1] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)
[mpiexec at ampere-altra-2-1] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error
[mpiexec at ampere-altra-2-1] HYD_pmcd_pmiserv_send_signal (mpiexec/pmiserv_cb.c:218): unable to write data to proxy
[mpiexec at ampere-altra-2-1] ui_cmd_cb (mpiexec/pmiserv_pmci.c:61): unable to send signal downstream
[mpiexec at ampere-altra-2-1] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at ampere-altra-2-1] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event
[mpiexec at ampere-altra-2-1] main (mpiexec/mpiexec.c:260): process manager error waiting for completion
From: Zhou, Hui <zhouh at anl.gov>
Date: Monday, July 1, 2024 at 9:39 PM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes
Use quotes. Try -rankmap '(vector,1,1,0,0,0)'.
Actually, the vector part is optional, so this should also work: -rankmap 1,1,0,0,0.
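The quoting is needed because bash treats an unquoted "(" as shell syntax. If the command line is generated from a script, the quoting can be left to the standard library. The sketch below assumes Python 3.8 or later (for shlex.join); the helper name build_mpirun_command is made up for illustration, and the arguments are the ones used in this thread.

```python
import shlex


def build_mpirun_command(nprocs, rankmap, hosts, program):
    """Return a shell-safe mpirun command line (hypothetical helper)."""
    argv = [
        "mpirun",
        "-np", str(nprocs),
        "-rankmap", rankmap,
        "-hosts", ",".join(hosts),
        program,
    ]
    # shlex.join quotes only the arguments that need it, such as the
    # parenthesised rankmap.
    return shlex.join(argv)


if __name__ == "__main__":
    print(build_mpirun_command(
        5,
        "(vector,1,1,0,0,0)",
        ["192.168.2.100", "192.168.2.200"],
        "/mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world",
    ))
```

The printed command has the rankmap wrapped in single quotes, which is the form suggested above.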
--
Hui
________________________________
From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Monday, July 1, 2024 9:11 PM
To: discuss at mpich.org <discuss at mpich.org>; Zhou, Hui <zhouh at anl.gov>; Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes
I get the error below when trying. Am I doing it wrong?
root at ampere-altra-2-1:/# mpirun -np 5 -hosts 192.168.2.100,192.168.2.200 -rankmap (vector,1,1,0,0,0) /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world
bash: syntax error near unexpected token `('
From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Monday, July 1, 2024 at 4:42 PM
To: Zhou, Hui <zhouh at anl.gov>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: Re: [mpich-discuss] Custom rank for processes
Thank you so much Hui. Really appreciate it. I now understand it.
From: Zhou, Hui <zhouh at anl.gov>
Date: Monday, July 1, 2024 at 4:37 PM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes
Yes, you can use rankmap to simply list out the node assignment for each rank. "-rankmap (vector,1,1,0,0,0)" is a list of 5 node ids, one for each rank. So rank 0 gets node 1, rank 1 gets node 1, rank 2 gets node 0, and so on.
The -hosts option is convenient if you have a somewhat uniform assignment. If you want an arbitrary assignment, just use the -rankmap option.
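To make the reading of the list concrete, here is a small sketch that groups ranks by node id. It is not MPICH code: the function name ranks_by_node is made up, and it handles only the flat list form used in this thread (with or without the "vector" prefix).

```python
from collections import defaultdict


def ranks_by_node(rankmap):
    """Group rank numbers by node id for a flat rankmap list.

    Hypothetical helper; accepts "(vector,1,1,0,0,0)" or "1,1,0,0,0".
    """
    body = rankmap.strip()
    if body.startswith("(") and body.endswith(")"):
        body = body[1:-1]
    fields = [f.strip() for f in body.split(",")]
    if fields and fields[0] == "vector":
        fields = fields[1:]
    nodes = defaultdict(list)
    # Position in the list is the rank; the value is the node id.
    for rank, node_id in enumerate(int(f) for f in fields):
        nodes[node_id].append(rank)
    return dict(nodes)


if __name__ == "__main__":
    print(ranks_by_node("(vector,1,1,0,0,0)"))  # {1: [0, 1], 0: [2, 3, 4]}
```

So node 1 hosts ranks 0 and 1, and node 0 hosts ranks 2, 3 and 4.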
--
Hui
________________________________
From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Monday, July 1, 2024 4:30 PM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>; Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes
Sorry to bother you again with silly questions.
How do I read the below :
-rankmap (vector,1,1,0,0,0)
I have 5 processes. Does 1 1 mean node1 will get the first 2 processes?
-hosts 192.168.2.100:2,192.168.2.200:3 : this will put 2 processes on 100 and 3 on 200, but I still won’t be able to specify which rank goes to which node. Rank0 will be on node 100. What if I want Rank0 on node 200?
From: Zhou, Hui <zhouh at anl.gov>
Date: Monday, July 1, 2024 at 4:09 PM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes
> root at ampere-altra-2-1:/# mpirun -n 5 -bind-to user:10,11,12,13 -hosts 192.168.2.200,192.168.2.100 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world
Hello world from processor ampere-altra-2-1, rank 1 out of 5 processors
Hello world from processor ampere-altra-2-1, rank 3 out of 5 processors
Hello world from processor dpr740, rank 0 out of 5 processors
Hello world from processor dpr740, rank 4 out of 5 processors
Hello world from processor dpr740, rank 2 out of 5 processors
> -bind-to user:10,11,12,13
> This would mean on host 192.168.2.100
> P0=>10 , P2=>11
> This would mean on host 192.168.2.200
> P0=>10 , P2=>11, P3=>12
> Is this understanding correct? Is it also possible to say which rank will be pinned to which core?
Yes, that is correct. The ranks are assigned to hosts as shown in the hello world output.
> About the rankmap: I am trying to understand if I can select where a particular rank would be from the list of hosts. Currently, the first host in the list always gets rank0.
> Can I specify the below ranks?
> mpirun -n 5 -bind-to user:10,11,12,13 -hosts 192.168.2.200,192.168.2.100 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world
>
> Hello world from processor ampere-altra-2-1, rank 1 out of 5 processors => rank0
> Hello world from processor ampere-altra-2-1, rank 3 out of 5 processors. => rank1
> Hello world from processor dpr740, rank 0 out of 5 processors =>rank2
> Hello world from processor dpr740, rank 4 out of 5 processors =>rank3
> Hello world from processor dpr740, rank 2 out of 5 processors =>rank4
Yes. You can use "-rankmap (vector,1,1,0,0,0)". Alternatively, you can use "-hosts 192.168.2.100:2,192.168.2.200:3", the colon syntax specifies how many processes you want to assign to each host.
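As an illustration of the colon syntax, the sketch below expands a -hosts string into (rank, host) pairs. It is not taken from MPICH: the function name assign_ranks is made up, it assumes ranks fill the hosts in the order listed, and it does not model what happens when the process count exceeds the listed slots.

```python
def assign_ranks(hosts_spec):
    """Expand "host:count,host:count" into a list of (rank, host) pairs.

    Hypothetical helper; a host without ":count" is assumed to take one process.
    """
    assignment = []
    rank = 0
    for entry in hosts_spec.split(","):
        host, sep, count = entry.partition(":")
        slots = int(count) if sep else 1
        for _ in range(slots):
            assignment.append((rank, host))
            rank += 1
    return assignment


if __name__ == "__main__":
    for rank, host in assign_ranks("192.168.2.100:2,192.168.2.200:3"):
        print(f"rank {rank} -> {host}")
```

Under that assumption, ranks 0 and 1 land on 192.168.2.100 and ranks 2 to 4 on 192.168.2.200.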
--
Hui