[mpich-discuss] Custom rank for processes

Zhou, Hui zhouh at anl.gov
Fri Jul 5 09:18:37 CDT 2024


Niyaz,

I don't know what prevents you from using shared memory. But in the meantime, if you want to turn off shared memory, you could try configuring MPICH with --without-ch4-shmmods.
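
For reference, a minimal sketch of what that rebuild might look like (the install prefix here is a placeholder, not from this thread):

./configure --prefix=/opt/mpich-noshm --without-ch4-shmmods
make -j && make install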

--
Hui
________________________________
From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Friday, July 5, 2024 9:14 AM
To: discuss at mpich.org <discuss at mpich.org>; Zhou, Hui <zhouh at anl.gov>; Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes


Hi Hui,



Any suggestion on this ?



Regards,

Niyaz



From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Tuesday, July 2, 2024 at 10:48 AM
To: Zhou, Hui <zhouh at anl.gov>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: Re: [mpich-discuss] Custom rank for processes




I deleted and tried again.



root@ampere-altra-2-1:/# cd /dev/shm/

root@ampere-altra-2-1:/dev/shm# ls -l

total 24

-rw------- 1 root root 256 Jul  2 15:07 mpich_shar_tmp7LFvuU

-rw------- 1 root root 256 Jul  2 02:01 mpich_shar_tmpDXxqHv

-rw------- 1 root root 256 Jul  2 15:07 mpich_shar_tmpEivSSF

-rw------- 1 root root 192 Jul  2 15:07 mpich_shar_tmpXaqJui

-rw------- 1 root root 192 Jul  2 02:01 mpich_shar_tmpg2Rank

-rw------- 1 root root 192 Jul  2 15:07 mpich_shar_tmpxwlNrm

root@ampere-altra-2-1:/dev/shm# rm -rf *





root@ampere-altra-2-1:/dev/shm# mpirun -np 5  -rankmap 1,1,0,0,0   -hosts 192.168.2.100,192.168.2.200 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world

Abort(678028815) on node 1: Fatal error in internal_Init: Other MPI error, error stack:

internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed

MPII_Init_thread(268).........:

MPIR_init_comm_world(34)......:

MPIR_Comm_commit(800).........:

MPIR_Comm_commit_internal(585):

MPID_Comm_commit_pre_hook(151):

MPIDI_world_pre_init(638).....:

MPIDU_Init_shm_init(179)......: unable to allocate shared memory

Abort(678028815) on node 3: Fatal error in internal_Init: Other MPI error, error stack:

internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed

MPII_Init_thread(268).........:

MPIR_init_comm_world(34)......:

MPIR_Comm_commit(800).........:

MPIR_Comm_commit_internal(585):

MPID_Comm_commit_pre_hook(151):

MPIDI_world_pre_init(638).....:

MPIDU_Init_shm_init(179)......: unable to allocate shared memory

^C[mpiexec@ampere-altra-2-1] Sending Ctrl-C to processes as requested

[mpiexec@ampere-altra-2-1] Press Ctrl-C again to force abort

[mpiexec@ampere-altra-2-1] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)

[mpiexec@ampere-altra-2-1] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error

[mpiexec@ampere-altra-2-1] HYD_pmcd_pmiserv_send_signal (mpiexec/pmiserv_cb.c:218): unable to write data to proxy

[mpiexec@ampere-altra-2-1] ui_cmd_cb (mpiexec/pmiserv_pmci.c:61): unable to send signal downstream

[mpiexec@ampere-altra-2-1] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status

[mpiexec@ampere-altra-2-1] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event

[mpiexec@ampere-altra-2-1] main (mpiexec/mpiexec.c:260): process manager error waiting for completion





root@ampere-altra-2-1:/dev/shm# ls -l

total 8

-rw------- 1 root root 256 Jul  2 15:46 mpich_shar_tmp0eFIGw

-rw------- 1 root root 192 Jul  2 15:46 mpich_shar_tmpd4W7pC





From: Zhou, Hui <zhouh at anl.gov>
Date: Tuesday, July 2, 2024 at 10:20 AM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes

Not sure why it can't create shared memory. Do you have /dev/shm? Is it full?
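
A few quick checks, as a sketch (standard Linux commands, nothing MPICH-specific):

df -h /dev/shm          # size and current usage of the tmpfs backing POSIX shared memory
ls -l /dev/shm          # any leftover mpich_shar_tmp* files from earlier runs
grep shm /proc/mounts   # confirm /dev/shm is actually mounted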

--

Hui

________________________________

From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Tuesday, July 2, 2024 10:09 AM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>; Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes




Tried both ways but gives error.





root@ampere-altra-2-1:/# mpirun -np 5  -rankmap '(vector,1,1,0,0,0)'   -hosts 192.168.2.100,192.168.2.200 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world

Abort(678028815) on node 1: Fatal error in internal_Init: Other MPI error, error stack:

internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed

MPII_Init_thread(268).........:

MPIR_init_comm_world(34)......:

MPIR_Comm_commit(800).........:

MPIR_Comm_commit_internal(585):

MPID_Comm_commit_pre_hook(151):

MPIDI_world_pre_init(638).....:

MPIDU_Init_shm_init(179)......: unable to allocate shared memory

Abort(678028815) on node 3: Fatal error in internal_Init: Other MPI error, error stack:

internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed

MPII_Init_thread(268).........:

MPIR_init_comm_world(34)......:

MPIR_Comm_commit(800).........:

MPIR_Comm_commit_internal(585):

MPID_Comm_commit_pre_hook(151):

MPIDI_world_pre_init(638).....:

MPIDU_Init_shm_init(179)......: unable to allocate shared memory





^C[mpiexec@ampere-altra-2-1] Sending Ctrl-C to processes as requested

[mpiexec@ampere-altra-2-1] Press Ctrl-C again to force abort

[mpiexec@ampere-altra-2-1] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)

[mpiexec@ampere-altra-2-1] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error

[mpiexec@ampere-altra-2-1] HYD_pmcd_pmiserv_send_signal (mpiexec/pmiserv_cb.c:218): unable to write data to proxy

[mpiexec@ampere-altra-2-1] ui_cmd_cb (mpiexec/pmiserv_pmci.c:61): unable to send signal downstream

[mpiexec@ampere-altra-2-1] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status

[mpiexec@ampere-altra-2-1] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event

[mpiexec@ampere-altra-2-1] main (mpiexec/mpiexec.c:260): process manager error waiting for completion







root@ampere-altra-2-1:/# mpirun -np 5  -rankmap 1,1,0,0,0   -hosts 192.168.2.100,192.168.2.200 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world

Abort(678028815) on node 1: Fatal error in internal_Init: Other MPI error, error stack:

internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed

MPII_Init_thread(268).........:

MPIR_init_comm_world(34)......:

MPIR_Comm_commit(800).........:

MPIR_Comm_commit_internal(585):

MPID_Comm_commit_pre_hook(151):

MPIDI_world_pre_init(638).....:

MPIDU_Init_shm_init(179)......: unable to allocate shared memory

Abort(678028815) on node 3: Fatal error in internal_Init: Other MPI error, error stack:

internal_Init(70).............: MPI_Init(argc=(nil), argv=(nil)) failed

MPII_Init_thread(268).........:

MPIR_init_comm_world(34)......:

MPIR_Comm_commit(800).........:

MPIR_Comm_commit_internal(585):

MPID_Comm_commit_pre_hook(151):

MPIDI_world_pre_init(638).....:

MPIDU_Init_shm_init(179)......: unable to allocate shared memory

^C[mpiexec@ampere-altra-2-1] Sending Ctrl-C to processes as requested

[mpiexec@ampere-altra-2-1] Press Ctrl-C again to force abort

[mpiexec@ampere-altra-2-1] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)

[mpiexec@ampere-altra-2-1] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error

[mpiexec@ampere-altra-2-1] HYD_pmcd_pmiserv_send_signal (mpiexec/pmiserv_cb.c:218): unable to write data to proxy

[mpiexec@ampere-altra-2-1] ui_cmd_cb (mpiexec/pmiserv_pmci.c:61): unable to send signal downstream

[mpiexec@ampere-altra-2-1] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status

[mpiexec@ampere-altra-2-1] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event

[mpiexec@ampere-altra-2-1] main (mpiexec/mpiexec.c:260): process manager error waiting for completion









From: Zhou, Hui <zhouh at anl.gov>
Date: Monday, July 1, 2024 at 9:39 PM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes

Use quotes. Try -rankmap '(vector,1,1,0,0,0)'.



Actually, the vector part is optional, so this should also work: -rankmap 1,1,0,0,0.
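
As a sketch of the full command line (host list and program path copied from your earlier attempt), either of these should get past the shell:

mpirun -np 5 -rankmap '(vector,1,1,0,0,0)' -hosts 192.168.2.100,192.168.2.200 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world
mpirun -np 5 -rankmap 1,1,0,0,0 -hosts 192.168.2.100,192.168.2.200 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world

The quotes matter because bash otherwise tries to interpret the parentheses itself, which is what produced the "syntax error near unexpected token `('" below.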





--

Hui

________________________________

From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Monday, July 1, 2024 9:11 PM
To: discuss at mpich.org <discuss at mpich.org>; Zhou, Hui <zhouh at anl.gov>; Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes




I get the below error when trying. Am I trying it wrong?





root@ampere-altra-2-1:/# mpirun   -np 5  -hosts 192.168.2.100,192.168.2.200   -rankmap (vector,1,1,0,0,0)  /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world

bash: syntax error near unexpected token `('





From: Niyaz Murshed via discuss <discuss at mpich.org>
Date: Monday, July 1, 2024 at 4:42 PM
To: Zhou, Hui <zhouh at anl.gov>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: Niyaz Murshed <Niyaz.Murshed at arm.com>, nd <nd at arm.com>
Subject: Re: [mpich-discuss] Custom rank for processes


Thank you so much Hui. Really appreciate it. I now understand it.



From: Zhou, Hui <zhouh at anl.gov>
Date: Monday, July 1, 2024 at 4:37 PM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes

Yes, you can use rankmap to simply list out the node assignment for each rank. "-rankmap (vector,1,1,0,0,0)" is a list of 5 node ids, one for each rank. So rank 0 gets node 1, rank 1 gets node 1, rank 2 gets node 0, and so on.
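
Read it positionally, something like this (a sketch; hostA and hostB are placeholders, and the node ids index into the -hosts list in the order it was given):

-hosts hostA,hostB -rankmap '(vector,1,1,0,0,0)'
# node 0 = hostA, node 1 = hostB
# rank 0 -> node 1 (hostB)
# rank 1 -> node 1 (hostB)
# rank 2 -> node 0 (hostA)
# rank 3 -> node 0 (hostA)
# rank 4 -> node 0 (hostA)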



The -hosts option is convenient if you have a somewhat uniform assignment. If you want an arbitrary assignment, just use the -rankmap option.



--

Hui

________________________________

From: Niyaz Murshed <Niyaz.Murshed at arm.com>
Sent: Monday, July 1, 2024 4:30 PM
To: Zhou, Hui <zhouh at anl.gov>; discuss at mpich.org <discuss at mpich.org>; Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes




Sorry to bother you again with silly questions.



How do I read the below :

-rankmap (vector,1,1,0,0,0)



I have 5 processes. 1 1 => means node 1 will get the first 2 processes?



-hosts 192.168.2.100:2,192.168.2.200:3 : this will give 2 processes on 100 and 3 on 200. Still, I won't be able to specify which rank goes to which node. Rank 0 will be on node 100. What if I want rank 0 on node 200?





From: Zhou, Hui <zhouh at anl.gov>
Date: Monday, July 1, 2024 at 4:09 PM
To: Niyaz Murshed <Niyaz.Murshed at arm.com>, discuss at mpich.org <discuss at mpich.org>, Jenke, Joachim <jenke at itc.rwth-aachen.de>
Cc: nd <nd at arm.com>
Subject: Re: Custom rank for processes

> root@ampere-altra-2-1:/# mpirun -n 5   -bind-to user:10,11,12,13 -hosts 192.168.2.200,192.168.2.100 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world

Hello world from processor ampere-altra-2-1, rank 1 out of 5 processors

Hello world from processor ampere-altra-2-1, rank 3 out of 5 processors

Hello world from processor dpr740, rank 0 out of 5 processors

Hello world from processor dpr740, rank 4 out of 5 processors

Hello world from processor dpr740, rank 2 out of 5 processors



> -bind-to user:10,11,12,13

> This would mean on host 192.168.2.100

> P0=>10 , P2=>11

> This would mean on host 192.168.2.200

> P0=>10 , P2=>11, P3=12

> Is this a correct understanding? Is it also possible to say which rank will be pinned to which core?

Yes, that is correct. The ranks are assigned to hosts as shown in the hello world output.





> About the rankmap: trying to understand if I can select where a particular rank would be from the list of hosts. Currently, the first host in the list always gets rank 0.

> Can I specify the below ranks?

> mpirun -n 5   -bind-to user:10,11,12,13 -hosts 192.168.2.200,192.168.2.100 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world

>

> Hello world from processor ampere-altra-2-1, rank 1 out of 5 processors => rank0

> Hello world from processor ampere-altra-2-1, rank 3 out of 5 processors. => rank1

> Hello world from processor dpr740, rank 0 out of 5 processors           =>rank2

> Hello world from processor dpr740, rank 4 out of 5 processors           =>rank3

> Hello world from processor dpr740, rank 2 out of 5 processors           =>rank4

Yes. You can use "-rankmap (vector,1,1,0,0,0)". Alternatively, you can use "-hosts 192.168.2.100:2,192.168.2.200:3"; the colon syntax specifies how many processes you want to assign to each host.
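
To make that concrete (a sketch, assuming node ids follow the order of the -hosts list), both of these should place ranks 0 and 1 on 192.168.2.100 and ranks 2, 3, 4 on 192.168.2.200:

mpirun -np 5 -hosts 192.168.2.200,192.168.2.100 -rankmap '(vector,1,1,0,0,0)' /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world
mpirun -np 5 -hosts 192.168.2.100:2,192.168.2.200:3 /mpitutorial/tutorials/mpi-hello-world/code/mpi_hello_world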



--

Hui

