[mpich-discuss] Maximum number of communicators

Zhou, Hui zhouh at anl.gov
Wed Oct 9 14:52:33 CDT 2024


Hi Bruce,

Could you create a github issue for this?

Hui
________________________________
From: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Sent: Wednesday, October 9, 2024 1:37 PM
To: discuss at mpich.org <discuss at mpich.org>; Zhou, Hui <zhouh at anl.gov>
Subject: Re: Maximum number of communicators


I may have spoken too soon. This fixed the issue on one SMP node using 24 processes. When I increase the number of nodes to 4 with 96 processes, I can’t even get past MPI_Init. I’m seeing the following in standard out:



===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 483326 RUNNING AT j006
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions



I get the following in standard error:



Abort(613541135) on node 65: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(49162).............: MPI_Init(argc=0x7ffff3c5646c, argv=0x7ffff3c56460) failed
MPII_Init_thread(265)............:
MPIR_init_comm_world(34).........:
MPIR_Comm_commit(823)............:
MPID_Comm_commit_post_hook(222)..:
MPIDI_world_post_init(660).......:
MPIDI_OFI_init_vcis(842).........:
check_num_nics(891)..............:
MPIR_Allreduce_allcomm_auto(4726):
MPIC_Sendrecv(302)...............:
MPID_Isend(63)...................:
MPIDI_isend(35)..................:
MPIDI_OFI_send_fallback(549).....: OFI call tsendv failed (ofi_send.h:549:MPIDI_OFI_send_fallback:No such file or directory)



Bruce



From: Palmer, Bruce J via discuss <discuss at mpich.org>
Date: Monday, October 7, 2024 at 1:14 PM
To: Zhou, Hui <zhouh at anl.gov>, discuss at mpich.org <discuss at mpich.org>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Subject: Re: [mpich-discuss] Maximum number of communicators


That seemed to fix the issue.

Thanks!

Bruce



From: Zhou, Hui <zhouh at anl.gov>
Date: Thursday, October 3, 2024 at 11:01 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Subject: Re: Maximum number of communicators

Hi Bruce,



Try configuring MPICH with --with-device=ch4:ofi --enable-extended-context-bits.
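
For reference, a full rebuild along those lines might look like the sketch below; the source directory, install prefix, and -j value are placeholders, and only the two configure options above come from this thread:

    # Hypothetical rebuild sketch -- source path and prefix are assumptions
    cd mpich-4.2.x
    ./configure --with-device=ch4:ofi --enable-extended-context-bits \
                --prefix=$HOME/sw/mpich-extctx
    make -j 8 && make install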



Hui

________________________________

From: Palmer, Bruce J via discuss <discuss at mpich.org>
Sent: Thursday, October 3, 2024 11:44 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Palmer, Bruce J <Bruce.Palmer at pnnl.gov>
Subject: [mpich-discuss] Maximum number of communicators




Hi,



I’m looking at using MPI RMA to support sparse data structures in Global Arrays. I’ve got an application that uses a large number of sparse arrays, and it fails once the number of sparse arrays reaches about 500. Each sparse array is built on top of 4 conventional global arrays, and each global array uses one MPI Window. Each Window appears to be creating its own communicator, and I’m hitting an internal limit of 2048 communicators. Is there a way to increase the number of communicators?
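
A minimal sketch of the pattern described above, for illustration only (the buffer sizes, counts, and structure are assumptions, not the Global Arrays code): each MPI_Comm_dup and each MPI_Win_create consumes a context ID, so a few thousand window/communicator pairs exhaust the default 2048-communicator limit.

    /* Hypothetical sketch, not the Global Arrays code.  Each "global array"
     * gets its own duplicated communicator and one window; a sparse array
     * uses four of them.  Every dup, and every window (which creates its
     * own communicator internally), consumes a context ID, so this loop
     * fails well before NARRAYS under the default 2048-communicator limit. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NARRAYS 600              /* failure reportedly starts around 500 sparse arrays */
    #define WINS_PER_ARRAY 4

    int main(int argc, char **argv)
    {
        static MPI_Comm comm[NARRAYS * WINS_PER_ARRAY];
        static MPI_Win  win[NARRAYS * WINS_PER_ARRAY];
        static void    *buf[NARRAYS * WINS_PER_ARRAY];

        MPI_Init(&argc, &argv);
        for (int i = 0; i < NARRAYS * WINS_PER_ARRAY; i++) {
            MPI_Comm_dup(MPI_COMM_WORLD, &comm[i]);          /* one context ID */
            buf[i] = malloc(1024);
            MPI_Win_create(buf[i], 1024, 1, MPI_INFO_NULL,
                           comm[i], &win[i]);                /* another communicator/context ID */
        }
        /* cleanup (MPI_Win_free / MPI_Comm_free / free) omitted for brevity */
        MPI_Finalize();
        return 0;
    }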



Bruce Palmer