From solomon2 at illinois.edu Thu Jun 8 15:36:56 2023
From: solomon2 at illinois.edu (Solomonik, Edgar)
Date: Thu, 8 Jun 2023 20:36:56 +0000
Subject: [mpich-discuss] MPI Reduce with MPI_IN_PLACE fails with non-0 root rank for message sizes over 256 with MPI version 4 and after

Hello,

Our library's autobuild (CTF, which uses MPI extensively and in relatively sophisticated ways) started failing on multiple architectures after GitHub workflows moved to later OS versions (and so later MPI versions). I believe I have narrowed the issue down to an MPI bug triggered by very basic usage of MPI_Reduce. The following test code runs into a segmentation fault inside MPI when run with 2 MPI processes with the latest Ubuntu MPI build and MPI 4.0. It works for smaller values of the message size (n) or if the root is rank 0. The usage of MPI_IN_PLACE adheres to the MPI standard.

Best,
Edgar Solomonik

#include
#include
int main(int argc, char ** argv){
  int64_t n = 257;
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  double * A = (double*)malloc(sizeof(double)*n);
  for (int i=0; i
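[Note: the listing above was mangled by the list archive -- the angle-bracket includes were stripped and the program is cut off at the first '<'. A minimal reconstruction consistent with the description (two processes, n > 256, MPI_IN_PLACE at a non-zero root) is sketched below; the header choices, the fill values, the use of MPI_SUM, and the choice of rank 1 as the root are assumptions, not the original code.]

/* Hypothetical reconstruction of the truncated test program above.
 * Assumed details: headers, fill values, MPI_SUM as the op, root = 1. */
#include <stdint.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char ** argv) {
    int64_t n = 257;                  /* reported to fail for n > 256 */
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double * A = (double*)malloc(sizeof(double)*n);
    for (int i = 0; i < n; i++)
        A[i] = (double)i;             /* assumed fill value */
    int root = 1;                     /* any non-zero root reportedly triggers the crash */
    if (rank == root)
        MPI_Reduce(MPI_IN_PLACE, A, n, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);
    else
        MPI_Reduce(A, NULL, n, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);
    free(A);
    MPI_Finalize();
    return 0;
}

Running with two processes (mpiexec -n 2 ./a.out) is the configuration Edgar describes as crashing on affected MPICH builds.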
From raffenet at anl.gov Thu Jun 8 15:44:44 2023
From: raffenet at anl.gov (Raffenetti, Ken)
Date: Thu, 8 Jun 2023 20:44:44 +0000
Subject: [mpich-discuss] MPI Reduce with MPI_IN_PLACE fails with non-0 root rank for message sizes over 256 with MPI version 4 and after

Hi,

I believe this bug was recently fixed in https://github.com/pmodels/mpich/pull/6543. The fix is part of the MPICH 4.1.2 release just posted to our website and GitHub. I confirmed that your test program now works as expected, vs. an older 4.1 release.

Ken

From raffenet at anl.gov Thu Jun 8 15:50:34 2023
From: raffenet at anl.gov (Raffenetti, Ken)
Date: Thu, 8 Jun 2023 20:50:34 +0000
Subject: [mpich-discuss] MPI Reduce with MPI_IN_PLACE fails with non-0 root rank for message sizes over 256 with MPI version 4 and after

FWIW, you can work around the bug in older versions by setting MPIR_CVAR_DEVICE_COLLECTIVES=none in your environment.

Ken

From solomon2 at illinois.edu Thu Jun 8 16:01:09 2023
From: solomon2 at illinois.edu (Solomonik, Edgar)
Date: Thu, 8 Jun 2023 21:01:09 +0000
Subject: [mpich-discuss] MPI Reduce with MPI_IN_PLACE fails with non-0 root rank for message sizes over 256 with MPI version 4 and after

Thanks, indeed the env variable fixes the issue, and glad to hear it's fixed in the latest version.

Best,
Edgar
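[Note: the sketch below illustrates the workaround Ken describes. MPIR_CVAR_* settings are normally exported in the launch environment (for example in the shell, or via Hydra's "mpiexec -genv MPIR_CVAR_DEVICE_COLLECTIVES none ..."); setting the variable from inside the program, as shown here, is only an illustration and relies on the assumption that MPICH reads the CVARs from the process environment during MPI_Init.]

/* Sketch: apply the workaround from within the application itself.
 * Assumption: MPICH picks up MPIR_CVAR_* from the environment during
 * MPI_Init, so a setenv() issued before MPI_Init takes effect. The usual
 * route is to export the variable in the job's launch environment. */
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char ** argv) {
    /* Fall back from device-level collectives to the generic implementations. */
    setenv("MPIR_CVAR_DEVICE_COLLECTIVES", "none", 1);
    MPI_Init(&argc, &argv);
    /* ... application code, including the affected MPI_Reduce calls ... */
    MPI_Finalize();
    return 0;
}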
From daixw3 at lenovo.com Mon Jun 12 20:59:23 2023
From: daixw3 at lenovo.com (Xuwen XW3 Dai)
Date: Tue, 13 Jun 2023 01:59:23 +0000
Subject: [mpich-discuss] Ask for help with MPICH

Hi, developers of MPICH,

Hello, my name is Dai Xuwen, a developer at Lenovo. We are currently working on a project on a supercomputing platform where we need to use the PBS scheduler to run LINPACK jobs, and the MPI we use is MPICH. I would like to ask what the difference is between mpich-4.1.2 (stable release) and hydra-4.1.2 (stable release) in the download options on the project's official website. The LINPACK job I compiled using mpich-4.1.2 runs, but PBS cannot obtain the job process information for all of the multi-machine nodes. However, when I run the multi-machine LINPACK job with hydra-4.1.2, PBS can get the process information for all nodes. I could not work out the difference between the two packages hydra-4.1.2 and mpich-4.1.2 from the documentation you provide, and I hope to get your help. Thank you.

This email uses Google Translate; I hope you can understand it.
BRs
Dai Xuwen

From zhouh at anl.gov Wed Jun 14 07:26:21 2023
From: zhouh at anl.gov (Zhou, Hui)
Date: Wed, 14 Jun 2023 12:26:21 +0000
Subject: [mpich-discuss] Ask for help with MPICH

Hi Xuwen,

The mpich-4.1.2 release contains hydra-4.1.2; there is no difference. The standalone hydra release is for those users who only need the job launcher.

---
Hui Zhou
From kurt.e.mccall at nasa.gov Wed Jun 14 16:12:40 2023
From: kurt.e.mccall at nasa.gov (Mccall, Kurt E. (MSFC-EV41))
Date: Wed, 14 Jun 2023 21:12:40 +0000
Subject: [mpich-discuss] Duplicate reception of messages

My code seems to be receiving particular messages more than once -- I can tell from the unique variables they contain. I have verified that each unique variable is being sent exactly once. The sender simply calls MPI_Isend to send a single message containing an array of variables. The receiver pseudo-code is as follows. Have I made a mistake in my use of MPI_Iprobe?

MPI_Status status;
int flag;

// Loop in case there was more than one message queued up due to asynchrony
// between sender and receiver. Most of the time, there is only one message.
do {
    flag = false;
    MPI_Iprobe(sender_rank, tag, intercom, &flag, &status);
    if (flag) {
        int len;
        MPI_Get_count(&status, data_type, &len);
        MPI_Recv(buffer, len, data_type, sender_rank, tag, intercom,
                 MPI_STATUS_IGNORE);
        // At this point, sometimes I see duplicate variable values in the
        // buffer, in different calls to MPI_Recv. That is, a supposedly
        // unique variable appears in two different messages.
    }
} while (flag);

Thanks for your help.

Here is my build information:

MPICH Version:      4.0.1
MPICH Release date: Tue Feb 22 16:37:51 CST 2022
MPICH Device:       ch3:nemesis
MPICH configure:    --prefix=/opt/mpich --with-device=ch3:nemesis --with-slurm --disable-fortran -enable-debuginfo --enable-g=debug
MPICH CC:           gcc -g -O2
MPICH CXX:          g++ -g -O2
MPICH F77:          -g
MPICH FC:           -g
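[Note: the pseudo-code above is not self-contained. A minimal complete sketch of the same send/probe/receive pattern, reduced to two ranks on MPI_COMM_WORLD instead of the original intercommunicator, is given below as a possible starting point for the small test program requested in the reply that follows. The message count, length, tag, and payload values are made up for illustration, and the receive loop is bounded so the test terminates.]

/* Minimal sketch of the pattern described above (assumed details:
 * MPI_COMM_WORLD instead of an intercommunicator, MPI_DOUBLE payloads,
 * arbitrary tag and message count). Each message carries a unique
 * leading value, so a duplicate reception would be visible in the output. */
#include <stdio.h>
#include <mpi.h>

#define NMSG 10
#define LEN  4
#define TAG  7

int main(int argc, char ** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Sender: post all messages with MPI_Isend, then wait for completion. */
        double bufs[NMSG][LEN];
        MPI_Request reqs[NMSG];
        for (int m = 0; m < NMSG; m++) {
            for (int i = 0; i < LEN; i++)
                bufs[m][i] = m * 100.0 + i;
            MPI_Isend(bufs[m], LEN, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD, &reqs[m]);
        }
        MPI_Waitall(NMSG, reqs, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        /* Receiver: drain pending messages with MPI_Iprobe, as in the
         * pseudo-code, but bounded by the known message count. */
        int received = 0;
        while (received < NMSG) {
            int flag = 0;
            MPI_Status status;
            MPI_Iprobe(0, TAG, MPI_COMM_WORLD, &flag, &status);
            if (flag) {
                int len;
                MPI_Get_count(&status, MPI_DOUBLE, &len);
                double buffer[LEN];
                MPI_Recv(buffer, len, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank 1 received message starting with %.1f\n", buffer[0]);
                received++;
            }
        }
    }

    MPI_Finalize();
    return 0;
}

Run with mpiexec -n 2; each of the ten leading values should appear exactly once in the output.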
From thakur at anl.gov Wed Jun 14 16:36:43 2023
From: thakur at anl.gov (Thakur, Rajeev)
Date: Wed, 14 Jun 2023 21:36:43 +0000
Subject: [mpich-discuss] Duplicate reception of messages

To help narrow down the problem, can you try your code with some other MPI implementation on any system? If it works, can you send us a small test program that reproduces the problem?

Rajeev

From thomas.jayaseelan at regeneron.com Thu Jun 15 09:58:01 2023
From: thomas.jayaseelan at regeneron.com (Thomas Jayaseelan-External)
Date: Thu, 15 Jun 2023 14:58:01 +0000
Subject: [mpich-discuss] Issue in MPICH while submitting jobs through slurm in NONMEM application

Hi All,

It would be helpful if you could help me with the issue below, which we are facing in our application that uses MPI.

Best Regards,
Thomas

From: Thomas Jayaseelan-External
Sent: Thursday, June 15, 2023 10:58 AM
To: discuss at mpich.org
Subject: Issue in MPICH while submitting jobs through slurm in NONMEM application

Hi Team,

This is Thomas; I am part of the HPC Ops team at Regeneron Pharmaceuticals. We build and support the HPC cluster infrastructure for the business as per their requirements. I am reaching out to get help on an issue that we are currently facing with MPI. It would be great if you could help us find a solution.

NONMEM is the application; it is CLI based and users submit jobs through the CLI. When a user runs a job with a larger number of cores, the job runs for 10 to 15 hours and then stops intermittently. Please find below the error messages that we get in our output file:

1. Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
   internal ABORT - process 1231
   Done with nonmem execution

2. Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 600: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
   internal ABORT - process 163
   Done with nonmem execution

Details:
NONMEM application version - NM750
Slurm version - 21.08.6
MPICH version - 3.2.1
OS - Amazon Linux 2

Please let me know if you need anything from my end.

Best Regards,
Thomas
From zhouh at anl.gov Thu Jun 15 20:20:38 2023
From: zhouh at anl.gov (Zhou, Hui)
Date: Fri, 16 Jun 2023 01:20:38 +0000
Subject: [mpich-discuss] Issue in MPICH while submitting jobs through slurm in NONMEM application

Hi Thomas,

The assertion error means packet corruption. It's not clear what the cause could be unless you can provide a reproducer. In any case, mpich-3.2.1 is quite old; my first suggestion would be to try a newer MPICH and see if the error persists.

--
Hui Zhou
From shadhin873 at outlook.com Sun Jun 25 23:37:46 2023
From: shadhin873 at outlook.com (Nazmul Haque)
Date: Mon, 26 Jun 2023 04:37:46 +0000
Subject: [mpich-discuss] MPI_Init fail

Hello,

I have installed mpich2 and tried to run an executable with 'mpiexec -n 8 lmp -in .....', but it shows an error saying that MPI_Init is unable to connect to the local host. How can I solve this? I have also attached an image of the error.

Thank you for your time.

Regards,
Nazmul

[Attachment scrubbed by the list archive: Screenshot 2023-06-25 154358.png, image/png, 36173 bytes]

From raffenet at anl.gov Mon Jun 26 09:22:01 2023
From: raffenet at anl.gov (Raffenetti, Ken)
Date: Mon, 26 Jun 2023 14:22:01 +0000
Subject: [mpich-discuss] MPI_Init fail
Message-ID: <2A1E92F3-830B-4568-8259-91626A0B9606@anl.gov>

That looks to be an error from a fairly old version of MPICH. Have you tried installing the latest version (4.1.2) available on our website or GitHub?

Ken