[mpich-discuss] MPICH3 Problem
Abhishek Bhat
abhat at trinityconsultants.com
Wed Feb 4 15:32:19 CST 2015
Sangmin,
All the firewalls are off. This might be a very basic question, but do I need to install the MPICH libraries on every node even though I have installed MPICH (including its libraries) on the master in a directory that is shared with all nodes?
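For reference, a rough way to confirm that each node sees the same installation over the share (the prefix below is only a guess based on the build directory in the error output further down, and node1 is one of the workers, so adjust both as needed):
ssh node1 ls /home/Earth/MODELS/mpi/mpich-3.1.3/bin/mpiexec   # is the install visible at the same path on the worker?
ssh node1 which mpiexec                                       # does a non-interactive shell on the node find mpiexec?
ssh node1 'echo $PATH'                                        # is the shared bin directory on the node's PATH?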
Thank You
Abhishek
................................................................................................................
Abhishek Bhat, PhD, EPI,
Senior Consultant
From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Wednesday, February 04, 2015 1:41 PM
To: <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH3 Problem
Can you also check whether the firewalls are turned off on all nodes, as described in https://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes?
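In case it is useful, a rough way to check from the command line, assuming RHEL/CentOS-style nodes (the service name differs between releases, so both forms are shown; run this on every node):
service iptables status          # older init-script systems
systemctl status firewalld       # systemd-based systems
service iptables stop            # or: systemctl stop firewalld, to turn the firewall off for a quick test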
- Sangmin
On Feb 4, 2015, at 12:55 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:
Sangmin,
All nodes can communicate with each other over ssh. I ran the MPI example program, and here is the error I got:
[12:52:46] [clusteruser at Earth examples]$ mpiexec -n 5 -f machinefile ./cpi
Process 0 of 5 is on Earth
Process 1 of 5 is on Earth
Process 4 of 5 is on Earth
Process 2 of 5 is on node1
Process 3 of 5 is on node1
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1263)...............: MPI_Reduce(sbuf=0x7fff3bea9148, rbuf=0x7fff3bea9140, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(826)..........:
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(881)..........:
MPIR_Reduce_binomial(188).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 2
MPIR_Reduce_intra(846)..........:
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(881)..........:
MPIR_Reduce_binomial(250).......: Failure during collective
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 11705 RUNNING AT Earth
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1 at node1] HYD_pmcd_pmip_control_cmd_cb (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1 at node1] HYDT_dmxu_poll_wait_for_event (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1 at node1] main (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at Earth] HYDT_bscu_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at Earth] HYDT_bsci_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at Earth] HYD_pmci_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at Earth] main (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
Thank You
Abhishek
................................................................................................................
Abhishek Bhat, PhD, EPI,
Senior Consultant
From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Tuesday, February 03, 2015 6:26 PM
To: <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH3 Problem
Can all of the nodes you are using ssh to each other? And can you also try to run an MPICH example, cpi, located in <mpich_top_dir>/examples, on the same nodes to see whether you run into the same error?
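Something along these lines, assuming MPICH was built in the directory that appears in your error output and that you have a machinefile listing the master and workers (both assumptions on my part):
ssh node1 hostname                                   # should print the worker's name without prompting for a password
# repeat from each node to each of the others, in both directions
cd /home/Earth/MODELS/mpi/mpich-3.1.3/examples
mpiexec -n 5 -f machinefile ./cpi                    # machinefile lists the host names, one per line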
- Sangmin
On Feb 3, 2015, at 6:06 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:
So, after installing PGI Fortran on the shared drive, I re-ran the program, and now I am getting a communication error:
Model startup ......Fatal error in MPI_Send: Unknown error class, error stack:
MPI_Send(174)..............: MPI_Send(buf=0x1983bef8, count=1, MPI_INTEGER, dest=2, tag=1, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1832): Communication error with rank 2: Connection timed out
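In case it helps narrow this down, a rough way to check basic connectivity and name resolution between the master and a worker (the host names are the ones used on this cluster, so treat this as a sketch):
ping -c 2 node1                 # basic reachability from the master
ssh node1 ping -c 2 Earth       # and from the worker back to the master
getent hosts Earth node1        # do both names resolve, and to addresses on the cluster network?
ssh node1 getent hosts Earth    # does the worker resolve the master to the same address?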
................................................................................................................
Abhishek Bhat, PhD, EPI,
Senior Consultant
From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Tuesday, February 03, 2015 5:57 PM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] MPICH3 Problem
Correct. libpgmp.so should be from your PGI compiler installation.
- Sangmin
On Feb 3, 2015, at 5:53 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:
Sangmin,
I have installed MPICH on the shared drive, but PGI Fortran is installed in /opt/pgi, which the nodes do not have access to. I am assuming that is the current issue. I am trying to re-install PGI on the shared drive to see if that fixes the problem.
Just to confirm: libpgmp.so is not an MPICH file, correct?
Thank You
Abhishek
................................................................................................................
Abhishek Bhat, PhD, EPI,
Senior Consultant
From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Tuesday, February 03, 2015 5:50 PM
To: <discuss at mpich.org>
Subject: Re: [mpich-discuss] MPICH3 Problem
Hi Abhishek,
As the error message says, it looks like the node running the application doesn't have libpgmp.so. Can you confirm whether the node has libpgmp.so and, if it does, whether LD_LIBRARY_PATH is set correctly on that node?
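A quick way to check both, as a sketch; the binary path is taken from your error message, node3 is the node named in the output, and the library location is left as a placeholder:
ssh node3 ldd /home/Earth/MODELS/camx/src_611/CAMx.v6.11.MPICH3.pgfomp | grep 'not found'   # which libraries are missing on that node?
ssh node3 'echo $LD_LIBRARY_PATH'                                                           # what a non-interactive shell on that node sees
export LD_LIBRARY_PATH=<directory containing libpgmp.so>:$LD_LIBRARY_PATH                   # then point LD_LIBRARY_PATH at the right directory
# and make sure that export takes effect for non-interactive shells on the node (e.g. in ~/.bashrc), not just login sessions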
Best regards,
Sangmin
On Feb 3, 2015, at 5:31 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:
Hi all,
I installed MPICH3 on the master node along with PGI Fortran, then used MPICH3 and PGI to compile my software. When I run the program on the master node only, I do not get any error messages, but when I try to run it on one of the nodes, I get the following error:
/home/Earth/MODELS/camx/src_611/CAMx.v6.11.MPICH3.pgfomp: error while loading shared libraries: libpgmp.so: cannot open shared object file: No such file or directory
/home/Earth/MODELS/camx/src_611/CAMx.v6.11.MPICH3.pgfomp: error while loading shared libraries: libpgmp.so: cannot open shared object file: No such file or directory
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 5200 RUNNING AT node3
= EXIT CODE: 127
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at Earth] HYD_pmcd_pmip_control_cmd_cb (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:0 at Earth] HYDT_dmxu_poll_wait_for_event (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at Earth] main (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at Earth] HYDT_bscu_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at Earth] HYDT_bsci_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at Earth] HYD_pmci_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at Earth] main (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
The /home/Earth directory is shared and mounted on all nodes.
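In case it matters, the mapping can be verified from the failing node along these lines (node3 is the node named in the output above; treat this as a sketch):
ssh node3 df -h /home/Earth                                                     # is the share actually mounted there?
ssh node3 ls -l /home/Earth/MODELS/camx/src_611/CAMx.v6.11.MPICH3.pgfomp        # and is the binary visible at the same path?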
Any help is much appreciated.
Thank You
Abhishek
................................................................................................................
Abhishek Bhat, PhD, EPI,
Senior Consultant