[mpich-discuss] MPICH3 Problem

Seo, Sangmin sseo at anl.gov
Wed Feb 4 13:40:52 CST 2015


Can you also check whether the firewalls are turned off on all nodes? This is described in https://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes
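
In case it helps, here is a rough sketch of how the firewall state could be checked on each node (these commands assume a RHEL/CentOS-style system; adjust for your distribution):

  # List any active iptables rules (empty chains with an ACCEPT policy mean no filtering):
  sudo iptables -L -n

  # On systems using the iptables service, stop it temporarily for testing:
  sudo service iptables stop

  # On systems using firewalld instead:
  sudo systemctl stop firewalld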

— Sangmin


On Feb 4, 2015, at 12:55 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:

Sangmin,

All nodes can communicate with each other over ssh. I ran the MPI example program, and here is the error I got:

[12:52:46] [clusteruser at Earth examples]$ mpiexec -n 5 -f machinefile ./cpi
Process 0 of 5 is on Earth
Process 1 of 5 is on Earth
Process 4 of 5 is on Earth
Process 2 of 5 is on node1
Process 3 of 5 is on node1
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1263)...............: MPI_Reduce(sbuf=0x7fff3bea9148, rbuf=0x7fff3bea9140, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(826)..........:
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(881)..........:
MPIR_Reduce_binomial(188).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(636): Communication error with rank 2
MPIR_Reduce_intra(846)..........:
MPIR_Reduce_impl(1075)..........:
MPIR_Reduce_intra(881)..........:
MPIR_Reduce_binomial(250).......: Failure during collective

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 11705 RUNNING AT Earth
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1 at node1] HYD_pmcd_pmip_control_cmd_cb (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1 at node1] HYDT_dmxu_poll_wait_for_event (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1 at node1] main (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at Earth] HYDT_bscu_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at Earth] HYDT_bsci_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at Earth] HYD_pmci_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at Earth] main (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion

Thank You
Abhishek
………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant


From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Tuesday, February 03, 2015 6:26 PM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] MPICH3 Problem

Can all of the nodes that you are using ssh to each other? And can you also try running an MPICH example, cpi, located in <mpich_top_dir>/examples, on the same nodes to see if you run into the same error?
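
For example, something along these lines (only a sketch, assuming passwordless ssh keys are set up and a machinefile lists the same hosts):

  # From the master node, check ssh in both directions without a password prompt:
  ssh node1 hostname
  ssh node1 ssh Earth hostname

  # Then run the bundled example across the same hosts:
  cd <mpich_top_dir>/examples
  mpiexec -n 5 -f machinefile ./cpi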

— Sangmin


On Feb 3, 2015, at 6:06 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:

So, after installing PGI Fortran on the shared drive, I re-ran the program and now I am getting a communication error –
Model startup ......Fatal error in MPI_Send: Unknown error class, error stack:
MPI_Send(174)..............: MPI_Send(buf=0x1983bef8, count=1, MPI_INTEGER, dest=2, tag=1, MPI_COMM_WORLD) failed
MPID_nem_tcp_connpoll(1832): Communication error with rank 2: Connection timed out



………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant


From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Tuesday, February 03, 2015 5:57 PM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] MPICH3 Problem

Correct. libpgmp.so should be from your PGI compiler installation.

— Sangmin


On Feb 3, 2015, at 5:53 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:

Sangmin,

I have installed MPICH on the shared drive, but PGI Fortran is installed in /opt/pgi, which the nodes do not have access to. I am assuming that is the current issue. I am trying to re-install PGI on the shared drive to see if that fixes the problem.
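
Before re-installing, a quick check from one of the worker nodes (just a sketch, using the paths from my setup) would be:

  # See which shared libraries the binary fails to resolve on a worker node:
  ssh node3 ldd /home/Earth/MODELS/camx/src_611/CAMx.v6.11.MPICH3.pgfomp | grep "not found"

  # And whether LD_LIBRARY_PATH is set there in a non-interactive shell:
  ssh node3 'echo $LD_LIBRARY_PATH'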

Just to confirm: libpgmp.so is not an MPICH file, correct?

Thank You
Abhishek

………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant


From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Tuesday, February 03, 2015 5:50 PM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] MPICH3 Problem

Hi Abhishek,

As the error message says, it looks like the node running the application doesn't have libpgmp.so. Can you confirm whether the node has libpgmp.so and, if it does, whether LD_LIBRARY_PATH is correctly set on that node?
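
For example, something like the following (only a sketch; the PGI library directory and program name are placeholders and will depend on your installation):

  # On the failing node, check whether the library exists at all:
  find / -name libpgmp.so 2>/dev/null

  # If it does, its directory must be on LD_LIBRARY_PATH for the launched processes;
  # Hydra's mpiexec can also forward the variable to every rank explicitly:
  mpiexec -genv LD_LIBRARY_PATH /path/to/pgi/lib -n 2 -f machinefile ./your_program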

Best regards,

Sangmin


On Feb 3, 2015, at 5:31 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:

Hi All,

I installed MPICH3 on the master node along with PGI Fortran, then used MPICH3 and PGI to compile my software. When I run the program on the master node only, I do not get any error messages, but when I try to run it on one of the nodes, I get the following error –

/home/Earth/MODELS/camx/src_611/CAMx.v6.11.MPICH3.pgfomp: error while loading shared libraries: libpgmp.so: cannot open shared object file: No such file or directory
/home/Earth/MODELS/camx/src_611/CAMx.v6.11.MPICH3.pgfomp: error while loading shared libraries: libpgmp.so: cannot open shared object file: No such file or directory

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 5200 RUNNING AT node3
=   EXIT CODE: 127
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at Earth] HYD_pmcd_pmip_control_cmd_cb (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:0 at Earth] HYDT_dmxu_poll_wait_for_event (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at Earth] main (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at Earth] HYDT_bscu_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at Earth] HYDT_bsci_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at Earth] HYD_pmci_wait_for_completion (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at Earth] main (/home/Earth/MODELS/mpi/mpich-3.1.3/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion

The /home/Earth directory is shared and mounted on all nodes.

Any help is much appreciated.

Thank You
Abhishek
………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant
