[mpich-discuss] Optimal Firewall Settings for MPICH/HYDRA
Capehart, William J
William.Capehart at sdsmt.edu
Tue Jul 22 12:48:32 CDT 2014
Hi All
We're running MPICH on a couple of machines with a brand-new UNIX distro (SL 6.5). They sit on a vulnerable network, so rather than leave the firewalls down we would like to run MPICH through the firewall.
We have set the MPIEXEC_PORT_RANGE and MPIR_CVAR_CH3_PORT_RANGE environment variables and
have adjusted our iptables rules accordingly, in line with the "FAQ" guidance.
Our passwordless SSH works fine between the machines.
With all of this in place, the cpi and fpi MPICH test programs start successfully but then crash when the firewall is up (they of course run happily with the firewall down).
An example of the basic output is below (the nodesshort host file sends one process to "this.machine" and one to the remote "that.machine"):
[this.machine]% mpiexec -n 2 -f nodesshort cpi.exe
Process 0 of 2 is on this.machine
Process 1 of 2 is on that.machine
Fatal error in PMPI_Reduce: A process has failed, error stack:
PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff466a94d0, rbuf=0x7fff466a94d8, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
MPIR_Reduce_impl(1029)..........:
MPIR_Reduce_intra(835)..........:
MPIR_Reduce_binomial(144).......:
MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@that.machine] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1@that.machine] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@that.machine] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@this.machine] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@this.machine] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@this.machine] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@this.machine] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
In debug mode, the output confirms that mpiexec is at least starting with the first available port listed in MPIEXEC_PORT_RANGE.
But later we get output like this:
[mpiexec@this.machine] PMI response to fd 6 pid 4: cmd=keyval_cache P0-businesscard=description#{this.machine's.ip.address}$port#54105$ifname#{this.machine's.ip.address}$ P1-businesscard=description#{that.machine's.ip.address}$port#47302$ifname#{that.machine's.ip.address}$
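One way to compare this debug output against the configured range is to pull the port numbers out of the businesscard entries. The following is only a minimal Python sketch and is not part of MPICH: the helper names are made up for illustration, the 10.0.0.x addresses stand in for the real IPs, and the 50000:51000 range is a placeholder for whatever low:high value was actually set in the port-range variables.

```python
import re


def businesscard_ports(line):
    """Return the ports advertised in the businesscard entries of a PMI debug line."""
    # Each business card carries a field of the form port#<number>$
    return [int(p) for p in re.findall(r"port#(\d+)\$", line)]


def ports_outside_range(ports, port_range):
    """Return the ports that fall outside a 'low:high' range string."""
    low, high = (int(x) for x in port_range.split(":"))
    return [p for p in ports if not low <= p <= high]


# Placeholder addresses; the port numbers are the ones from the debug output.
line = (
    "cmd=keyval_cache "
    "P0-businesscard=description#10.0.0.1$port#54105$ifname#10.0.0.1$ "
    "P1-businesscard=description#10.0.0.2$port#47302$ifname#10.0.0.2$"
)

ports = businesscard_ports(line)
# Hypothetical range; substitute the value used for the firewall rules.
outside = ports_outside_range(ports, "50000:51000")
print("advertised ports:", ports)
print("outside configured range:", outside)
```

If the advertised ports land outside the range opened in iptables, that might suggest the port-range setting is not reaching the MPI processes themselves.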
Does this mean that we have missed a firewall setting, either in the environment variables or in the iptables rules themselves?
Ideas?
Thanks Much
Bill