[mpich-discuss] Optimal Firewall Settings for MPICH/HYDRA

Kenneth Raffenetti raffenet at mcs.anl.gov
Tue Jul 22 12:52:23 CDT 2014


What version of MPICH/Hydra is this?

On 07/22/2014 12:48 PM, Capehart, William J wrote:
> Hi All
>
> We’re running MPICH on a couple of machines with a brand new UNIX
> distro (SL 6.5). They are on a vulnerable network, so rather than leave
> the firewalls down we would like to run MPICH through the firewall.
>
> We have set the MPIEXEC_PORT_RANGE and MPIR_CVAR_CH3_PORT_RANGE
> environment variables and have adjusted our iptables rules accordingly,
> in line with the “FAQ” guidance.
>
> Our passwordless SSH works fine between the machines.
>
> All of this gives us only momentary success with the cpi and fpi MPICH
> test programs: they start, but then crash with the firewall up (and of
> course run happily with the firewall down).
>
> An example of the basic output is below (the nodesshort host file sends
> one process to “this.machine” and one to the remote “that.machine”):
>
>
> [this.machine]% mpiexec -n 2 -f nodesshort cpi.exe
> Process 0 of 2 is on this.machine
> Process 1 of 2 is on that.machine
> Fatal error in PMPI_Reduce: A process has failed, error stack:
> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff466a94d0, rbuf=0x7fff466a94d8, count=1, MPI_DOUBLE, MPI_SUM, root=0, MPI_COMM_WORLD) failed
> MPIR_Reduce_impl(1029)..........:
> MPIR_Reduce_intra(835)..........:
> MPIR_Reduce_binomial(144).......:
> MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:1@that.machine] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
> [proxy:0:1@that.machine] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:1@that.machine] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [mpiexec@this.machine] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec@this.machine] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec@this.machine] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
> [mpiexec@this.machine] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
>
>
>
> In debug mode, the output confirms that it is at least *starting* with
> the first available port listed in MPIEXEC_PORT_RANGE.
>
> But later we get output like this:
>
> [mpiexec@this.machine] PMI response to fd 6 pid 4: cmd=keyval_cache
> P0-businesscard=description#{this.machine’s.ip.address}$port#54105$ifname#{this.machine’s.ip.address}$
> P1-businesscard=description#{that.machine’s.ip.address}$port#47302$ifname#{that.machine’s.ip.address}$
>
>
>
> Does this mean that we have missed a firewall setting, either in the
> environment variables or in the iptables rules themselves?
>
>
> Ideas?
>
>
>
> Thanks Much
>
> Bill
>
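
One way to narrow down the question in the quoted message is to compare the ports advertised in the business cards with the range opened in the firewall. The following is only a minimal sketch: the helper names are hypothetical, the IP addresses are documentation placeholders standing in for the redacted ones, and the range 10000:10100 is an assumed example value in the low:high form used for the port range variables, not the poster's actual setting.

```python
import re

# Matches "P<rank>-businesscard=...port#<port>$" as it appears in the
# keyval_cache line of the debug output quoted above.
CARD_RE = re.compile(r"P(\d+)-businesscard=\S*?port#(\d+)\$")


def parse_business_cards(text):
    """Return {rank: port} for every business card found in the text."""
    return {int(rank): int(port) for rank, port in CARD_RE.findall(text)}


def parse_port_range(value):
    """Parse a 'low:high' string into a (low, high) tuple of ints."""
    low, high = (int(part) for part in value.split(":"))
    return low, high


def ports_outside_range(cards, port_range):
    """Return {rank: port} for ports that fall outside the inclusive range."""
    low, high = port_range
    return {rank: port for rank, port in cards.items()
            if not low <= port <= high}


# Placeholder addresses; the ports are the ones shown in the quoted output.
debug_output = """
P0-businesscard=description#192.0.2.10$port#54105$ifname#192.0.2.10$
P1-businesscard=description#192.0.2.11$port#47302$ifname#192.0.2.11$
"""

cards = parse_business_cards(debug_output)
# Assumed example range, not the value from the original post.
outside = ports_outside_range(cards, parse_port_range("10000:10100"))
for rank, port in sorted(outside.items()):
    print(f"rank {rank} advertises port {port}, outside the assumed range")
```

If the advertised ports fall outside the range that iptables allows, that would suggest the range setting is not reaching the MPI processes themselves, even though mpiexec honors MPIEXEC_PORT_RANGE.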


