[mpich-discuss] Optimal Firewall Settings for MPICH/HYDRA

Balaji, Pavan balaji at anl.gov
Tue Jul 22 14:44:30 CDT 2014


Bill,

Just to make sure this is a firewall problem, can you try disabling the firewall for a short time to try out MPICH and see if it works correctly?  Remember to turn off the firewall on all machines, not just the head node.
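
For example, on SL 6.5 something along these lines on every node should do it (this is just the usual RHEL 6-style service command; remember to restore the rules once you are done testing):

  # temporarily disable the firewall on this node (repeat on every node)
  sudo service iptables stop

  # ... run the cpi/fpi tests ...

  # restore the firewall afterwards
  sudo service iptables start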

  — Pavan

On Jul 22, 2014, at 2:18 PM, Capehart, William J <William.Capehart at sdsmt.edu> wrote:

> That would be the one that comes with PGI 14.6 (MPICH 3.0.4)
> 
> Bill
> 
> 
> On 7/22/14, 11:52 MDT, "Kenneth Raffenetti" <raffenet at mcs.anl.gov> wrote:
> 
>> What version of MPICH/Hydra is this?
>> 
>> On 07/22/2014 12:48 PM, Capehart, William J wrote:
>>> Hi All
>>> 
>>> We're running MPICH on a couple of machines with a brand-new Linux
>>> distro (SL 6.5). They sit on a vulnerable network, so rather than leave
>>> the firewalls down we would like to run MPICH through the firewall.
>>> 
>>> We have set the MPIEXEC_PORT_RANGE and MPIR_CVAR_CH3_PORT_RANGE
>>> environment variables and have adjusted our iptables rules accordingly,
>>> in line with the FAQ guidance.
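>>>
>>> Concretely, our settings look roughly like the following (the
>>> 50000:50100 range is only a placeholder, not our exact values):
>>>
>>> # on every node, before launching: pin MPICH/Hydra to a known port range
>>> export MPIEXEC_PORT_RANGE=50000:50100
>>> export MPIR_CVAR_CH3_PORT_RANGE=50000:50100
>>>
>>> # matching iptables rule on every node (plus 22/tcp for the SSH launcher)
>>> iptables -A INPUT -p tcp --dport 50000:50100 -j ACCEPT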
>>> 
>>> Our passwordless SSH works fine between the machines.
>>> 
>>> All of this gives us only momentary success with the cpi and fpi MPICH
>>> test programs: they crash with the firewall up (but of course run
>>> happily with the firewall down).
>>> 
>>> An example of the basic output is below (the nodesshort host file sends
>>> one process to "this.machine" and one to the remote "that.machine"):
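>>>
>>> For reference, nodesshort is just the two hosts, one per line, roughly:
>>>
>>> this.machine
>>> that.machine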
>>> 
>>> 
>>> [this.machine]% mpiexec -n 2 -f nodesshort cpi.exe
>>> 
>>> Process 0 of 2 is on this.machine
>>> 
>>> Process 1 of 2 is on that.machine
>>> 
>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>> 
>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff466a94d0,
>>> rbuf=0x7fff466a94d8, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>>> MPI_COMM_WORLD) failed
>>> 
>>> MPIR_Reduce_impl(1029)..........:
>>> 
>>> MPIR_Reduce_intra(835)..........:
>>> 
>>> MPIR_Reduce_binomial(144).......:
>>> 
>>> MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
>>> 
>>> 
>>> 
>>> ===================================================================================
>>> 
>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> 
>>> =   EXIT CODE: 1
>>> 
>>> =   CLEANING UP REMAINING PROCESSES
>>> 
>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>> 
>>> 
>>> ===================================================================================
>>> 
>>> [proxy:0:1 at that.machine] HYD_pmcd_pmip_control_cmd_cb
>>> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
>>> 
>>> [proxy:0:1 at that.machine] HYDT_dmxu_poll_wait_for_event
>>> (./tools/demux/demux_poll.c:77): callback returned error status
>>> 
>>> [proxy:0:1 at that.machine] main (./pm/pmiserv/pmip.c:206): demux engine
>>> error waiting for event
>>> 
>>> [mpiexec at this.machine] HYDT_bscu_wait_for_completion
>>> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>>> terminated badly; aborting
>>> 
>>> [mpiexec at this.machine] HYDT_bsci_wait_for_completion
>>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
>>> for completion
>>> 
>>> [mpiexec at this.machine] HYD_pmci_wait_for_completion
>>> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
>>> completion
>>> 
>>> [mpiexec at this.machine] main (./ui/mpich/mpiexec.c:331): process manager
>>> error waiting for completion
>>> 
>>> 
>>> 
>>> In debug mode it confirms that it is at least *starting* with the first
>>> available port listed in MPIEXEC_PORT_RANGE.
>>> 
>>> But later we get output like this:
>>> 
>>> [mpiexec at this.machine] PMI response to fd 6 pid 4: cmd=keyval_cache
>>> 
>>> P0-businesscard=description#{this.machine's.ip.address}$port#54105$ifname#{this.machine's.ip.address}$
>>>
>>> P1-businesscard=description#{that.machine's.ip.address}$port#47302$ifname#{that.machine's.ip.address}$
>>> 
>>> 
>>> 
>>> Does this mean that we have missed a firewall setting, either in the
>>> environment variables or in the iptables rules themselves?
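>>>
>>> Is checking the live rules and listening sockets on each node, roughly
>>> as below, the right way to verify that (program names and ports here
>>> are just placeholders)?
>>>
>>> # show the active INPUT rules with packet counters
>>> iptables -L INPUT -n -v --line-numbers
>>>
>>> # show which TCP ports the MPI processes are actually listening on
>>> netstat -tlnp | grep cpi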
>>> 
>>> 
>>> Ideas?
>>> 
>>> 
>>> 
>>> Thanks Much
>>> 
>>> Bill
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> discuss mailing list     discuss at mpich.org
>>> To manage subscription options or unsubscribe:
>>> https://lists.mpich.org/mailman/listinfo/discuss
>>> 

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji


