[mpich-discuss] Optimal Firewall Settings for MPICH/HYDRA
Balaji, Pavan
balaji at anl.gov
Tue Jul 22 14:44:30 CDT 2014
Bill,
Just to make sure this is a firewall problem, can you try disabling the firewall for a short time and see whether MPICH then works correctly? Remember to turn off the firewall on all machines, not just the head node.
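For example, on SL 6.5 with the stock iptables service (adjust if you use a different firewall), something like this on every node should do it:

  # temporarily stop the firewall on this node
  # (restore afterwards with "service iptables start")
  sudo service iptables stop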
-- Pavan
On Jul 22, 2014, at 2:18 PM, Capehart, William J <William.Capehart at sdsmt.edu> wrote:
> That would be the one that comes with PGI 14.6 (MPICH 3.0.4)
>
> Bill
>
>
> On 7/22/14, 11:52 MDT, "Kenneth Raffenetti" <raffenet at mcs.anl.gov> wrote:
>
>> What version of MPICH/Hydra is this?
>>
>> On 07/22/2014 12:48 PM, Capehart, William J wrote:
>>> Hi All
>>>
>>> We're running MPICH on a couple of machines with a brand-new Linux distro
>>> (SL 6.5). They sit on a vulnerable network, so rather than leave the
>>> firewalls down we would like to run MPICH through the firewall.
>>>
>>> We have set the MPIEXEC_PORT_RANGE and MPIR_CVAR_CH3_PORT_RANGE
>>> environment variables and have adjusted our iptables rules accordingly,
>>> in line with the "FAQ" guidance.
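>>>
>>> For illustration, our setup follows the usual FAQ pattern; the
>>> 50000:50100 range below is just a placeholder, not our exact values:
>>>
>>>   # on every node, pin MPICH/Hydra to a fixed TCP port range
>>>   export MPIEXEC_PORT_RANGE=50000:50100
>>>   export MPIR_CVAR_CH3_PORT_RANGE=50000:50100
>>>   # and open that same range in iptables (inserted ahead of any REJECT rule)
>>>   iptables -I INPUT -p tcp --dport 50000:50100 -j ACCEPT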
>>>
>>> Our passwordless SSH works fine between the machines.
>>>
>>> All of this gives us only momentary success with the cpi and fpi MPICH
>>> test programs: they crash with the firewall up (but, of course, run
>>> happily with the firewall down).
>>>
>>> An example of the basic output is below (the nodesshort host file sends
>>> one process to "this.machine" and one to the remote "that.machine"):
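>>>
>>> (For reference, nodesshort simply lists the two hosts, one per line; the
>>> names here are placeholders:
>>>
>>>   this.machine
>>>   that.machine
>>> )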
>>>
>>>
>>> [this.machine]% mpiexec -n 2 -f nodesshort cpi.exe
>>>
>>> Process 0 of 2 is on this.machine
>>>
>>> Process 1 of 2 is on that.machine
>>>
>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>
>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff466a94d0,
>>> rbuf=0x7fff466a94d8, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>>> MPI_COMM_WORLD) failed
>>>
>>> MPIR_Reduce_impl(1029)..........:
>>>
>>> MPIR_Reduce_intra(835)..........:
>>>
>>> MPIR_Reduce_binomial(144).......:
>>>
>>> MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
>>>
>>>
>>>
>>> ====================================================================================
>>>
>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>>
>>> = EXIT CODE: 1
>>>
>>> = CLEANING UP REMAINING PROCESSES
>>>
>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>
>>>
>>> ====================================================================================
>>>
>>> [proxy:0:1 at that.machine] HYD_pmcd_pmip_control_cmd_cb
>>> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
>>>
>>> [proxy:0:1 at that.machine] HYDT_dmxu_poll_wait_for_event
>>> (./tools/demux/demux_poll.c:77): callback returned error status
>>>
>>> [proxy:0:1 at that.machine] main (./pm/pmiserv/pmip.c:206): demux engine
>>> error waiting for event
>>>
>>> [mpiexec at this.machine] HYDT_bscu_wait_for_completion
>>> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>>> terminated badly; aborting
>>>
>>> [mpiexec at this.machine] HYDT_bsci_wait_for_completion
>>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
>>> for completion
>>>
>>> [mpiexec at this.machine] HYD_pmci_wait_for_completion
>>> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
>>> completion
>>>
>>> [mpiexec at this.machine] main (./ui/mpich/mpiexec.c:331): process manager
>>> error waiting for completion
>>>
>>>
>>>
>>> In debug mode, it confirms that it at least *starts* with the first
>>> available port listed in MPIEXEC_PORT_RANGE.
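>>>
>>> (The debug output here comes from Hydra's verbose mode, e.g.:
>>>
>>>   mpiexec -verbose -n 2 -f nodesshort cpi.exe
>>>
>>> which prints the PMI key/value exchange shown below.)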
>>>
>>> But later we get output like this:
>>>
>>> [mpiexec at this.machine] PMI response to fd 6 pid 4: cmd=keyval_cache
>>>
>>> P0-businesscard=description#{this.machine's.ip.address}$port#54105$ifname#{this.machine's.ip.address}$
>>>
>>> P1-businesscard=description#{that.machine's.ip.address}$port#47302$ifname#{that.machine's.ip.address}$
>>>
>>>
>>>
>>> Does this mean that we have missed a firewall setting, either in the
>>> environment variables or in the iptables rules themselves?
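>>>
>>> (One check we can run while the job is up, to see which ports the
>>> processes actually bind; the process name is whatever the test binary
>>> is called:
>>>
>>>   netstat -tlnp | grep cpi
>>>
>>> and then compare those ports against the configured range.)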
>>>
>>>
>>> Ideas?
>>>
>>>
>>>
>>> Thanks Much
>>>
>>> Bill
>>>
>>>
>>>
>>>
>>>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji