[mpich-discuss] Optimal Firewall Settings for MPICH/HYDRA

Capehart, William J William.Capehart at sdsmt.edu
Tue Jul 22 14:18:41 CDT 2014


That would be the one that comes with PGI 14.6 (MPICH 3.0.4)
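(Checked with something along the lines of:

   mpiexec --version | head -2
)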

Bill


On 7/22/14, 11:52 MDT, "Kenneth Raffenetti" <raffenet at mcs.anl.gov> wrote:

>What version of MPICH/Hydra is this?
>
>On 07/22/2014 12:48 PM, Capehart, William J wrote:
>> Hi All
>>
>> We're running MPICH on a couple of machines with a brand new Linux
>> distro (SL 6.5). They sit on a vulnerable network, so rather than leave
>> the firewalls dropped we would like to run MPICH through the firewall.
>>
>> We have set the MPIEXEC_PORT_RANGE and MPIR_CVAR_CH3_PORT_RANGE
>> environment variables and have adjusted our iptables accordingly, in
>> line with the "FAQ" guidance.
>>
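>> For reference, roughly what we set on both machines (the port range and
>> addresses below are just placeholders for the values we actually use):
>>
>>    # in the job's environment on the launching host
>>    export MPIEXEC_PORT_RANGE=50000:51000
>>    export MPIR_CVAR_CH3_PORT_RANGE=50000:51000
>>
>>    # matching iptables rules on each host (plus the usual ssh rule)
>>    iptables -A INPUT -p tcp -s {that.machine's.ip.address} --dport 50000:51000 -j ACCEPT
>>    iptables -A INPUT -p tcp -s {that.machine's.ip.address} --dport 22 -j ACCEPT
>>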
>> Our passwordless SSH works fine between the machines.
>>
>> But all of this gives us only momentary success with the cpi and fpi
>> MPICH test programs: they crash with the firewall up (but of course run
>> happily with the firewall down).
>>
>> An example of the basic output is below (the nodesshort machine file
>> sends one process to "this.machine" and one to the remote
>> "that.machine"):
>>
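>> (The nodesshort file itself is nothing fancy, just the two hosts with
>> one slot each, along the lines of:
>>
>>    this.machine:1
>>    that.machine:1
>> )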
>>
>> [this.machine]% mpiexec -n 2 -f nodesshort cpi.exe
>>
>> Process 0 of 2 is on this.machine
>>
>> Process 1 of 2 is on that.machine
>>
>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>
>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff466a94d0,
>> rbuf=0x7fff466a94d8, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>> MPI_COMM_WORLD) failed
>>
>> MPIR_Reduce_impl(1029)..........:
>>
>> MPIR_Reduce_intra(835)..........:
>>
>> MPIR_Reduce_binomial(144).......:
>>
>> MPIDI_CH3U_Recvq_FDU_or_AEP(667): Communication error with rank 1
>>
>>
>> ===================================================================================
>>
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>
>> =   EXIT CODE: 1
>>
>> =   CLEANING UP REMAINING PROCESSES
>>
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>> ===================================================================================
>>
>> [proxy:0:1 at that.machine] HYD_pmcd_pmip_control_cmd_cb
>> (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
>>
>> [proxy:0:1 at that.machine] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77): callback returned error status
>>
>> [proxy:0:1 at that.machine] main (./pm/pmiserv/pmip.c:206): demux engine
>> error waiting for event
>>
>> [mpiexec at this.machine] HYDT_bscu_wait_for_completion
>> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes
>> terminated badly; aborting
>>
>> [mpiexec at this.machine] HYDT_bsci_wait_for_completion
>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting
>> for completion
>>
>> [mpiexec at this.machine] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for
>> completion
>>
>> [mpiexec at this.machine] main (./ui/mpich/mpiexec.c:331): process manager
>> error waiting for completion
>>
>>
>>
>> In debug mode it affirms that it is at least *starting* with the first
>> available port as listed in MPIEXEC_PORT_RANGE.
>>
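>> (By "debug mode" I mean running under Hydra's verbose output, i.e.
>> something like:
>>
>>    mpiexec -verbose -n 2 -f nodesshort cpi.exe
>> )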
>> But later we get output like this:
>>
>> [mpiexec at this.machine] PMI response to fd 6 pid 4: cmd=keyval_cache
>>
>> P0-businesscard=description#{this.machine's.ip.address}$port#54105$ifname#{this.machine's.ip.address}$
>>
>> P1-businesscard=description#{that.machine's.ip.address}$port#47302$ifname#{that.machine's.ip.address}$
>>
>>
>>
>> Does this mean that we have missed a firewall setting, either in the
>> environment variables or in the iptables rules themselves?
>>
>>
>> Ideas?
>>
>>
>>
>> Thanks Much
>>
>> Bill
>>
>>
>>
>>
>>
>_______________________________________________
>discuss mailing list     discuss at mpich.org
>To manage subscription options or unsubscribe:
>https://lists.mpich.org/mailman/listinfo/discuss



