[mpich-discuss] Fatal error in PMPI_Reduce

Pavan Balaji balaji at mcs.anl.gov
Sat Jan 12 12:34:22 CST 2013


Hi Michael,

On 01/12/2013 12:29 PM US Central Time, Michael Colonno wrote:
>             Thanks for the reply. I did verify connectivity prior to
> trying the program. In this test case I have two systems, each with 16
> cores (cv-hpcn1 and cv-hpcn2). ssh between the two systems appears to be
> working as far as I can tell:

Even if ssh is working, there might still be problems.  For example,
many firewalls allow ssh (port 22) but block other ports, so MPICH can
fail to connect even though ssh works.  Similarly, if one of the nodes
has a broken /etc/hosts file that maps another host to the wrong IP
address, MPICH will fail as well.
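
Both failure modes above can be checked by hand without MPICH.  Here is
a rough sketch, assuming Python 3 is available on both nodes.  The
function names and the test port are my own choices for illustration
and are not part of MPICH.

```python
import socket

def resolve_host(name):
    """Return the sorted IP addresses `name` resolves to on this node.

    An empty list means the name did not resolve at all. Compare the
    result across nodes to spot a bad /etc/hosts entry.
    """
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name did not resolve.
        return False

def listen_once(port, timeout=60.0):
    """Accept one TCP connection on `port` and return the peer's IP.

    Run this on the remote node so that can_connect() has something to
    reach on a port other than 22. Raises a timeout error if nobody
    connects within `timeout` seconds.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", port))
        srv.listen(1)
        srv.settimeout(timeout)
        conn, peer = srv.accept()
        conn.close()
        return peer[0]
```

For example, call listen_once(50000) on cv-hpcn2 and then
can_connect("cv-hpcn2", 50000) on cv-hpcn1 (50000 is an arbitrary
unprivileged port, not one MPICH is known to use).  If that fails while
ssh works, a firewall is the likely culprit.  If resolve_host("cv-hpcn2")
returns different addresses on the two nodes, look at /etc/hosts.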

My recommendation would be to selectively remove hosts from your
host file to figure out which host is problematic.
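
One way to do this systematically is to generate every host list with
a single host left out and try each in turn.  A minimal sketch follows.
It assumes the host file has one entry per line and that lines starting
with '#' are comments; the helper names are hypothetical.

```python
def read_hosts(text):
    """Parse host file text into a list of entries.

    Assumes one entry per line; blank lines and lines starting with
    '#' are skipped.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries

def leave_one_out(hosts):
    """Yield (removed_host, remaining_hosts) for each host in turn."""
    for i, removed in enumerate(hosts):
        yield removed, hosts[:i] + hosts[i + 1:]

def trial_files(hosts):
    """Return {removed_host: host file text without that host}."""
    return {removed: "\n".join(rest) + "\n"
            for removed, rest in leave_one_out(hosts)}
```

Write each text from trial_files() to its own file and pass it to
mpirun with -f.  If the run succeeds only when a particular host is
left out, that host is the one to investigate.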

> cv-hpcn1$ /usr/local/apps/MPICH2/standalone/bin/mpirun -n 32 -f
> /usr/local/apps/hosts /usr/local/apps/cxxcpi

FWIW, in a SLURM environment, mpiexec/mpirun automatically uses the
allocated hosts as the default for "-f".  You can override it with
whatever you want, but if you want the default set of allocated hosts,
-f is not required.

 -- Pavan

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji


