[mpich-discuss] Fatal error in PMPI_Reduce

Michael Colonno mcolonno at stanford.edu
Sat Jan 12 14:51:09 CST 2013


	Hi Pavan ~

	I'm only using two hosts in this test. cxxcpi works successfully on
any one host (n <= 16) but fails on any two hosts (n > 16). So whatever the
issue is, it is common to all the systems (which I suppose makes sense).
All these systems have firewalls disabled, so I can confirm that a firewall
is not the issue. The only thing that's not completely standard is that they
use a bond of several eth devices for communication and some fancier
load-balancing network equipment. That should all be external to the OS,
however. I'm almost ready to test out the little VM test network; I will
report what I find. 

	Thanks,
	~Mike C. 

-----Original Message-----
From: Pavan Balaji [mailto:balaji at mcs.anl.gov] 
Sent: Saturday, January 12, 2013 10:34 AM
To: Michael Colonno
Cc: discuss at mpich.org
Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce

Hi Michael,

On 01/12/2013 12:29 PM US Central Time, Michael Colonno wrote:
>             Thanks for the reply. I did verify connectivity prior to 
> trying the program. In this test case I have two systems, each with 16 
> cores (cv-hpcn1 and cv-hpcn2). ssh between the two systems appears to 
> be working as far as I can tell:

Even if ssh is working, there might still be problems.  For example, many
firewalls allow ssh (port 22), but not other ports.  So even if ssh appears
to be working, MPICH might not be able to communicate.  Similarly, if one of
the nodes has a messed up /etc/hosts file pointing to a wrong IP address for
another host, MPICH will again fail.
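
A minimal sketch of how both conditions above could be checked from each
node, in Python. This is only an illustration, not an MPICH tool: the helper
names (resolve, is_loopback, can_connect, report) and the port number 50000
are assumptions, and the connect test is only meaningful if something is
actually listening on that port on the remote host. The host names are the
two from this thread.

```python
import ipaddress
import socket


def resolve(host):
    """Return the sorted IP addresses that `host` resolves to on this node."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def is_loopback(addr):
    """True if `addr` is a loopback address (e.g. 127.0.0.1 or ::1)."""
    # Drop any IPv6 scope suffix such as "%eth0" before parsing.
    return ipaddress.ip_address(addr.split("%")[0]).is_loopback


def can_connect(host, port, timeout=3.0):
    """True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def report(hosts, port):
    """Print what each host resolves to and whether `port` is reachable."""
    for host in hosts:
        try:
            addrs = resolve(host)
        except socket.gaierror as exc:
            print(f"{host}: does not resolve ({exc})")
            continue
        flagged = [a for a in addrs if is_loopback(a)]
        print(f"{host}: resolves to {addrs}")
        if flagged:
            print(f"  note: {flagged} is loopback; fine only for the local host")
        print(f"  port {port} reachable: {can_connect(host, port)}")


# Example call (hypothetical port; run it on every node and compare):
# report(["cv-hpcn1", "cv-hpcn2"], 50000)
```

If a remote host name resolves to a loopback address, or to different
addresses on different nodes, that points at /etc/hosts rather than at MPICH
itself.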

My recommendation would be to selectively remove hosts from your host
file to figure out the problematic host first.
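
One way to organize that search, sketched in Python under some assumptions:
the host file is taken to be one host name per line, the mpirun invocation
reuses only the -n and -f options shown in this thread, and two processes per
pair are assumed to be enough to place one process on each host. The helper
names (host_pairs, hostfile_text, mpirun_command) and the pair.hosts file
name are hypothetical. The sketch only builds the host file contents and
command lines; it does not run anything.

```python
from itertools import combinations


def host_pairs(hosts):
    """Return every two-host combination, in the order the hosts were given."""
    return list(combinations(hosts, 2))


def hostfile_text(hosts):
    """Return host file contents: one host name per line (assumed format)."""
    return "".join(f"{host}\n" for host in hosts)


def mpirun_command(mpirun, hostfile, nprocs, program):
    """Build the argument list for one test run; nothing is executed here."""
    return [mpirun, "-n", str(nprocs), "-f", hostfile, program]


# Example: print the host file and command for each pair to test by hand.
hosts = ["cv-hpcn1", "cv-hpcn2"]
for pair in host_pairs(hosts):
    print("host file contents:")
    print(hostfile_text(pair), end="")
    cmd = mpirun_command(
        "/usr/local/apps/MPICH2/standalone/bin/mpirun",
        "pair.hosts",  # hypothetical file name to save the contents under
        2,
        "/usr/local/apps/cxxcpi",
    )
    print("command:", " ".join(cmd))
```

With only two hosts there is a single pair, so this mostly pays off on larger
clusters, where a host that appears in every failing pair is the likely
culprit.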

> cv-hpcn1$ /usr/local/apps/MPICH2/standalone/bin/mpirun -n 32 -f 
> /usr/local/apps/hosts /usr/local/apps/cxxcpi

FWIW, in a SLURM environment, mpiexec/mpirun automatically detects the
default "-f" parameter.  You can override it with whatever you want, but if
you want to use the default set of allocated hosts, -f is not required.

 -- Pavan

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji



