[mpich-discuss] Fatal error in PMPI_Reduce
Pavan Balaji
balaji at mcs.anl.gov
Sat Jan 12 15:16:22 CST 2013
On 01/12/2013 02:51 PM US Central Time, Michael Colonno wrote:
> I'm only using two hosts in this test. cxxcpi works successfully on
> any one host (n <= 16) but fails on any two hosts (n > 16). So
> whatever the issue is it's systematic to all the systems (which I
> suppose makes sense).
FWIW, you can pass -ppn 1 to mpiexec to ask it to pretend that each host
has only one core. That way, you should be able to reproduce this with
just n == 2.
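
A minimal sketch of that invocation, assuming the test binary is ./cxxcpi
in the current directory (that path is a guess on my part) and that you
are inside a two-host slurm allocation:

```shell
#!/bin/sh
# Assumed location of the example binary; adjust for your build tree.
APP=./cxxcpi

# -ppn 1 places one process per host, so 2 ranks span 2 hosts.
set -- mpiexec -ppn 1 -n 2 "$APP"
echo "command: $*"

# Launch only when mpiexec and the binary are actually present.
if command -v mpiexec >/dev/null 2>&1 && [ -x "$APP" ]; then
    "$@"
fi
```

If the failure is tied to crossing hosts, this should fail the same way
the n > 16 runs do.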
> All these systems have firewalls disabled so I can confirm that
> firewall is not an issue. The only thing that's not completely
> standard is they use a bond of several eth devices for communication
> and some fancier load-balancing network equipment.
Hmm, interesting. Is it possible that the hostname provided by slurm
matches an eth device that does not have full connectivity? Can you try
passing -disable-hostname-propagation to mpiexec to see if that helps?
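
Something along these lines, again assuming the binary is ./cxxcpi in the
current directory (hypothetical path) and combining it with -ppn 1 so
that two ranks are enough to cross hosts:

```shell
#!/bin/sh
# Assumed location of the example binary; adjust for your build tree.
APP=./cxxcpi

# Same two-rank, one-per-host run, with hostname propagation disabled.
set -- mpiexec -disable-hostname-propagation -ppn 1 -n 2 "$APP"
echo "command: $*"

# Launch only when mpiexec and the binary are actually present.
if command -v mpiexec >/dev/null 2>&1 && [ -x "$APP" ]; then
    "$@"
fi
```

If this run succeeds where the plain one fails, that points at the
hostname handed over by slurm rather than at MPICH itself.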
-- Pavan
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji