[mpich-discuss] Fatal error in PMPI_Reduce

Pavan Balaji balaji at mcs.anl.gov
Sat Jan 12 15:16:22 CST 2013


On 01/12/2013 02:51 PM US Central Time, Michael Colonno wrote:
> I'm only using two hosts in this test. cxxcpi works successfully on 
> any one host (n <= 16) but fails on any two hosts (n > 16). So 
> whatever the issue is it's systematic to all the systems (which I 
> suppose makes sense).

FWIW, you can pass -ppn 1 to mpiexec to make it place only one process
on each host, as if each host had a single core.  That way, you should
be able to reproduce this with just n == 2.
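
A minimal sketch of that reproduction command, written as a small Python
dry run that only builds and prints the command line. The helper name
mpiexec_command is hypothetical, and the sketch assumes the test binary
is available as ./cxxcpi and that mpiexec is the Hydra launcher.

```python
import shlex


def mpiexec_command(program, nprocs, ppn=None):
    """Build an mpiexec argument list (hypothetical helper, not part of MPICH)."""
    cmd = ["mpiexec"]
    if ppn is not None:
        # -ppn limits how many processes are placed on each host.
        cmd += ["-ppn", str(ppn)]
    cmd += ["-n", str(nprocs), program]
    return cmd


# Two processes, one per host: the smallest case that spans two hosts.
# "./cxxcpi" is an assumed path to the test program.
print(shlex.join(mpiexec_command("./cxxcpi", 2, ppn=1)))
```

The printed line can be pasted into the job script as-is; nothing is
launched by the sketch itself.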

> All these systems have firewalls disabled so I can confirm that a 
> firewall is not the issue. The only thing that's not completely 
> standard is they use a bond of several eth devices for communication 
> and some fancier load-balancing network equipment.

Hmm, interesting.  Is it possible that the hostname provided by Slurm
maps to an eth device that does not have full connectivity?  Can you try
passing -disable-hostname-propagation to mpiexec to see if that helps?
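
One way to check the hostname question is to see which addresses each
hostname resolves to on each node and compare them with the address of
the bonded interface. The sketch below is only a diagnostic aid under
that assumption; resolve_ipv4 is a hypothetical helper, and the
hostnames to test are whatever Slurm reports for the job.

```python
import socket
import sys


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Pass the hostnames to check as arguments; default to the local hostname.
    for name in sys.argv[1:] or [socket.gethostname()]:
        try:
            print(name, "->", ", ".join(resolve_ipv4(name)))
        except OSError as exc:
            print(name, "-> resolution failed:", exc)
```

If a name resolves to an address that belongs to one of the individual
eth devices rather than the bond, that would be consistent with the
guess above, though it does not prove it.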

 -- Pavan

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji
