[mpich-discuss] Hanging behavior with derived types in a 'user-defined gatherv'

Sewall, Jason jason.sewall at intel.com
Fri Apr 21 16:29:16 CDT 2017


> From: Latham, Robert J. [mailto:robl at mcs.anl.gov]
> Sent: Friday, April 21, 2017 4:32 PM
> 
> 
> good news, maybe?
> 
> I can't reproduce this with today's MPICH.  I get some debug-logging
> warnings that you didn't free two of your types, but it doesn't hang on
> my laptop.  Those datatype-related allocations are the only valgrind
> error I see when I run "mpiexec -np 3 ./mpi-gather 256 256"

That *is* good news. Thanks!

The leak is probably because I didn't call MPI_Type_free on the tmp types I use to build up the compound types. They are never committed, and it wasn't clear to me whether they needed to be freed.

> Is it possible a 256 by 256 grid could overflow an integer anywhere?
> I think master has some integer overflow fixes in the gather path that
> might not have made it into an MPICH release.  Aside from that, I'd
> have to dig through the history to figure out what might be different.

It's tricky to bisect when the error is 'this thing doesn't return at all'! I don't see how that size grid could get in trouble with overflow. It should use (256+4)^2 * 4 * 8 bytes = ~2 MB.
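The estimate above can be checked with a short sketch. The function and parameter names are hypothetical, and reading the factors as 4 values per cell at 8 bytes each is an assumption about what the formula stands for.

```c
#include <limits.h>
#include <stdint.h>

/* Bytes for an n-by-n grid padded by `pad` cells in each dimension, with
 * `fields` values per cell of `elem_size` bytes each. Uses 64-bit
 * arithmetic so the check itself cannot overflow at these sizes. */
static int64_t grid_bytes(int64_t n, int64_t pad, int64_t fields,
                          int64_t elem_size)
{
    int64_t side = n + pad;
    return side * side * fields * elem_size;
}

/* True if the value fits in a non-negative C int, the type used for
 * counts in most MPI calls. */
static int fits_in_int(int64_t v)
{
    return v >= 0 && v <= INT_MAX;
}
```

With these factors, grid_bytes(256, 4, 4, 8) is 2163200 bytes, roughly a thousand times below INT_MAX. A byte count computed this way would not exceed a 32-bit int until the grid is around 8192 cells on a side.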

The fact that it only shows up with 3 or more ranks suggests that it might be more than just the grid size, but maybe I'm wrong.

Now I need to see about getting the Intel MPI fixed...

Cheers,
Jason
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

