[mpich-discuss] ULFM and timeouts

Shead, Timothy tshead at sandia.gov
Tue Oct 13 11:52:08 CDT 2020


Folks:

I’m interested in running some fault tolerance experiments using the ULFM extensions in MPICH, but I’m stumbling right out of the gate. When I run the following Python code, rank 0 broadcasts a monotonically increasing value to the other ranks, and every rank prints it to stdout:

import itertools
import time
from mpi4py import MPI
comm = MPI.COMM_WORLD
for count in itertools.count():
    value = comm.bcast(count, root=0)
    print(f"{comm.rank}: {value}", flush=True)
    time.sleep(1)

So far, so good.  If I run the code using

mpiexec -n 3 --disable-auto-cleanup python test.py

I get the expected output. If I kill one of the three processes, the others keep running for a few more iterations thanks to --disable-auto-cleanup, until bcast() blocks. My assumption was that MPICH would eventually return an error code rather than blocking indefinitely. Otherwise, it doesn’t seem like my code will ever get the chance to call revoke() and shrink(). What am I missing here? Is there a way to specify timeouts for blocking operations? Am I limited to using nonblocking operations with ULFM? Could you refer me to any ULFM tests or examples that you used in development?
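To make the nonblocking (async) alternative concrete, here is a minimal sketch of how a timeout could be emulated by polling a nonblocking broadcast. It is only an illustration under stated assumptions: wait_with_timeout() and broadcast_loop() are hypothetical helper names, the 5-second timeout is arbitrary, and the sketch does not show how (or whether) MPICH reports the process failure. It only keeps a rank from blocking forever inside the collective.

```python
import time


def wait_with_timeout(request, timeout, poll_interval=0.01):
    """Poll request.Test() until it completes or `timeout` seconds elapse.

    Hypothetical helper: returns True if the request completed,
    False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if request.Test():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def broadcast_loop(timeout=5.0):
    """Same loop as test.py, but with Ibcast plus polling instead of bcast."""
    import array
    import itertools
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    buf = array.array("q", [0])  # buffer holding one 64-bit integer
    for count in itertools.count():
        if comm.rank == 0:
            buf[0] = count
        request = comm.Ibcast([buf, MPI.INT64_T], root=0)
        if not wait_with_timeout(request, timeout):
            # Assumption: a timeout is treated as a suspected peer failure.
            # The pending request is simply abandoned here; a real program
            # would still need a recovery step at this point.
            print(f"{comm.rank}: broadcast {count} timed out", flush=True)
            break
        print(f"{comm.rank}: {buf[0]}", flush=True)
        time.sleep(1)
```

Calling broadcast_loop() at the bottom of the script and launching it with the same mpiexec command as above would exercise the sketch. Polling trades a little latency (at most poll_interval per check) for the ability to give up, which is the behavior I was hoping the blocking call would provide on its own.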

Thanks in advance,
Tim


Timothy M. Shead
Sandia National Laboratories
tshead at sandia.gov
