[mpich-devel] Deadlock when using MPIX_Grequest interface

Latham, Robert J. robl at mcs.anl.gov
Tue Feb 28 14:52:19 CST 2017


On Tue, 2017-02-28 at 17:59 +0000, Giuseppe Congiu wrote:
> Hello,
> 
> I am trying to integrate an asynchronous API in ROMIO but I am
> experiencing some problems, probably caused by the improper use of
> the MPIX_Grequest interface.

I'm excited to see someone using the MPIX_Grequest interface.  We used
it to implement non-blocking collective I/O, and ran into some bad
interactions between libc's aio and the grequest callbacks.  I don't
know if you are running into something similar.


> Short description of the integrated functions:
> I have a ROMIO implementation with extended MPI-IO hints that can be
> used to divert written data, during collective I/O, to a local file
> system (typically running on a SSD). The full description of the
> modification can be found at this link:
> https://github.com/gcongiu/E10/tree/beegfs-devel

Do you have any desire or plans to submit these changes into upstream
ROMIO?  There have been quite a few changes since 3.1.4, particularly
for dealing with large transfers.  I don't think you're hitting an
overflow path here, but it's something I cannot rule out.

> The BeeGFS ROMIO driver uses the cache API internally provided by
> BeeGFS to move data around, while the custom ROMIO implementation
> provides support for any other file system (in the "common" layer).
> 
> Description of the problem:
> I am currently having problems making the BeeGFS driver work
> properly. More specifically, I am noticing that when the size of the
> shared file is large (32GB in my configuration) and I am writing
> multiple files one after the other (using a custom coll_perf
> benchmark), the i-th file (for 0 < i < 4) gets stuck.
> 
> BeeGFS provides a deeper_cache_flush_range() function to flush data
> in the cache to the global file. This is non-blocking and will submit
> a range (offset, length) for a certain filename to a cache daemon
> that transfers the data. Multiple ranges can be submitted one after
> the other.
> 
> Completion can be checked with deeper_cache_flush_wait(), using the
> filename as input. Since the MPI_Grequest model requires an external
> thread to make progress by invoking MPI_Grequest_complete(), I have
> opted for the MPIX_Grequest interface, which allows me to make
> progress (MPI_Grequest_complete) inside MPI while invoking
> MPI_Wait().
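
Just to make sure I'm reading the cache API right: from your
description and the snippet below, the pairing boils down to something
like this (a sketch based only on your mail; the header name is a
guess on my part, not taken from the BeeGFS sources):

#include <stddef.h>              /* size_t */
#include <sys/types.h>           /* off_t */
/* header name assumed; use whatever BeeGFS actually ships */
#include <deeper/deeper_cache.h>

/* flush one cached range of `filename` and block until it has landed
 * on the global file system */
int flush_one_range(const char *filename, off_t offset, size_t len, int flags)
{
    /* non-blocking: hands (offset, len) for `filename` to the cache daemon */
    if (deeper_cache_flush_range(filename, offset, len, flags) != DEEPER_RETVAL_SUCCESS)
        return -1;

    /* blocking: returns once all ranges submitted for `filename` are flushed */
    if (deeper_cache_flush_wait(filename, flags) != DEEPER_RETVAL_SUCCESS)
        return -1;

    return 0;
}

If that's right, the whole question is where the blocking _wait() call
is allowed to live.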

The original intent of the MPIX_Grequest extensions was that something
else (in our case the AIO library) was making progress in the
background, and all we needed was a way to check whether things were
done or not.  I suspect the callbacks could do some work, but blocking
callbacks haven't seen a lot of testing (in 10+ years the extensions
themselves haven't seen much activity, either).
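
For contrast, here's the plain MPI_Grequest model in a minimal sketch
(nothing BeeGFS-specific; it only shows where progress is supposed to
come from when the MPIX extensions are not used): the library runs its
own thread, the thread does the slow work and calls
MPI_Grequest_complete(), and MPI_Wait() merely notices that.

#include <mpi.h>
#include <pthread.h>

struct async_op {
    MPI_Request greq;                   /* the generalized request */
    /* ... whatever the operation itself needs ... */
};

static int query_fn(void *extra_state, MPI_Status *status)
{
    MPI_Status_set_cancelled(status, 0);
    MPI_Status_set_elements(status, MPI_BYTE, 0);  /* nothing to report in this sketch */
    status->MPI_SOURCE = MPI_UNDEFINED;
    status->MPI_TAG = MPI_UNDEFINED;
    return MPI_SUCCESS;
}

static int free_fn(void *extra_state) { return MPI_SUCCESS; }

static int cancel_fn(void *extra_state, int complete) { return MPI_SUCCESS; }

static void *worker(void *arg)
{
    struct async_op *op = (struct async_op *)arg;
    /* ... do the slow thing here (e.g. the blocking flush) ... */
    MPI_Grequest_complete(op->greq);    /* progress comes from this thread */
    return NULL;
}

/* start the operation; the caller later does MPI_Wait(&op->greq, &status) */
void async_start(struct async_op *op, pthread_t *tid)
{
    MPI_Grequest_start(query_fn, free_fn, cancel_fn, op, &op->greq);
    pthread_create(tid, NULL, worker, op);
}

With the MPIX extensions you get to drop the thread, but then the
poll/wait callbacks are the only place completion can be observed, and
they run inside MPI's progress loop.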

> 
> deeper_cache_flush_range() does not require the application to keep
> any handle for the submitted request. The only thing the application
> needs is the name of the file to pass to deeper_cache_flush_wait().
> 
> Here is a code snippet of the corresponding non-blocking
> implementation in ROMIO:
> 
> /* arguments for callback function */
> struct callback {
>     ADIO_File fd_;
>     ADIOI_Sync_req_t req_;
> };
> 
> /*
>  * ADIOI_BEEGFS_Sync_thread_start - start synchronisation of req
>  */
> int ADIOI_BEEGFS_Sync_thread_start(ADIOI_Sync_thread_t t) {
>     ADIOI_Atomic_queue_t q = t->sub_;
>     ADIOI_Sync_req_t r;
>     int retval, count, fflags, error_code, i;
>     ADIO_Offset offset, len;
>     MPI_Count datatype_size;
>     MPI_Datatype datatype;
>     ADIO_Request *req;
>     char myname[] = "ADIOI_BEEGFS_SYNC_THREAD_START";
> 
>     r = ADIOI_Atomic_queue_front(q);
>     ADIOI_Atomic_queue_pop(q);
> 
>     ADIOI_Sync_req_get_key(r, ADIOI_SYNC_ALL, &offset,
> 	    &datatype, &count, &req, &error_code, &fflags);
> 
>     MPI_Type_size_x(datatype, &datatype_size);
>     len = (ADIO_Offset)datatype_size * (ADIO_Offset)count;
> 
>     retval = deeper_cache_flush_range(t->fd_->filename,
>                                       (off_t)offset, (size_t)len, fflags);
> 
>     if (retval == DEEPER_RETVAL_SUCCESS && ADIOI_BEEGFS_greq_class == 1) {
>         MPIX_Grequest_class_create(ADIOI_BEEGFS_Sync_req_query,
>                                    ADIOI_BEEGFS_Sync_req_free,
>                                    MPIU_Greq_cancel_fn,
>                                    ADIOI_BEEGFS_Sync_req_poll,
>                                    ADIOI_BEEGFS_Sync_req_wait,
>                                    &ADIOI_BEEGFS_greq_class);
>     } else {
>         /* --BEGIN ERROR HANDLING-- */
>         return MPIO_Err_create_code(MPI_SUCCESS,
>                                     MPIR_ERR_RECOVERABLE,
>                                     "ADIOI_BEEGFS_Cache_sync_req",
>                                     __LINE__, MPI_ERR_IO, "**io %s",
>                                     strerror(errno));
>         /* --END ERROR HANDLING-- */
>     }
> 
>     /* init args for the callback functions */
>     struct callback *args =
>         (struct callback *) ADIOI_Malloc(sizeof(struct callback));
>     args->fd_ = t->fd_;
>     args->req_ = r;
> 
>     MPIX_Grequest_class_allocate(ADIOI_BEEGFS_greq_class, args, req);
> 
>     return MPI_SUCCESS;
> }
> 
> /*
>  * ADIOI_BEEGFS_Sync_req_poll -
>  */
> int ADIOI_BEEGFS_Sync_req_poll(void *extra_state, MPI_Status *status)
> {
>     struct callback *cb = (struct callback *)extra_state;
>     ADIOI_Sync_req_t r = (ADIOI_Sync_req_t)cb->req_;
>     ADIO_File fd = (ADIO_File)cb->fd_;
>     char *filename = fd->filename;
>     int count, cache_flush_flags, error_code;
>     MPI_Datatype datatype;
>     ADIO_Offset offset;
>     MPI_Aint lb, extent;
>     ADIO_Offset len;
>     ADIO_Request *req;
> 
>     ADIOI_Sync_req_get_key(r, ADIOI_SYNC_ALL, &offset,
> 	    &datatype, &count, &req, &error_code, &cache_flush_flags);
> 
>     int retval = deeper_cache_flush_wait(filename, cache_flush_flags);
> 
>     MPI_Type_get_extent(datatype, &lb, &extent);
>     len = (ADIO_Offset)extent * (ADIO_Offset)count;
> 
>     if (fd->hints->e10_cache_coherent == ADIOI_HINT_ENABLE)
> 	ADIOI_UNLOCK(fd, offset, SEEK_SET, len);
> 
>     /* mark generalized request as completed */
>     MPI_Grequest_complete(*req);
> 
>     if (retval != DEEPER_RETVAL_SUCCESS)
>         goto fn_exit_error;
> 
>     MPI_Status_set_cancelled(status, 0);
>     MPI_Status_set_elements(status, datatype, count);
>     status->MPI_SOURCE = MPI_UNDEFINED;
>     status->MPI_TAG = MPI_UNDEFINED;
> 
>     ADIOI_Free(cb);
> 
>     return MPI_SUCCESS;
> 
> fn_exit_error:
>     ADIOI_Free(cb);
> 
>     return MPIO_Err_create_code(MPI_SUCCESS,
>                                 MPIR_ERR_RECOVERABLE,
>                                 "ADIOI_BEEGFS_Sync_req_poll",
>                                 __LINE__, MPI_ERR_IO, "**io %s",
>                                 strerror(errno));
> }
> 
> /*
>  * ADIOI_BEEGFS_Sync_req_wait -
>  */
> int ADIOI_BEEGFS_Sync_req_wait(int count, void **array_of_states,
>                                double timeout, MPI_Status *status) {
>     return ADIOI_BEEGFS_Sync_req_poll(*array_of_states, status);
> }
> 
> /*
>  * ADIOI_BEEGFS_Sync_req_query -
>  */
> int ADIOI_BEEGFS_Sync_req_query(void *extra_state, MPI_Status *status) {
>     return MPI_SUCCESS;
> }
> 
> /*
>  * ADIOI_BEEGFS_Sync_req_free -
>  */
> int ADIOI_BEEGFS_Sync_req_free(void *extra_state) {
>     return MPI_SUCCESS;
> }
> 
> /*
>  * ADIOI_BEEGFS_Sync_req_cancel -
>  */
> int ADIOI_BEEGFS_Sync_req_cancel(void *extra_state, int complete) {
>     return MPI_SUCCESS;
> }  
> 
> I have had a look at the implementation of MPI_File_iwrite() in
> ad_iwrite.c, but that one uses POSIX AIO, so I am not sure I am doing
> things properly here.

The poll_fn() should be non-blocking: think MPI_Test more than Unix
poll/select/epoll.  It's going to get called in a tight loop.
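
To make that concrete: a poll_fn should check, maybe complete, and
return.  Here's the shape I have in mind, under the (unverified)
assumption that the deeper cache API had a non-blocking completion
check -- both deeper_cache_flush_test() and the greq_ field below are
stand-ins I made up for illustration; I don't know that BeeGFS
provides such a call:

/* deeper_cache_flush_test() is hypothetical; the real API may only
 * offer the blocking deeper_cache_flush_wait().  `cb->greq_` is a
 * stand-in for wherever you keep the generalized request handle. */
int sync_req_poll_sketch(void *extra_state, MPI_Status *status)
{
    struct callback *cb = (struct callback *)extra_state;
    int done = 0;

    /* must return immediately whether or not the flush has finished */
    if (deeper_cache_flush_test(cb->fd_->filename, &done) != DEEPER_RETVAL_SUCCESS)
        return MPI_ERR_IO;          /* real code: MPIO_Err_create_code(...) */

    if (!done)
        return MPI_SUCCESS;         /* not finished yet; MPI will call us again */

    /* only now fill in the status and complete the generalized request */
    MPI_Status_set_cancelled(status, 0);
    MPI_Status_set_elements(status, MPI_BYTE, 0);   /* real code: datatype/count */
    status->MPI_SOURCE = MPI_UNDEFINED;
    status->MPI_TAG = MPI_UNDEFINED;
    MPI_Grequest_complete(*cb->greq_);

    return MPI_SUCCESS;
}

As I remember the extension, the blocking deeper_cache_flush_wait() is
what the wait_fn slot is for -- that's the callback MPI invokes when
the caller has already said it is willing to block -- though even
there it should respect the timeout argument rather than wait forever.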

Can you provide a stack trace for some of the stuck processes?


> Additionally, the problem does not show up for small files (e.g. 1GB),
> which makes it harder to debug.
> 
> BTW, I am using weak scaling, so every process always writes the
> same amount of data (64MB); I just change the number of procs in the
> test.
> 
> To test the modification with coll_perf I am using two configurations:
> 512 procs (64 nodes, 8 procs/node) and 16 procs (8 nodes,
> 2 procs/node).
> 
> The 16-proc configuration writes 4 files of 1GB each (16 x 64MB); the
> 512-proc configuration writes 4 files of 32GB each (512 x 64MB).
> 
> Can someone spot any problem in my code?

Might take a few rounds but I'm sure we'll figure it out!
==rob

> 
> Thanks,
> 
> -- 
> Giuseppe Congiu · Research Engineer II
> Seagate Technology, LLC
> office: +44 (0)23 9249 6082 · mobile: 
> www.seagate.com

