[mpich-discuss] Single node IO Aggregator setup

Jon Povich jon.povich at convergecfd.com
Tue Jun 21 12:48:46 CDT 2016

I'm trying to simulate a cluster setup where only the I/O aggregators have
access to the working directory. Is this feasible with ROMIO hints?

  A) The working directory is on Node0's local hard drive.
  B) Remote Node1 has no access to Node0's hard drive.
  C) Run a case where only rank 0 on Node0 serves as the I/O aggregator.

Simple MPI I/O Code:

#include "mpi.h"

#include <string.h>
#include <stdio.h>
#include <unistd.h>

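int main(int argc, char *argv[])
{
   MPI_File fh;
   MPI_Info info;
   int amode, mpi_ret;
   char filename[64];

   MPI_Init(&argc, &argv);

   // The info object has to be created before hints can be set on it
   MPI_Info_create(&info);
   MPI_Info_set(info, "cb_nodes", "1");
   MPI_Info_set(info, "no_indep_rw", "true");

   // Assumed access mode (create + write-only); the actual value is not shown
   amode = MPI_MODE_CREATE | MPI_MODE_WRONLY;

   strcpy(filename, "mpio_test.out");
   mpi_ret = MPI_File_open(MPI_COMM_WORLD, filename, amode, info, &fh);

   if(mpi_ret != MPI_SUCCESS)
   {
      char       mpi_err_buf[MPI_MAX_ERROR_STRING];
      int        mpi_err_len;

      MPI_Error_string(mpi_ret, mpi_err_buf, &mpi_err_len);
      fprintf(stderr, "Failed MPI_File_open. Filename = %s, error = %s\n",
              filename, mpi_err_buf);
      return -1;
   }

   // Force I/O errors associated with this file to abort
   MPI_File_set_errhandler(fh, MPI_ERRORS_ARE_FATAL);

   // Assumed cleanup; the calls used here are not shown
   MPI_File_close(&fh);
   MPI_Info_free(&info);
   MPI_Finalize();

   return 0;
}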

Note the hardcoded "MPI_Info_set(info, "cb_nodes", "1");" and
"MPI_Info_set(info, "no_indep_rw", "true");".

The above code runs fine when run from multiple cores on Node0. As soon as
I add a Node1 to the mix, I get the following error:

jpovich at crane mini_test> mpirun -np 2 -hosts crane,node1
[mpiexec at crane.csi.com] control_cb (pm/pmiserv/pmiserv_cb.c:200): assert (!closed) failed
[mpiexec at crane.csi.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at crane.csi.com] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at crane.csi.com] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

The cb_nodes setting seems to have no impact on behavior. I get the same
error if I comment out the cb_nodes and no_indep_rw settings.

Any help is greatly appreciated,

