[mpich-devel] MPI_Alltoall fails when called with newly created communicators

Rob Latham robl at mcs.anl.gov
Wed Oct 21 09:32:35 CDT 2015


You've provided a lot of information, thanks.  Nothing looks obviously 
wrong.


You've set up 4 communicators, each to handle a separate portion of the 
collective I/O request, but then when it comes time to actually do 
collective I/O you get a crash.  It sounds like maybe a comm_world 
snuck in somewhere?  Or there is a collective routine mismatched with 
another process somewhere?
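
For example (purely illustrative, not your code), this is the kind of 
mismatch I mean: buffers sized for a 128-process sub-communicator, but 
the collective issued over the communicator the file was opened with:

    /* Sketch only (assumed names): "newcomm" is one of the 4 new
     * sub-communicators, "oldcomm" is fd->comm, the original
     * 512-process communicator.  Needs <mpi.h> and <stdlib.h>. */
    static void exchange_counts(MPI_Comm newcomm, MPI_Comm oldcomm)
    {
        int nprocs_new;
        MPI_Comm_size(newcomm, &nprocs_new);               /* 128 */
        int *scounts = (int *) calloc(nprocs_new, sizeof(int));
        int *rcounts = (int *) calloc(nprocs_new, sizeof(int));
        /* bug: all 512 ranks of oldcomm participate, but each buffer
         * only has room for the 128 ranks of newcomm */
        MPI_Alltoall(scounts, 1, MPI_INT, rcounts, 1, MPI_INT, oldcomm);
        free(scounts);
        free(rcounts);
    }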

Here's how I'd start debugging this:

- put in a few MPI_Barriers (over fd->comm) in selected places, just for 
debugging.  These barriers will help (a bit) to keep things temporally 
in step, and if there is a mismatched collective you'll find a hanging 
process.

- When it comes time to start two-phase I/O, set up a token so that only 
one communicator at a time is operating.  Again, not something you want 
long-term, but you'll find out which communicator is causing headaches 
(both tricks are sketched after this list).

- then I'd stare at my screen for a while until I came up with a next 
thing to try...
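
Something like this (throwaway debugging code; "ngroups", "group_id" and 
"newcomm" stand for whatever names you used when splitting):

    /* (1) temporal sync: a mismatched collective will show up as a
     *     rank stuck in one of these barriers */
    MPI_Barrier(fd->comm);

    /* (2) token: let only one sub-communicator run its two-phase
     *     exchange at a time, so you can see which one blows up */
    for (int turn = 0; turn < ngroups; turn++) {
        if (turn == group_id) {
            /* ... Alltoall + two-phase write over newcomm ... */
        }
        MPI_Barrier(fd->comm);   /* wait before the next group goes */
    }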

==rob


On 10/21/2015 02:06 AM, Giuseppe Congiu wrote:
> Hello everyone,
>
> I am currently working on a partitioned collective I/O implementation
> for an EU project. The idea is to partition the accessed file range into
> disjoint access regions (i.e., regions that do not overlap with each
> other) and assign the processes from these regions to independent
> communicators created by splitting the original communicator with
> MPI_Comm_split, very much like what is done in the ParColl paper or in
> the Memory Conscious Collective I/O paper.
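>
> In essence the split is along these lines (identifiers here are only
> illustrative, not the actual code):
>
>     /* my_region: index of the disjoint access region that this
>      * process's requests fall into, used as the split color;
>      * keying on the old rank keeps the original ordering */
>     MPI_Comm newcomm;
>     MPI_Comm_split(fd->comm, my_region, myrank, &newcomm);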
>
> The code is pretty simple. It just uses the access pattern information
> to create the new communicators, assign aggregators to each of them and
> then compute file domains and data dependencies for every process using
> default functions from ROMIO (i.e. ADIOI_Calc_file_domains,
> ADIOI_Calc_my_req, ADIOI_Calc_others_req).
>
> I am currently testing my implementation using coll_perf with 512
> processes. Every process writes 64MB of data for a total of 32GB (no
> reads are performed). The ranks belonging to the non-overlapping regions
> (file areas) are printed out by the process with rank 0 in the global
> communicator:
>
> [romio/adio/common/ad_aggregate.c:0] file_area_count = 8
> [romio/adio/common/ad_aggregate.c:0] file_area_ranklist[0] = 0   1   2   3 ... 63
> [romio/adio/common/ad_aggregate.c:0] file_area_ranklist[1] = 64  65  66  67 ... 127
> [romio/adio/common/ad_aggregate.c:0] file_area_ranklist[2] = 128 129 130 131 ... 191
> [romio/adio/common/ad_aggregate.c:0] file_area_ranklist[3] = 192 193 194 195 ... 255
> [romio/adio/common/ad_aggregate.c:0] file_area_ranklist[4] = 256 257 258 259 ... 319
> [romio/adio/common/ad_aggregate.c:0] file_area_ranklist[5] = 320 321 322 323 ... 383
> [romio/adio/common/ad_aggregate.c:0] file_area_ranklist[6] = 384 385 386 387 ... 447
> [romio/adio/common/ad_aggregate.c:0] file_area_ranklist[7] = 448 449 450 451 ... 511
>
>
> Afterwards, the aggregator in each communicator group prints its index
> in the aggregator ranklist and the corresponding rank entry:
>
> [romio/adio/common/ad_aggregate.c:0] my_cb_nodes_index = 0
> [romio/adio/common/ad_aggregate.c:0] fd->hints->ranklist[0] = 0
> [romio/adio/common/ad_aggregate.c:0] my_cb_nodes_index = 0
> [romio/adio/common/ad_aggregate.c:0] fd->hints->ranklist[0] = 0
> [romio/adio/common/ad_aggregate.c:0] my_cb_nodes_index = 0
> [romio/adio/common/ad_aggregate.c:0] fd->hints->ranklist[0] = 0
> [romio/adio/common/ad_aggregate.c:0] my_cb_nodes_index = 0
> [romio/adio/common/ad_aggregate.c:0] fd->hints->ranklist[0] = 0
>
>
> In this particular case I have set fd->hints->cb_nodes to 4. This means
> that there are 4 aggregators but 8 non-overlapping regions (file areas).
> Thus, I create only 4 communicators and assign 2 file areas to each of
> them. In the messages printed above, the only aggregator in each
> communicator is printing its information.
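>
> In other words (again with illustrative names), the color used for the
> split is derived from the file area roughly like this:
>
>     int areas_per_comm = file_area_count / fd->hints->cb_nodes; /* 8/4 = 2 */
>     int color = my_file_area / areas_per_comm;                  /* 0..3 */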
>
> Then every process prints its group name, the number of processes in the
> group, and its ranks in the old and in the new communicator:
>
> group-0:128:0:0
> group-0:128:1:1
> group-0:128:2:2
> group-0:128:3:3
> ---
> group-0:128:127:127
> group-1:128:128:0
> group-1:128:129:1
> group-1:128:130:2
> group-1:128:131:3
> ---
> group-1:128:255:127
> group-2:128:256:0
> group-2:128:257:1
> group-2:128:258:2
> group-2:128:259:3
> ---
> group-2:128:383:127
> group-3:128:384:0
> group-3:128:385:1
> group-3:128:386:2
> group-3:128:387:3
> ---
> group-3:128:511:127
>
>
> So far it looks like the communicators are created properly. After
> calling the ADIOI_Calc_* functions, every aggregator prints the MIN and
> MAX offsets of its independent access range:
>
> [romio/adio/common/ad_write_coll.c:0] st_loc = 17179869184, end_loc = 25769803775
> [romio/adio/common/ad_write_coll.c:0] st_loc = 8589934592, end_loc = 17179869183
> [romio/adio/common/ad_write_coll.c:0] st_loc = 25769803776, end_loc = 34359738367
> [romio/adio/common/ad_write_coll.c:0] st_loc = 0, end_loc = 8589934591
>
>
> Even though the new communicators seem to be created correctly,
> something bad happens when two-phase I/O starts and the first
> MPI_Alltoall() is called to exchange access information among processes.
> MPI_Alltoall is called on each new communicator separately, and every
> buffer passed to it is sized for the nprocs of the corresponding new
> communicator. Nevertheless, for some reason I am getting the following
> error message:
>
> Fatal error in PMPI_Alltoall: Other MPI error, error stack:
> PMPI_Alltoall(888)......: MPI_Alltoall(sbuf=0x2551e98, scount=1, MPI_INT, rbuf=0x2551be8, rcount=1, MPI_INT, comm=0x84000001) failed
> MPIR_Alltoall_impl(760).:
> MPIR_Alltoall(725)......:
> MPIR_Alltoall_intra(283):
>
> repeated many more times.
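>
> For reference, the call that fails is essentially the count exchange done
> in ADIOI_Calc_others_req, with both arrays allocated for the size of the
> new communicator (simplified, variable names approximate):
>
>     int nprocs_new;
>     MPI_Comm_size(newcomm, &nprocs_new);           /* 128 in each group */
>     count_my_req_per_proc = (int *) ADIOI_Malloc(nprocs_new * sizeof(int));
>     count_others_req_per_proc = (int *) ADIOI_Malloc(nprocs_new * sizeof(int));
>     MPI_Alltoall(count_my_req_per_proc, 1, MPI_INT,
>                  count_others_req_per_proc, 1, MPI_INT, newcomm);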
>
> This is actually strange because, if I look at the file after coll_perf
> crashes, there is some data in it (~200MB), as if the problem were caused
> by only one of the four communicators.
>
> Does anybody have any idea of what is happening and why?
>
> Thanks,
>
> --
> Giuseppe Congiu · Research Engineer II
> Seagate Technology, LLC
> office: +44 (0)23 9249 6082 · mobile:
> www.seagate.com
>
>

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

