[mpich-devel] MPI_Alltoall fails when called with newly created communicators

Giuseppe Congiu giuseppe.congiu at seagate.com
Wed Oct 21 02:06:50 CDT 2015


Hello everyone,

I am currently working on a partitioned collective I/O implementation for an
EU project. The idea is to partition the accessed file range into disjoint
access regions (i.e., regions that do not overlap with each other) and
assign the processes touching each region to independent communicators
created by splitting the original communicator with MPI_Comm_split, very
much like what is done in the ParColl paper or in the Memory-Conscious
Collective I/O paper.
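
In essence, the splitting step boils down to something like this (a
simplified sketch, not my actual code; my_color is a placeholder for the
index of the disjoint access region, or group of regions, a rank belongs
to):

#include <mpi.h>

/* Split "comm" so that ranks touching the same access region (or
 * group of regions) end up in the same sub-communicator. */
static MPI_Comm split_by_access_region(MPI_Comm comm, int my_color)
{
    int rank;
    MPI_Comm newcomm;

    MPI_Comm_rank(comm, &rank);

    /* Same color -> same sub-communicator; key = rank keeps the
     * original rank order inside each group. */
    MPI_Comm_split(comm, my_color, rank, &newcomm);
    return newcomm;
}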

The code is pretty simple. It just uses the access pattern information to
create the new communicators, assign aggregators to each of them and then
compute file domains and data dependencies for every process using default
functions from ROMIO (i.e. ADIOI_Calc_file_domains, ADIOI_Calc_my_req,
ADIOI_Calc_others_req).
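
Conceptually, the file domain step then carves each group's aggregate byte
range into one contiguous chunk per aggregator of that group. The following
is only an illustration with made-up names, not the real
ADIOI_Calc_file_domains interface:

typedef long long ADIO_Offset_ish;  /* stand-in for ADIO_Offset */

/* Divide the group's [min_off, max_off] byte range into cb_nodes
 * contiguous file domains, one per aggregator in that group. */
static void calc_file_domains_sketch(ADIO_Offset_ish min_off,
                                     ADIO_Offset_ish max_off,
                                     int cb_nodes,
                                     ADIO_Offset_ish *fd_start,
                                     ADIO_Offset_ish *fd_end)
{
    /* Size of each file domain, rounded up. */
    ADIO_Offset_ish fd_size = (max_off - min_off + cb_nodes) / cb_nodes;

    for (int i = 0; i < cb_nodes; i++) {
        fd_start[i] = min_off + i * fd_size;
        fd_end[i]   = fd_start[i] + fd_size - 1;
        if (fd_end[i] > max_off)
            fd_end[i] = max_off;
    }
}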

I am currently testing my implementation using coll_perf with 512
processes. Every process writes 64 MB of data, for a total of 32 GB (no
reads are performed). The ranks belonging to each non-overlapping region
(file area) are printed by the process with rank 0 in the global
communicator:

[romio/adio/common/ad_aggregate.c:0] file_area_count = 8
[romio/adio/common/ad_aggregate.c:0] file_area_ranklist[0] = 0 1 2 3 ... 63
[romio/adio/common/ad_aggregate.c:0] file_area_ranklist[1] = 64 65 66 67 ... 127
[romio/adio/common/ad_aggregate.c:0] file_area_ranklist[2] = 128 129 130 131 ... 191
[romio/adio/common/ad_aggregate.c:0] file_area_ranklist[3] = 192 193 194 195 ... 255
[romio/adio/common/ad_aggregate.c:0] file_area_ranklist[4] = 256 257 258 259 ... 319
[romio/adio/common/ad_aggregate.c:0] file_area_ranklist[5] = 320 321 322 323 ... 383
[romio/adio/common/ad_aggregate.c:0] file_area_ranklist[6] = 384 385 386 387 ... 447
[romio/adio/common/ad_aggregate.c:0] file_area_ranklist[7] = 448 449 450 451 ... 511


Afterwards, the aggregator of each communication group prints its index in
the aggregator list (my_cb_nodes_index) and the corresponding entry of
fd->hints->ranklist:

[romio/adio/common/ad_aggregate.c:0] my_cb_nodes_index = 0
[romio/adio/common/ad_aggregate.c:0] fd->hints->ranklist[0] = 0
[romio/adio/common/ad_aggregate.c:0] my_cb_nodes_index = 0
[romio/adio/common/ad_aggregate.c:0] fd->hints->ranklist[0] = 0
[romio/adio/common/ad_aggregate.c:0] my_cb_nodes_index = 0
[romio/adio/common/ad_aggregate.c:0] fd->hints->ranklist[0] = 0
[romio/adio/common/ad_aggregate.c:0] my_cb_nodes_index = 0
[romio/adio/common/ad_aggregate.c:0] fd->hints->ranklist[0] = 0


In this particular case I have set fd->hints->cb_nodes to 4, which means
there are 4 aggregators but 8 non-overlapping regions (file areas). Thus, I
create only 4 communicators and assign 2 file areas to each of them. In the
messages above, the single aggregator of each communicator is printing its
information.
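
One way to write that assignment of file areas to communicators is the
following (illustrative only, not necessarily the exact arithmetic in my
code); the result is what would be used as the color in the communicator
split sketched above:

static int file_area_to_group(int file_area, int file_area_count, int cb_nodes)
{
    /* With file_area_count = 8 and cb_nodes = 4: areas {0,1} -> group 0,
     * {2,3} -> group 1, {4,5} -> group 2, {6,7} -> group 3. */
    return file_area * cb_nodes / file_area_count;
}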

Then every process prints its group name, the number of processes in the
group, and its ranks in the old and new communicators:

group-0:128:0:0
group-0:128:1:1
group-0:128:2:2
group-0:128:3:3
---
group-0:128:127:127
group-1:128:128:0
group-1:128:129:1
group-1:128:130:2
group-1:128:131:3
---
group-1:128:255:127
group-2:128:256:0
group-2:128:257:1
group-2:128:258:2
group-2:128:259:3
---
group-2:128:383:127
group-3:128:384:0
group-3:128:385:1
group-3:128:386:2
group-3:128:387:3
---
group-3:128:511:127


So far it looks like the communicators are created properly. After calling
the ADIOI_Calc_* functions, every aggregator prints the minimum and maximum
offsets (st_loc, end_loc) of its access range:

[romio/adio/common/ad_write_coll.c:0] st_loc = 17179869184, end_loc = 25769803775
[romio/adio/common/ad_write_coll.c:0] st_loc = 8589934592, end_loc = 17179869183
[romio/adio/common/ad_write_coll.c:0] st_loc = 25769803776, end_loc = 34359738367
[romio/adio/common/ad_write_coll.c:0] st_loc = 0, end_loc = 8589934591


Even though the new communicators seem to be created correctly, something
bad happens when two-phase I/O starts and the first MPI_Alltoall() is
called to exchange access information among processes. MPI_Alltoall is
called on each new communicator separately, and every buffer passed to it
is sized with the nprocs of the corresponding new communicator.
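
The exchange essentially boils down to the following (a simplified sketch
of the count exchange in ADIOI_Calc_others_req, not the actual ROMIO code);
the scount=1, MPI_INT arguments in the error stack below seem to match this
call:

#include <mpi.h>
#include <stdlib.h>

/* Each rank tells every other rank in ITS sub-communicator how many
 * requests it has for that rank.  Both buffers are sized with the
 * nprocs of the new communicator, not the global one. */
static int *exchange_req_counts(MPI_Comm subcomm,
                                const int *count_my_req_per_proc)
{
    int nprocs;
    MPI_Comm_size(subcomm, &nprocs);   /* size of the new communicator */

    int *count_others_req_per_proc = malloc(nprocs * sizeof(int));

    MPI_Alltoall((void *) count_my_req_per_proc, 1, MPI_INT,
                 count_others_req_per_proc, 1, MPI_INT, subcomm);

    return count_others_req_per_proc;
}

Nevertheless, for some reason I am getting the following error message: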

Fatal error in PMPI_Alltoall: Other MPI error, error stack:
PMPI_Alltoall(888)......: MPI_Alltoall(sbuf=0x2551e98, scount=1, MPI_INT, rbuf=0x2551be8, rcount=1, MPI_INT, comm=0x84000001) failed
MPIR_Alltoall_impl(760).:
MPIR_Alltoall(725)......:
MPIR_Alltoall_intra(283):

The same error stack is repeated many more times.

This is actually strange because, if I look at the file after coll_perf
crashes, there is some data in it (~200 MB), as if the problem were caused
by only one of the four communicators.

Does anybody have any idea of what is happening and why?

Thanks,

-- 
Giuseppe Congiu · Research Engineer II
Seagate Technology, LLC
office: +44 (0)23 9249 6082 · mobile:
www.seagate.com