[mpich-discuss] Questions about MPICH multi-thread support

Zhou, Hui zhouh at anl.gov
Wed Jan 27 08:44:35 CST 2021



Reply inline below.


Question 1. In our application there are two modules that run concurrently and rely on MPI (using two different communicators for isolation). During development we observed the following behavior: when one of the modules sends a large buffer (say 1 GB or more), the other one hits a kind of contention problem where its messages are held back for the duration of the large data transfer, even though the second module's messages are really small. Is this a known behavior? Is there any kind of message fragmentation implemented in MPICH that would allow concurrent message progression? If not, are there any plans to implement such a feature?

Yes, this is a known behavior. By default, MPICH imposes a global lock to ensure correctness. In the latest 3.4 release, we have experimental “multi-vci” support that allows operations on different communicators to progress concurrently. You can build it with `--enable-thread-cs=per-vci --with-ch4-max-vcis=8`. The `8` is an arbitrary maximum chosen to balance initialization overhead; you can raise it to 64. You also need to set the environment variable `MPIR_CVAR_CH4_NUM_VCIS=8` to enable multi-vci at runtime. Again, 8 is arbitrary; any value up to the configured maximum works.
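Put together, the build and run steps might look roughly like the sketch below. The configure options and the environment variable are the ones named above; the install prefix `$HOME/mpich-vci`, the process count, and the binary `./my_app` are hypothetical placeholders, and the commands assume you are at the top of an MPICH 3.4 source tree.

```shell
# Build time: enable per-VCI critical sections and compile in up to 8 VCIs.
# $HOME/mpich-vci is a hypothetical install prefix; adjust to taste.
./configure --prefix="$HOME/mpich-vci" \
    --enable-thread-cs=per-vci \
    --with-ch4-max-vcis=8
make && make install

# Run time: request 8 VCIs (must not exceed the configured maximum).
export MPIR_CVAR_CH4_NUM_VCIS=8

# ./my_app is a hypothetical application binary built against this install.
"$HOME/mpich-vci/bin/mpiexec" -n 2 ./my_app
```

Since the feature is experimental, it is worth checking that the small messages of the second module really do progress during the large transfer before relying on it.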

Question 2. We have already used one of the scenarios described in the previous email, where one thread of process A sends a message to one of multiple threads on process B, all of which are waiting on the same message triple through non-blocking receives. The expected behavior is that only one of the threads on process B receives the message, but during our tests we hit a failure where a segfault was raised from inside the MPI_Recv function in one of the receiving threads. Is the expected behavior correct? Is there any known issue with this use case that would trigger the described problem? I know this may well be the application's fault; I just want to check whether there is any known issue on the matter.

This sounds like a potential bug. Could you file an issue at https://github.com/pmodels/mpich/issues? We’ll need more details, and ideally a reproducer, to debug it effectively.

--
Hui Zhou