[mpich-discuss] How locking on multi-VCI works

Mon Dec 6 13:39:34 CST 2021

Guilherme,

With mpich-4.0b1, you can try building mpich with --enable-ch4-vci-method=implicit. In addition, try passing the "mpi_assert_no_any_tag" info hint during communicator creation. That will allow mpich to distribute operations with different tags, even within the same communicator, to different VCIs, again, in a hashing fashion.

This feature is still experimental, so please file GitHub issues if you encounter problems.

It is also possible to explicitly specify which VCI you would like to use for arbitrary individual pt2pt communication by embedding the vci information in the tag. This feature is not exactly standard-conforming, thus we are only using it for experimental purposes. You'll need to edit src/mpid/ch4/src/ch4_vci.h and remove the line "#error MPICH_VCI__TAG not implemented." to enable it. With the tag method, tags are restricted to 15 bits: the highest 5 bits designate the source vci, the middle 5 bits designate the destination vci, and the lowest 5 bits are left for user-defined purposes. The drawback, of course, is that you will have a much-reduced tag space for application logic.
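
To make the 15-bit layout concrete, here is a small sketch of packing and unpacking such a tag. The helper names are hypothetical, and the exact bit positions are an assumption read off the description above (source vci on top, destination vci in the middle, user bits at the bottom), not taken from the MPICH source:

```c
/* Assumed layout from the description above (15 usable bits):
 *   bits 14..10  source VCI
 *   bits  9..5   destination VCI
 *   bits  4..0   user-defined tag
 * The exact bit positions inside MPICH are an assumption here. */
#define VCI_FIELD_BITS 5
#define VCI_FIELD_MASK ((1 << VCI_FIELD_BITS) - 1)   /* 0x1f */

/* Hypothetical helper: pack the three 5-bit fields into one tag.
 * Values wider than 5 bits are silently truncated by the mask. */
static int vci_tag_pack(int src_vci, int dst_vci, int user_tag)
{
    return ((src_vci & VCI_FIELD_MASK) << (2 * VCI_FIELD_BITS)) |
           ((dst_vci & VCI_FIELD_MASK) << VCI_FIELD_BITS) |
           (user_tag & VCI_FIELD_MASK);
}

/* Hypothetical helpers: extract each field again. */
static int vci_tag_src(int tag)  { return (tag >> (2 * VCI_FIELD_BITS)) & VCI_FIELD_MASK; }
static int vci_tag_dst(int tag)  { return (tag >> VCI_FIELD_BITS) & VCI_FIELD_MASK; }
static int vci_tag_user(int tag) { return tag & VCI_FIELD_MASK; }
```

With only 5 user bits, the application is left with 32 distinct tag values per (source vci, destination vci) pair, which is the reduced tag space mentioned above.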

Unlike the tag method, both the communicator and implicit vci-methods are MPI standard conformant, so you can use them with any MPI implementation without breaking your applications. The actual performance, of course, will depend on the implementation.

--
Hui Zhou
________________________________
From: Guilherme Valarini <guilherme.a.valarini at gmail.com>
Sent: Monday, December 6, 2021 12:16 PM
To: Zhou, Hui <zhouh at anl.gov>
Cc: discuss at mpich.org <discuss at mpich.org>
Subject: Re: [mpich-discuss] How locking on multi-VCI works

Hello Zhou,

First of all, thanks for the help!

Let me explain my use case a bit: my team and I have a distributed and multithreaded event system implemented on top of MPI, where multiple non-blocking MPI messages are exchanged between multiple nodes. Checking our internal performance traces, we saw that there was some contention happening at the MPI layer, especially when many threads were being used.

Digging a little bit deeper we found some studies explaining the current multithread support of many MPI implementations and even limitations regarding the standard itself, which might explain the problems we encountered. Since each event is mapped to a unique TAG, that would be the preferred mechanism of extracting network parallelism. But since we also want to support other MPI implementations (e.g. openmpi), we think that using multiple communicators might be a better option.

I was hoping that messages sent to two different processes but from the same process through the same VCI would use two different locks at their origin. But now I see that a VCI is directly mapped to a hardware context of some sort. So it makes sense that the same lock would be shared between the two previously described messages.

If you have any other general hints on how to better extract network parallelism at the MPI level, I would be grateful. 😉

Note: Sorry for the duplicate. I forgot to reply to the mailing list as well.

Thanks again for the help,
Guilherme Valarini

On Mon, Dec 6, 2021 at 14:34, Zhou, Hui <zhouh at anl.gov> wrote:
The total number of VCIs is configured with --with-ch4-max-vcis=#. The maximum is 64. The default used to be 1, but it was changed to 64 in the 4.0b1 release. There is also an option to control the vci assignment method: --enable-ch4-vci-method={communicator,tag,implicit}. The default is communicator, with which we assign a vci to each communicator in a round-robin fashion. If you create communicators consecutively, they are expected to have different VCIs. The other vci-methods are at the experimental stage. If the communicator method is insufficient for your application, they may be worth a try. We'd like to understand your use case better before pointing you that way.

I don't fully understand your question. The vci locks are process-local locks, so if you have N VCIs, you will have N channels for each process. With vci-method=communicator, the vcis are matched one-to-one, i.e. rank 1 vci 1 only communicates with vci 1 on rank 2 (or on any other rank in the same communicator).
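
The locking consequence can be illustrated with a toy model. This is not MPICH code: it simply assumes, per the description above, that each communicator gets one VCI round-robin at creation and that each VCI has exactly one process-local lock. All names are made up for illustration:

```c
/* Toy model, not MPICH code: under vci-method=communicator, assume each
 * communicator is assigned one VCI round-robin at creation, and each VCI
 * has exactly one process-local lock. */
static int model_vci_for_comm(int comm_creation_index, int num_vcis)
{
    return comm_creation_index % num_vcis;
}

/* In this model the lock taken by a send depends only on the
 * communicator's VCI; the destination rank plays no part. */
static int model_lock_for_send(int comm_creation_index, int dest_rank, int num_vcis)
{
    (void) dest_rank;  /* deliberately unused */
    return model_vci_for_comm(comm_creation_index, num_vcis);
}
```

In this model, two sends on the same communicator to different ranks contend for the same lock, while sends on two consecutively created communicators do not (as long as fewer communicators than VCIs are in use).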

--
Hui Zhou
________________________________
From: Guilherme Valarini via discuss <discuss at mpich.org>
Sent: Monday, December 6, 2021 10:10 AM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Guilherme Valarini <guilherme.a.valarini at gmail.com>
Subject: [mpich-discuss] How locking on multi-VCI works

Hello everyone,

I have a question regarding multi-VCI support and possible lock contention in MPICH in multi-threaded environments.

I understand that there is a direct mapping between a VCI and a communicator, so global locking is avoided in a multi-threaded application. But I wanted to know: how do these VCIs work? When I have N VCIs, do I have N virtual channels per rank (thus, one global lock per VCI-rank pair) or only N channels in total (one lock per VCI)? I was wondering if, for example, two MPI_Sends targeting different ranks on the same comm might need to be synchronized using such a global lock.

Thanks for the help!
Guilherme Valarini