[mpich-discuss] Iprobe does not find all messages.

Thakur, Rajeev thakur at anl.gov
Tue Jun 29 11:31:50 CDT 2021


It shouldn’t matter whether it is eager or rendezvous. In rendezvous, the envelope that is sent first contains the message-matching information, so Iprobe can match it before the data itself has been transferred. Iprobe will initiate progress, so the next call to Iprobe should find a message if there is one.
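A toy Python model of the behaviour described above may help. It is a hypothetical illustration, not MPICH code: the class `ToyReceiver` and its methods are invented, and a real progress engine does much more (matching against posted receives, handling the unexpected queue, rendezvous handshakes). It only shows why sleeping does not help and why a second probe can succeed where the first one did not.

```python
import time
from collections import deque


class ToyReceiver:
    """Toy model of a receiving MPI process (hypothetical, NOT MPICH code).

    Envelopes that have arrived sit in `arrived` until the progress engine
    runs; only then are they moved into `matchable`, where a probe can see
    them. Progress runs only inside a call on this object, never on its own.
    """

    def __init__(self):
        self.arrived = deque()    # arrived but not yet processed
        self.matchable = deque()  # processed, visible to probes

    def deliver(self, envelope):
        # Simulates a worker's isend having put an envelope on the wire.
        self.arrived.append(envelope)

    def _make_progress(self):
        # Process everything that has arrived so far.
        while self.arrived:
            self.matchable.append(self.arrived.popleft())

    def iprobe(self):
        # Deliberately simplified: report what was matchable *before* this
        # call drove progress, so a newly arrived envelope is only seen by
        # the next probe.
        found = bool(self.matchable)
        self._make_progress()
        return found


receiver = ToyReceiver()
receiver.deliver({"source": 1, "tag": 0})
time.sleep(0.1)             # time passes, but nothing drives progress
first = receiver.iprobe()   # False: the envelope was not processed yet
second = receiver.iprobe()  # True: the previous probe made progress
print(first, second)        # prints: False True
```

In a real program the equivalent of the second call is simply another iprobe (or any other MPI call that drives progress), not a longer wait.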

Rajeev


From: "Hudson, Stephen Tobias P" <shudson at anl.gov>
Date: Tuesday, June 29, 2021 at 11:27 AM
To: "Thakur, Rajeev" <thakur at anl.gov>, "discuss at mpich.org" <discuss at mpich.org>
Subject: Re: [mpich-discuss] Iprobe does not find all messages.

Thanks Rajeev. Is this going to be using the rendezvous protocol, in which case the receiver has to receive an envelope and ping back? In that case, it may make sense to me. Does MPICH have a way of controlling the message-size threshold between eager and rendezvous, as Intel MPI does? I haven't found any environment variables.

Thanks,
Steve

________________________________
From: Thakur, Rajeev <thakur at anl.gov>
Sent: Monday, June 28, 2021 9:13 PM
To: discuss at mpich.org <discuss at mpich.org>
Cc: Hudson, Stephen Tobias P <shudson at anl.gov>
Subject: Re: [mpich-discuss] Iprobe does not find all messages.


I guess the reason is that Iprobe does not find a message until some MPI function causes progress on the communication. A delay alone will not cause progress.

Rajeev

From: "Hudson, Stephen Tobias P via discuss" <discuss at mpich.org>
Reply-To: "discuss at mpich.org" <discuss at mpich.org>
Date: Monday, June 28, 2021 at 4:30 PM
To: "discuss at mpich.org" <discuss at mpich.org>
Cc: "Hudson, Stephen Tobias P" <shudson at anl.gov>
Subject: [mpich-discuss] Iprobe does not find all messages.

Hi,

TL;DR: Why does iprobe not pick up messages that should be there?

In this use case, a manager calls iprobe (via mpi4py) to look for messages received from various workers.

In the receive function (step 1 below), there is a loop over the workers doing an iprobe for each one, and it only exits once a full pass over all of them receives nothing. Each worker may have sent several messages.

loop:
   1. receive_func (loop over all workers doing iprobe until none have anything)
   2. do some other stuff

(actual code: https://github.com/Libensemble/libensemble/blob/main/libensemble/manager.py#L346)
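A minimal sketch of a manager-side receive loop in the spirit of the description above (not the libEnsemble code linked there). It assumes an mpi4py-style communicator where `comm.iprobe(source=..., tag=...)` returns a bool and `comm.recv(source=..., tag=...)` returns the message; the function name `receive_from_workers` and the `empty_passes_required` parameter are hypothetical. Following Rajeev's remark that the probe after one that drove progress should find a message that has arrived, it returns only after two consecutive empty passes over all workers. This is a heuristic, not a guarantee: a message still in transit can arrive after it returns.

```python
def receive_from_workers(comm, workers, any_tag, empty_passes_required=2):
    """Receive every message from `workers` that MPI can currently see.

    `comm` is assumed to behave like an mpi4py communicator:
      comm.iprobe(source=w, tag=t) -> bool
      comm.recv(source=w, tag=t)   -> message object
    Returns a list of (worker, message) pairs in the order received.
    """
    received = []
    empty_passes = 0
    while empty_passes < empty_passes_required:
        got_any = False
        for w in workers:
            # Drain this worker: each iprobe call also drives progress,
            # so keep probing until it reports nothing pending.
            while comm.iprobe(source=w, tag=any_tag):
                received.append((w, comm.recv(source=w, tag=any_tag)))
                got_any = True
        # Count consecutive passes that found nothing; reset on any hit.
        empty_passes = 0 if got_any else empty_passes + 1
    return received
```

With mpi4py this might be called as `receive_from_workers(MPI.COMM_WORLD, range(1, MPI.COMM_WORLD.Get_size()), MPI.ANY_TAG)`. The function takes the communicator as a parameter rather than importing mpi4py, so it can also be exercised with a stand-in object; setting `empty_passes_required=1` gives the single-pass exit condition described above.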

The workers send with isend, and I know they have passed the isend call for all the required messages. On the manager side, the iprobe loop does not necessarily pick up all messages in one call to receive_func (but if I replace each iprobe call with its own loop of several iprobes, it does).

In my code, the iprobe is part of a loop that may occasionally spend some time in another function (step 2), so I want to better understand why it may not pick up all the messages. Even if I put a delay of a few seconds in step 2, there are still cycles that do not pick up the messages, as though a certain number of iprobes were required.

In particular, I want to understand whether the messages are already present on the receiving end (maybe in the unexpected message buffer) but are not being picked up by iprobe.

Or is there some delay on the sending end? (From timing, I know the code has passed the isend for all messages.)

Also, since adding several iprobes (in their own sub-loop) always finds the messages, if the delay is not on the receiver side, is it doing some kind of handshake?

I have seen answers online suggesting that this is expected behavior, but I wonder if there is a fairly simple explanation of why iprobe does not pick everything up. If it's too complicated, that's okay.

I am running on an Ubuntu laptop with:

$ mpirun --version
HYDRA build details:
    Version:                                 3.2.1

Thanks,

Steve