[mpich-discuss] Questions about non-blocking collective calls...

Balaji, Pavan balaji at anl.gov
Thu Oct 22 12:42:55 CDT 2015


Eric,

I'm not sure I understand your question.  MPI assumes smart users.  :-)  If there's something you can do on top of MPI with the same efficiency as within MPI, then we typically avoid adding it to the standard and assume that users will do it on their own, or more likely that someone will provide a higher-level library above MPI to do it for you.  There are some exceptions to this rule, of course (e.g., MPI_DIMS_CREATE, which is a widely used but, IMO, useless part of the standard -- you can implement it above MPI), but I'm talking about the general trend.

This particular case is no different.  If you write this above MPI, in a portable way, you can get most but not all of the performance.  For example, if MPI detects that your network is InfiniBand or Portals, it can use triggered collectives to do the same operation more efficiently.  It can use shared memory internally in some cases.  It can offload the communication to the network hardware, if the switch supports it.  As both Jeff and I pointed out, you'll get most of the performance doing this above MPI, so you might not care about the last 10% that you might lose.  But it is in the standard for people who do care about that level of performance.
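Just to make that concrete, here is a minimal sketch of a gather written above MPI with point-to-point calls, where the root processes each block as it arrives.  The function name is made up, and the flat pattern and one int per rank are purely for brevity; a scalable version would use a tree:

    #include <mpi.h>
    #include <stdlib.h>

    /* Hand-rolled gather above MPI: the root posts one MPI_Irecv per rank
     * and processes each block as MPI_Waitany reports it complete. */
    void gather_and_process(int myval, int *recvbuf, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank != root) {
            MPI_Send(&myval, 1, MPI_INT, root, 0, comm);
            return;
        }

        recvbuf[root] = myval;                  /* root's own contribution */
        MPI_Request *reqs = malloc(size * sizeof(MPI_Request));
        for (int r = 0; r < size; r++)
            reqs[r] = MPI_REQUEST_NULL;
        for (int r = 0; r < size; r++)
            if (r != root)
                MPI_Irecv(&recvbuf[r], 1, MPI_INT, r, 0, comm, &reqs[r]);

        for (int done = 0; done < size - 1; done++) {
            int idx;
            MPI_Waitany(size, reqs, &idx, MPI_STATUS_IGNORE);
            /* process recvbuf[idx] here, while other receives are in flight */
        }
        free(reqs);
    }

This is portable, but it cannot take advantage of triggered operations, shared memory, or switch offload the way a native MPI_Igather could.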

Another side effect that you might not have considered is progress.  If you do this above MPI and you are blocked inside a different MPI call (say you called some other library like PETSc, which did an MPI barrier), you are not making progress on your home-grown collective.  If this is integrated into MPI, the MPI library will make progress for you.
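Roughly speaking (the my_gather_* helpers and other_library_call below are hypothetical placeholders, not real APIs; this is only a sketch of the difference):

    #include <mpi.h>

    /* Hypothetical home-grown nonblocking gather built above MPI. */
    extern void start_my_gather(void *state);   /* posts the isends/irecvs  */
    extern int  my_gather_test(void *state);    /* advances it and tests it */
    extern void other_library_call(void);       /* e.g. ends in MPI_Barrier */

    void compare_progress(int *sbuf, int *rbuf, int n, void *state)
    {
        /* Home-grown: the later phases of your collective only advance when
         * your own code runs, so nothing happens while you sit inside
         * other_library_call(). */
        start_my_gather(state);
        other_library_call();
        while (!my_gather_test(state))
            ;   /* it only moves forward when you poll it yourself */

        /* Native: the MPI progress engine can advance the Igather even while
         * you are blocked in an unrelated MPI call made by another library. */
        MPI_Request req;
        MPI_Igather(sbuf, n, MPI_INT, rbuf, n, MPI_INT, 0,
                    MPI_COMM_WORLD, &req);
        other_library_call();
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }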

In any case, I think we have gone beyond MPICH and into MPI standard territory, so this discussion is probably best continued on the Forum mailing list.  You can find more information here:

http://meetings.mpi-forum.org/MPI_4.0_main_page.php


  -- Pavan




On 10/22/15, 12:29 PM, "Eric Chamberland" <Eric.Chamberland at giref.ulaval.ca> wrote:

>Hi Pavan and Jeff,
>
>Thanks a lot for your answers.
>
>I feel there is something to say about the MPI standard's expectations...
>
>Why should MPI users' knowledge (or ignorance) of the scalability of MPI
>communications be a prerequisite for developing high-performance MPI
>applications?  Especially when we want to keep things simple by using
>high-level MPI functionality, like MPI_Gather, and letting the library do
>the best work it can...
>
>In other words, why is the complexity (and therefore the scalability) of 
>MPI algorithms not guaranteed by the standard?  If everyone who is 
>familiar with the performance of MPI communication has the burden of 
>writing MPI calls in a way that scales and performs well, aren't we all 
>rewriting essentially the same "good" code that should be in the standard?
>
>Is the standard deliberately blind to these (crucial) questions?
>
>I may simply be too naive... tell me! :)
>
>(i.e., the C++ standard guarantees the complexity of its sort algorithm: 
>https://en.wikipedia.org/wiki/Sort_%28C%2B%2B%29)
>
>On 21/10/15 10:03 PM, Balaji, Pavan wrote:
> > You might want to join the collectives working group and voice your 
>opinion over there.
>
>Ok, where exactly do I do this?
>
>Btw, I don't want to blame anybody... I am just learning and discussing 
>here!!! :)
>
>Thanks for reading!
>
>Eric
>
>On 21/10/15 11:56 PM, Jeff Hammond wrote:
>> Depending on the size of your data, you could pipeline a series of
>> MPI_Igather calls and process the data associated with each partial
>> buffer as it completes.  Of course, this will change the layout of the
>> buffer at the root unless you do something interesting with datatypes
>> (e.g. a struct with offsets).  This may or may not matter, if you are
>> going to process it anyway.
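A hedged sketch of that pipelining idea (the chunk count and the commented-out processing hook are placeholders, not something from the original mail):

    #include <mpi.h>

    #define NCHUNKS 8   /* number of pipeline stages (an arbitrary choice) */

    void pipelined_gather(const int *sendbuf, int *recvbuf, int chunk,
                          int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        MPI_Request reqs[NCHUNKS];

        /* Post one MPI_Igather per chunk, in the same order on every rank.
         * Note the changed layout at the root: chunk c of rank r ends up at
         * recvbuf[c*size*chunk + r*chunk], i.e. grouped by chunk rather than
         * by rank, unless datatypes are used to fix that up. */
        for (int c = 0; c < NCHUNKS; c++)
            MPI_Igather(sendbuf + c * chunk, chunk, MPI_INT,
                        recvbuf + c * (size * chunk), chunk, MPI_INT,
                        root, comm, &reqs[c]);

        /* Handle whichever chunk finishes first while the rest are in flight. */
        for (int done = 0; done < NCHUNKS; done++) {
            int c;
            MPI_Waitany(NCHUNKS, reqs, &c, MPI_STATUS_IGNORE);
            if (rank == root) {
                /* process_chunk(recvbuf + c * (size * chunk), size * chunk); */
            }
        }
    }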
>>
>> In general, I think you may be able to do just fine with rolling your
>> own.  It's a myth that using higher-level functionality in MPI is
>> _always_ better.
>>
>> Jeff
>>
>> On Wed, Oct 21, 2015 at 7:03 PM, Balaji, Pavan <balaji at anl.gov> wrote:
>>
>>     Eric,
>>
>>     The concept of partial completion of collectives did come up in the
>>     Forum, but the Forum decided that it was rather unnatural to define
>>     Iallgather/Igather that way.  So we decided to standardize it the
>>     way it is.
>>
>>     However, there is a separate proposal for streaming collectives,
>>     which is more along the lines of what you are thinking of.  That's
>>     obviously not in MPI-3, but might be considered for a future MPI.
>>     You might want to join the collectives working group and voice your
>>     opinion over there.
>>
>>     With respect to writing your own Igather implementation, as long as
>>     your implementation is logarithmic, it won't be too bad.  However, a
>>     native implementation inside MPI would almost certainly do better
>>     because: (1) it can take advantage of platform-specific features to
>>     improve performance, and (2) if the platform doesn't offer anything
>>     special, it'll do exactly what you are doing above MPI anyway.
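To illustrate "logarithmic" concretely, here is a hedged sketch of a binomial-tree gather written above MPI (root fixed at 0, one int per rank, and blocking calls just to keep the sketch short):

    #include <mpi.h>
    #include <stdlib.h>

    /* Binomial-tree gather to rank 0: about log2(size) steps per process
     * instead of size-1 receives at the root. */
    void tree_gather(int myval, int *recvbuf, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int *tmp = malloc(size * sizeof(int));  /* tmp[i] = value of rank+i */
        tmp[0] = myval;
        int held = 1;        /* how many consecutive ranks' values we hold */

        for (int mask = 1; mask < size; mask <<= 1) {
            if (rank & mask) {
                /* Send everything accumulated so far to the partner, then stop. */
                MPI_Send(tmp, held, MPI_INT, rank - mask, 0, comm);
                break;
            } else if (rank + mask < size) {
                /* Receive the partner's accumulated block. */
                int cnt = (rank + 2 * mask <= size) ? mask : size - rank - mask;
                MPI_Recv(tmp + mask, cnt, MPI_INT, rank + mask, 0, comm,
                         MPI_STATUS_IGNORE);
                held = mask + cnt;
            }
        }
        if (rank == 0)
            for (int i = 0; i < size; i++)
                recvbuf[i] = tmp[i];
        free(tmp);
    }

This is roughly the flavor of tree an implementation might use internally before adding any platform-specific tricks; a home-grown version like Eric's would of course post the communication as nonblocking operations.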
>>
>>     So, apart from any performance bugs that the implementation might
>>     have, using MPI_Igather would be the recommended mechanism for the
>>     best performance.
>>
>>        -- Pavan
>>
>>
>>
>>
>>
>>     On 10/21/15, 2:45 PM, "Eric Chamberland"
>>     <Eric.Chamberland at giref.ulaval.ca> wrote:
>>
>>      >Hi,
>>      >
>>      >A long time ago (in 2002) we programmed here a non-blocking
>>      >MPI_Igather using equivalent MPI_Isend/MPI_Irecv calls (see the 2
>>      >attached files).
>>      >
>>      >A very convenient advantage of this version is that I can do some
>>      >work on the root process as soon as it starts receiving data...
>>      >Then it waits for the next communication to terminate and processes
>>      >the received data.
>>      >
>>      >Now, I am looking at MPI_Igather (and all the non-blocking collective
>>      >MPI calls), and I am somewhat surprised (or ignorant) that I cannot
>>      >have the root rank receive some data, then process it, then wait
>>      >until the next "MPI_Irecv" terminates...
>>      >
>>      >In other words, an MPI_Igather generates only 1 MPI_Request, but I
>>      >would like either to have "p" MPI_Requests generated (with p = size
>>      >of the communicator) or to be able to call MPI_Waitany "p" times on
>>      >the same MPI_Request...  Am I normal? :)
>>      >
>>      >So my 3 questions are:
>>      >
>>      >#1- Is there a way to use MPI_Igather with MPI_Waitany (or something
>>      >else?) to process data as it is received?
>>      >
>>      >#2- Big question: will our implementation with MPI_Isend/MPI_Irecv
>>      >scale to a large number of processes?  What are the possible
>>      >drawbacks of doing it the way we did?
>>      >
>>      >#3- Why should I replace our implementation with the native MPI_Igather?
>>      >
>>      >Thanks!
>>      >
>>      >Eric
>>
>>
>>
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> http://jeffhammond.github.io/
>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

