[mpich-discuss] MPI memory allocation.

Anatoly G anatolyrishon at gmail.com
Mon Dec 9 01:46:22 CST 2013


With MPICH 3.0.4 the situation repeated. It looks like MPI allocates
memory for messages.
Can you please advise on a scenario in which MPI, or perhaps TCP underneath
MPI, allocates memory due to a high transfer rate?
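
For reference, a minimal self-contained sketch of the kind of scenario I
mean (message count and size here are illustrative, not from my
application): small messages are typically sent eagerly and complete on the
sender without a matching receive, so while the receiver lags behind, the
MPI library itself has to buffer them in its unexpected-message queue and
its memory grows.

    /* Rank 0 floods rank 1 with small eager messages; while rank 1 has not
     * yet posted the matching receives, the library buffers them. */
    #include <mpi.h>
    #include <string.h>
    #include <unistd.h>

    #define NMSG      100000   /* illustrative */
    #define MSG_BYTES 2048     /* small, typically below the eager threshold */

    int main(int argc, char **argv)
    {
        int rank;
        char buf[MSG_BYTES];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, sizeof buf);

        if (rank == 0) {
            for (int i = 0; i < NMSG; i++)
                MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            sleep(10);  /* receiver lags; unexpected messages pile up */
            for (int i = 0; i < NMSG; i++)
                MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }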


On Mon, Dec 9, 2013 at 9:32 AM, Anatoly G <anatolyrishon at gmail.com> wrote:

> Thank you very much.
> Issend is not so good for me: it doesn't give me the fault tolerance I
> need. If a slave process fails, the master stalls.
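> One direction I am considering (the names here are hypothetical, and the
> liveness check itself is outside MPI) is to poll with MPI_Testany instead
> of blocking in MPI_Waitany, so the master never blocks forever on a dead
> slave:
>
>     #include <mpi.h>
>     #define MSG_BYTES 2048
>
>     /* Drain slave messages without blocking; between polls the caller can
>      * run its own out-of-band liveness check (hypothetical callbacks). */
>     void master_poll_loop(MPI_Comm comm, int nslaves, MPI_Request *reqs,
>                           char (*bufs)[MSG_BYTES],
>                           int (*slave_alive)(int), void (*handle)(int, char *))
>     {
>         for (;;) {
>             int idx, flag;
>             MPI_Testany(nslaves, reqs, &idx, &flag, MPI_STATUS_IGNORE);
>             if (flag && idx != MPI_UNDEFINED) {
>                 handle(idx, bufs[idx]);
>                 MPI_Irecv(bufs[idx], MSG_BYTES, MPI_CHAR, idx + 1, 0,
>                           comm, &reqs[idx]);
>             } else {
>                 for (int i = 0; i < nslaves; i++)
>                     if (!slave_alive(i))
>                         return;   /* stop waiting on a failed slave */
>             }
>         }
>     }
>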
> I tried mpich-3.0.4 with hydra-3.0.4, but my program, which uses MPI fault
> tolerance, doesn't recognize the failure of a slave process; it does
> recognize the failure with MPICH2. Maybe you can suggest a solution?
> I tried to use hydra from MPICH2 while linking my program with MPICH3.
> This combination recognizes failures, but I'm not sure that such a
> combination is stable enough.
> Can you please advise?
> Anatoly.
>
>
>
> On Sat, Dec 7, 2013 at 5:20 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
>>
>> As much as I hate saying this — some people find it easier to think of it
>> as “MPICH3”.
>>
>>   — Pavan
>>
>> On Dec 7, 2013, at 7:37 AM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>>
>> > MPICH is just the new version of MPICH2. We renamed it when we went
>> past version 3.0.
>> >
>> > On Dec 7, 2013, at 3:55 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>> >
>> >> Ok. I'll try Issend first, and as a next step upgrade MPICH to 3.0.4.
>> >> I previously thought that MPICH and MPICH2 were two different branches,
>> >> with MPICH2 partially supporting fault tolerance and MPICH not. Now I
>> >> understand that I was wrong and that MPICH is simply the new version of
>> >> MPICH2.
>> >>
>> >> Thank you very much,
>> >> Anatoly.
>> >>
>> >>
>> >>
>> >> On Thu, Dec 5, 2013 at 11:01 PM, Rajeev Thakur <thakur at mcs.anl.gov>
>> wrote:
>> >> The master is receiving more incoming messages than it can match
>> >> quickly enough with Irecvs. Try using MPI_Issend instead of MPI_Isend.
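>> >>
>> >> For example, a minimal sketch of the slave send loop with that change
>> >> (buffer size and names are illustrative):
>> >>
>> >>     #include <mpi.h>
>> >>     #define MSG_BYTES 2048
>> >>
>> >>     /* MPI_Issend (synchronous mode) only completes once the master has
>> >>      * matched the message, so the slave is throttled to the master's
>> >>      * receive rate instead of filling the master's unexpected queue. */
>> >>     void slave_send_loop(MPI_Comm comm, char *buf, int nmsgs)
>> >>     {
>> >>         for (int i = 0; i < nmsgs; i++) {
>> >>             MPI_Request req;
>> >>             /* before: MPI_Isend(buf, MSG_BYTES, MPI_CHAR, 0, 0, comm, &req); */
>> >>             MPI_Issend(buf, MSG_BYTES, MPI_CHAR, 0, 0, comm, &req);
>> >>             MPI_Wait(&req, MPI_STATUS_IGNORE);
>> >>         }
>> >>     }
>> >>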
>> >>
>> >> Rajeev
>> >>
>> >> On Dec 5, 2013, at 2:58 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>> >>
>> >> > Hello.
>> >> > I'm using MPICH2 1.5.
>> >> > My system consists of a master and 16 slaves.
>> >> > The system uses a number of communicators.
>> >> > A single communicator is used for the scenario below:
>> >> > Each slave continuously sends a 2 KB data buffer using MPI_Isend and
>> >> > waits with MPI_Wait.
>> >> > The master starts with an MPI_Irecv for each slave,
>> >> > then, in an endless loop, calls MPI_Waitany and re-posts MPI_Irecv for
>> >> > the rank returned by MPI_Waitany.
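>> >> >
>> >> > As a sketch (illustrative names, assuming slave i is rank i+1), the
>> >> > master side looks roughly like:
>> >> >
>> >> >     #include <mpi.h>
>> >> >     #define MSG_BYTES 2048
>> >> >
>> >> >     /* Pre-post one MPI_Irecv per slave, then loop: wait for any
>> >> >      * message and immediately re-post the receive for that slave. */
>> >> >     void master_loop(MPI_Comm comm, int nslaves,
>> >> >                      char (*bufs)[MSG_BYTES], MPI_Request *reqs)
>> >> >     {
>> >> >         for (int i = 0; i < nslaves; i++)
>> >> >             MPI_Irecv(bufs[i], MSG_BYTES, MPI_CHAR, i + 1, 0,
>> >> >                       comm, &reqs[i]);
>> >> >         for (;;) {
>> >> >             int idx;
>> >> >             MPI_Waitany(nslaves, reqs, &idx, MPI_STATUS_IGNORE);
>> >> >             /* ... process bufs[idx] ... */
>> >> >             MPI_Irecv(bufs[idx], MSG_BYTES, MPI_CHAR, idx + 1, 0,
>> >> >                       comm, &reqs[idx]);
>> >> >         }
>> >> >     }
>> >> >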
>> >> >
>> >> > Another communicator is used for broadcast communication (commands
>> >> > between the master and slaves), but it is not used in parallel with
>> >> > the previous communicator, only before or after the data transfer.
>> >> >
>> >> > The system runs on two computers linked by 1 Gbit/s Ethernet.
>> >> > The master runs on the first computer and all slaves on the other.
>> >> > Network traffic is ~800 Mbit/s.
>> >> >
>> >> > After 1-2 minutes of execution, the master process starts to increase
>> >> > its memory allocation and the network traffic drops.
>> >> > This memory growth and traffic slowdown continue until MPI fails,
>> >> > without saving a core file.
>> >> > My program doesn't allocate memory. Can you please explain this
>> >> > behaviour?
>> >> > How can I make MPI stop the slaves from sending when the master can't
>> >> > keep up with the traffic, instead of allocating memory and failing?
>> >> >
>> >> >
>> >> > Thank you,
>> >> > Anatoly.
>> >> >
>> >> > P.S.
>> >> > In my standalone test I simulate similar behaviour, but with a single
>> >> > thread in each process (master and hosts).
>> >> > When I run the standalone test, the master stops the slaves until it
>> >> > finishes processing the accumulated data, and MPI does not increase
>> >> > its memory allocation.
>> >> > When the master is free, the slaves continue to send data.
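>> >> >
>> >> > A hypothetical sketch of that stop/go idea at the MPI level (the tags
>> >> > and window size are illustrative, not from my code): every WINDOW
>> >> > messages the slave waits for a small acknowledgement from the master,
>> >> > so it can never run more than WINDOW buffers ahead of the master's
>> >> > processing.
>> >> >
>> >> >     #include <mpi.h>
>> >> >     #define MSG_BYTES 2048
>> >> >     #define DATA_TAG  0
>> >> >     #define ACK_TAG   1
>> >> >     #define WINDOW    64
>> >> >
>> >> >     /* Slave side: send data, and block for an ack every WINDOW
>> >> >      * messages. The master would send one int on ACK_TAG after it has
>> >> >      * consumed WINDOW buffers from this slave. */
>> >> >     void slave_send_with_credits(MPI_Comm comm, char *buf, int nmsgs)
>> >> >     {
>> >> >         for (int i = 0; i < nmsgs; i++) {
>> >> >             MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, DATA_TAG, comm);
>> >> >             if ((i + 1) % WINDOW == 0) {
>> >> >                 int ack;
>> >> >                 MPI_Recv(&ack, 1, MPI_INT, 0, ACK_TAG, comm,
>> >> >                          MPI_STATUS_IGNORE);
>> >> >             }
>> >> >         }
>> >> >     }
>> >> >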
>>
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>>
>>
>
>