[mpich-discuss] MPI memory allocation.

Wesley Bland wbland at mcs.anl.gov
Mon Dec 9 11:28:24 CST 2013


Are you actually seeing hardware failure or is your code just crashing? It's odd that one specific process would fail so often in the same way if it were a hardware problem. 

Thanks,
Wesley

> On Dec 9, 2013, at 11:15 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> 
> One more interesting fact: each time there is a failure, only the master process fails; the slaves are still alive, together with mpiexec.hydra. I thought that the slaves should fail too, but they remain alive.
> 
> 
>> On Mon, Dec 9, 2013 at 10:30 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>> I configured the MPICH-3.1rc2 build without shared (.so) libraries, but unlike with MPICH2 & MPICH-3.0.4 I still get .so files. What should I change in the configure line to link MPI with my application statically?
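>> 
>> (For reference, the usual way to request a static-only build from MPICH's configure is through the standard libtool flags; a minimal sketch, where the install prefix is only an example:
>> 
>>     ./configure --prefix=/opt/mpich-3.1 --enable-static --disable-shared
>> 
>> after which linking with mpicc should pick up the static libmpich.a.)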
>> 
>> 
>> 
>>> On Mon, Dec 9, 2013 at 9:47 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>> 
>>> Can you try mpich-3.1rc2?  There were several fixes for this in that version, and it’ll be good to try it out before we go digging too far into this.
>>> 
>>>   — Pavan
>>> 
>>> On Dec 9, 2013, at 1:46 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>>> 
>>> > With MPICH-3.0.4 the situation repeated. It looks like MPI allocates memory for the messages.
>>> > Can you please advise on the scenario in which MPI, or perhaps the TCP layer under MPI, allocates memory due to a high transfer rate?
>>> >
>>> >
>>> > On Mon, Dec 9, 2013 at 9:32 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>>> > Thank you very much.
>>> > Issend is not so good for me: it doesn't work with my fault tolerance scheme, because if a slave process fails, the master stalls.
>>> > I tried mpich-3.0.4 with hydra-3.0.4, but my program, which uses MPI fault tolerance, doesn't recognize the failure of a slave process; it does recognize the failure with MPICH2. Maybe you can suggest a solution?
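>>> >
>>> > (For context, a minimal sketch of the failure detection I mean, assuming failures surface as error return codes rather than aborts; buf, count, slave and tag are illustrative names:
>>> >
>>> >     MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>> >     int rc = MPI_Recv(buf, count, MPI_BYTE, slave, tag,
>>> >                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>> >     if (rc != MPI_SUCCESS) {
>>> >         /* treat the slave as dead; stop posting receives for it */
>>> >     }
>>> >
>>> > )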
>>> > I tried using hydra from MPICH2 while linking my program against MPICH3. This combination recognizes failures, but I'm not sure such a combination is stable enough.
>>> > Can you please advise?
>>> > Anatoly.
>>> >
>>> >
>>> >
>>> > On Sat, Dec 7, 2013 at 5:20 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>> >
>>> > As much as I hate saying this — some people find it easier to think of it as “MPICH3”.
>>> >
>>> >   — Pavan
>>> >
>>> > On Dec 7, 2013, at 7:37 AM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>>> >
>>> > > MPICH is just the new version of MPICH2. We renamed it when we went past version 3.0.
>>> > >
>>> > > On Dec 7, 2013, at 3:55 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>>> > >
>>> > >> Ok. I'll try Issend first, and as a next step upgrade MPICH to 3.0.4.
>>> > >> I used to think that MPICH & MPICH2 were two different branches, where MPICH2 partially supports fault tolerance but MPICH does not. Now I understand that I was wrong and MPICH is simply the new version of MPICH2.
>>> > >>
>>> > >> Thank you very much,
>>> > >> Anatoly.
>>> > >>
>>> > >>
>>> > >>
>>> > >> On Thu, Dec 5, 2013 at 11:01 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>>> > >> The master is receiving more incoming messages than it can match quickly enough with Irecvs. Try using MPI_Issend instead of MPI_Isend.
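>>> > >>
>>> > >> (A minimal sketch of the change on the slave side, with illustrative names buf, tag and comm:
>>> > >>
>>> > >>     MPI_Request req;
>>> > >>     /* synchronous-mode send: the request completes only once
>>> > >>        the master has matched the message with a receive */
>>> > >>     MPI_Issend(buf, 2048, MPI_BYTE, 0, tag, comm, &req);
>>> > >>     MPI_Wait(&req, MPI_STATUS_IGNORE);
>>> > >>
>>> > >> Because each slave waits on the request before sending again, the slaves are throttled to the master's matching rate and unmatched messages cannot pile up at the master.)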
>>> > >>
>>> > >> Rajeev
>>> > >>
>>> > >> On Dec 5, 2013, at 2:58 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>>> > >>
>>> > >> > Hello.
>>> > >> > I"m using MPICH2 1.5.
>>> > >> > My system contains master and 16 slaves.
>>> > >> > System uses number of communicators.
>>> > >> > A single communicator is used for the following scenario:
>>> > >> > Each slave continuously sends a 2 KB data buffer using MPI_Isend and waits with MPI_Wait.
>>> > >> > The master starts by posting an MPI_Irecv for each slave,
>>> > >> > then loops forever:
>>> > >> > MPI_Waitany, followed by a new MPI_Irecv for the rank returned by MPI_Waitany.
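>>> > >> > (In outline, with illustrative names, and assuming the slaves are ranks 1..NSLAVES, the master loop is roughly:
>>> > >> >
>>> > >> >     MPI_Request reqs[NSLAVES];            /* NSLAVES == 16 */
>>> > >> >     char bufs[NSLAVES][2048];
>>> > >> >     for (int i = 0; i < NSLAVES; i++)
>>> > >> >         MPI_Irecv(bufs[i], 2048, MPI_BYTE, i + 1, tag, comm, &reqs[i]);
>>> > >> >     while (1) {
>>> > >> >         int idx;
>>> > >> >         MPI_Waitany(NSLAVES, reqs, &idx, MPI_STATUS_IGNORE);
>>> > >> >         /* process bufs[idx], then re-post that slave's receive */
>>> > >> >         MPI_Irecv(bufs[idx], 2048, MPI_BYTE, idx + 1, tag, comm, &reqs[idx]);
>>> > >> >     }
>>> > >> >
>>> > >> > )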
>>> > >> >
>>> > >> > Another communicator is used for broadcast communication (commands between the master and the slaves),
>>> > >> > but it is not used in parallel with the previous communicator,
>>> > >> > only before or after the data transfer.
>>> > >> >
>>> > >> > The system runs on two computers linked by 1 Gbit/s Ethernet.
>>> > >> > The master runs on the first computer, all slaves on the other one.
>>> > >> > Network traffic is ~800 Mbit/s.
>>> > >> >
>>> > >> > After 1-2 minutes of execution, the master process starts to grow its memory allocation and the network traffic drops.
>>> > >> > The memory growth and the traffic slowdown continue until MPI fails,
>>> > >> > without saving a core file.
>>> > >> > My program does not allocate memory itself. Can you please explain this behaviour?
>>> > >> > How can I make MPI stop the slaves from sending when the master cannot keep up with the traffic, instead of allocating memory until it fails?
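>>> > >> > (A minimal sketch of the kind of throttling I have in mind, with illustrative names: each slave caps the number of unmatched messages at the master by using synchronous-mode sends over a small window:
>>> > >> >
>>> > >> >     #define WINDOW 8                       /* max unmatched sends in flight */
>>> > >> >     MPI_Request reqs[WINDOW];
>>> > >> >     char bufs[WINDOW][2048];
>>> > >> >     for (int i = 0; i < WINDOW; i++)
>>> > >> >         MPI_Issend(bufs[i], 2048, MPI_BYTE, 0, tag, comm, &reqs[i]);
>>> > >> >     for (;;) {
>>> > >> >         int idx;
>>> > >> >         /* completes only when the master has matched a message,
>>> > >> >            so a slow master stalls the slave here */
>>> > >> >         MPI_Waitany(WINDOW, reqs, &idx, MPI_STATUS_IGNORE);
>>> > >> >         fill_buffer(bufs[idx]);            /* fill_buffer() is hypothetical */
>>> > >> >         MPI_Issend(bufs[idx], 2048, MPI_BYTE, 0, tag, comm, &reqs[idx]);
>>> > >> >     }
>>> > >> >
>>> > >> > )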
>>> > >> >
>>> > >> >
>>> > >> > Thank you,
>>> > >> > Anatoly.
>>> > >> >
>>> > >> > P.S.
>>> > >> > In my standalone test I simulate similar behaviour, but with a single thread in each process (master & slaves).
>>> > >> > When I run the standalone test, the master stops the slaves until it finishes processing the accumulated data, and MPI does not increase its memory allocation.
>>> > >> > When the master is free again, the slaves continue to send data.
>>> >
>>> > --
>>> > Pavan Balaji
>>> > http://www.mcs.anl.gov/~balaji
>>> >
>>> 
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>> 
> 
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss