[mpich-discuss] MPI memory allocation.

Pavan Balaji balaji at mcs.anl.gov
Mon Dec 9 12:10:41 CST 2013


Do you actually need Fault Tolerance (one of your previous emails seemed to indicate that)?

It sounds like there is a bug in either your application or in the MPICH stack that you are trying to track down, and you don’t really care about fault tolerance.  Is that a correct assessment?

Do you have a simplified program that reproduces this error, that we can try?
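
If not, something along these lines, based on the pattern described in your first email, might be a good starting point (all sizes, counts, and names below are placeholders for what your application actually does):

    /* Sketch of a possible reproducer: rank 0 is the master, every other
     * rank floods it with 2 KB messages, as in the original report. */
    #include <mpi.h>
    #include <stdlib.h>

    #define BUF_SIZE 2048      /* 2 KB payload */
    #define NITER    1000000   /* enough traffic to run for a minute or two */

    int main(int argc, char **argv)
    {
        int rank, size, i, idx;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Master: one outstanding MPI_Irecv per slave, then loop on
             * MPI_Waitany and re-post the receive for the completed slave. */
            int nslaves = size - 1;
            char *bufs = malloc((size_t)nslaves * BUF_SIZE);
            MPI_Request *reqs = malloc(nslaves * sizeof(MPI_Request));
            int *recvd = calloc(nslaves, sizeof(int));
            long remaining = (long)nslaves * NITER;

            for (i = 0; i < nslaves; i++)
                MPI_Irecv(bufs + (size_t)i * BUF_SIZE, BUF_SIZE, MPI_CHAR,
                          i + 1, 0, MPI_COMM_WORLD, &reqs[i]);
            while (remaining-- > 0) {
                MPI_Waitany(nslaves, reqs, &idx, MPI_STATUS_IGNORE);
                /* (process bufs[idx] here) then re-post unless this slave is done */
                if (++recvd[idx] < NITER)
                    MPI_Irecv(bufs + (size_t)idx * BUF_SIZE, BUF_SIZE, MPI_CHAR,
                              idx + 1, 0, MPI_COMM_WORLD, &reqs[idx]);
            }
            free(bufs); free(reqs); free(recvd);
        } else {
            /* Slave: send non-stop with MPI_Isend + MPI_Wait. */
            char buf[BUF_SIZE] = {0};
            MPI_Request req;
            for (i = 0; i < NITER; i++) {
                MPI_Isend(buf, BUF_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
                MPI_Wait(&req, MPI_STATUS_IGNORE);
            }
        }

        MPI_Finalize();
        return 0;
    }

Running it with the master on one machine and all slaves on the other, the same placement you described, would be closest to your setup.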

  — Pavan

On Dec 9, 2013, at 11:44 AM, Anatoly G <anatolyrishon at gmail.com> wrote:

> No, the hardware is OK. The master process allocates memory (a check with MemoryScape doesn't show any significant memory allocation in my code). Then network traffic becomes low, and then the master process crashes without saving a core file. I have the core file size limit set to unlimited. I see the same failure (without a core) when I call MPI_Abort, but I am not calling it here.
> 
> 
> On Mon, Dec 9, 2013 at 7:28 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:
> Are you actually seeing a hardware failure, or is your code just crashing? It's odd that one specific process would fail so often in the same way if it were a hardware problem.
> 
> Thanks,
> Wesley
> 
> On Dec 9, 2013, at 11:15 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> 
>> One more interesting fact: each time there is a failure, only the master process fails; the slaves still exist together with mpiexec.hydra. I thought the slaves should fail too, but they stay alive.
>> 
>> 
>> On Mon, Dec 9, 2013 at 10:30 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>> I configured MPICH-3.1rc2 to build without .so files, but unlike with MPICH2 and MPICH-3.0.4 I still get .so files. What should I change in the configure line to link MPI with my application statically?
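>> Would something along these lines be the right direction (just my guess; the prefix is a placeholder)?
>> 
>>     ./configure --prefix=/opt/mpich-3.1rc2 --disable-shared
>>     make && make install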
>> 
>> 
>> 
>> On Mon, Dec 9, 2013 at 9:47 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>> 
>> Can you try mpich-3.1rc2?  There were several fixes for this in that version, and it’ll be good to try it out before we go digging too far into this.
>> 
>>   — Pavan
>> 
>> On Dec 9, 2013, at 1:46 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>> 
>> > With MPICH-3.0.4 the situation repeated. It looks like MPI allocates memory for messages.
>> > Can you please advise on the scenarios in which MPI, or perhaps TCP underneath MPI, allocates memory due to a high transfer rate?
>> >
>> >
>> > On Mon, Dec 9, 2013 at 9:32 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>> > Thank you very much.
>> > Issend is not so good; it can't give me fault tolerance. If a slave process fails, the master stalls.
>> > I tried mpich-3.0.4 with hydra-3.0.4, but my program, which uses MPI fault tolerance, doesn't recognize the failure of a slave process; it does recognize the failure with MPICH2. Maybe you can suggest a solution?
>> > I tried to use hydra from MPICH2 but link my program with MPICH3. This combination recognizes failures, but I'm not sure that such a combination is stable enough.
>> > Can you please advise?
>> > Anatoly.
>> >
>> >
>> >
>> > On Sat, Dec 7, 2013 at 5:20 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>> >
>> > As much as I hate saying this — some people find it easier to think of it as “MPICH3”.
>> >
>> >   — Pavan
>> >
>> > On Dec 7, 2013, at 7:37 AM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>> >
>> > > MPICH is just the new version of MPICH2. We renamed it when we went past version 3.0.
>> > >
>> > > On Dec 7, 2013, at 3:55 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>> > >
>> > >> Ok. I"ll try both Issend, and next step to upgrade MPICH to 3.0.4.
>> > >> I thought before that MPICH & MPICH2 are two different branches, when MPICH2 partially supports Fault tolerance, but MPICH not. Now I understand, that I was wrong and MPICH2 is just main version of MPICH.
>> > >>
>> > >> Thank you very much,
>> > >> Anatoly.
>> > >>
>> > >>
>> > >>
>> > >> On Thu, Dec 5, 2013 at 11:01 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>> > >> The master is receiving more incoming messages than it can match quickly enough with Irecvs. Try using MPI_Issend instead of MPI_Isend.
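>> > >> For example, a slave's send step might look roughly like this after the change (a sketch only; buffer and size names are placeholders):
>> > >>
>> > >>     MPI_Request req;
>> > >>     /* MPI_Issend does not complete until the master has matched the message,
>> > >>        so a slow master naturally throttles the slaves instead of letting
>> > >>        unexpected messages pile up inside MPICH. */
>> > >>     MPI_Issend(buf, BUF_SIZE, MPI_CHAR, 0 /* master */, 0, MPI_COMM_WORLD, &req);
>> > >>     MPI_Wait(&req, MPI_STATUS_IGNORE);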
>> > >>
>> > >> Rajeev
>> > >>
>> > >> On Dec 5, 2013, at 2:58 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
>> > >>
>> > >> > Hello.
>> > >> > I"m using MPICH2 1.5.
>> > >> > My system contains master and 16 slaves.
>> > >> > System uses number of communicators.
>> > >> > The single communicator used for below scenario:
>> > >> > Each slave sends non-stop 2Kbyte data buffer using MPI_Isend and waits using MPI_Wait.
>> > >> > Master starts with MPI_Irecv to each slave
>> > >> > Then in endless loop:
>> > >> > MPI_Waitany and MPI_Irecv on rank returned by MPI_Waitany.
>> > >> >
>> > >> > Another communicator is used for broadcast communication (commands between master and slaves),
>> > >> > but it's not used in parallel with the previous communicator,
>> > >> > only before or after the data transfer.
>> > >> >
>> > >> > The system runs on two computers linked by 1 Gbit/s Ethernet.
>> > >> > The master runs on the first computer, and all slaves on the other one.
>> > >> > Network traffic is ~800 Mbit/s.
>> > >> >
>> > >> > After 1-2 minutes of execution, the master process starts to increase its memory allocation and network traffic becomes low.
>> > >> > This memory growth and network slowdown continue until MPI fails,
>> > >> > without saving a core file.
>> > >> > My program doesn't allocate memory. Can you please explain this behaviour?
>> > >> > How can I make MPI stop the slaves from sending when the master can't serve such traffic, instead of allocating memory and failing?
>> > >> >
>> > >> >
>> > >> > Thank you,
>> > >> > Anatoly.
>> > >> >
>> > >> > P.S.
>> > >> > In my stand-alone test I simulate similar behaviour, but with a single thread in each process (master and slaves).
>> > >> > When I run the stand-alone test, the master stops the slaves until it finishes processing the accumulated data, and MPI doesn't increase its memory allocation.
>> > >> > When the master is free, the slaves continue to send data.
>> >
>> > --
>> > Pavan Balaji
>> > http://www.mcs.anl.gov/~balaji
>> >
>> 
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>> 
> 

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji



