[mpich-discuss] MPI memory allocation.

Pavan Balaji balaji at mcs.anl.gov
Thu Dec 19 09:16:36 CST 2013


Are you printing it out?

  — Pavan

On Dec 19, 2013, at 10:51 PM, Anatoly G <anatolyrishon at gmail.com> wrote:

> Can you please uncomment this section:
> /*
>         // swap tag & enter blocked recv
>         MPI_Status stat;
>         tags[slaveIdx] = (tags[slaveIdx] == TAG1) ? TAG2 : TAG1;
>         MPI_Recv(RcvBufs[slaveIdx], BUF_SZ, MPI::CHAR, slaveRank, tags[slaveIdx], MPI_COMM_WORLD, &stat);
> 
>         ++SlavesRcvIters[slaveIdx];
> */ 
> 
> And then run it.
> Do you see memory allocation increase?
> 
> Regards,
> Anatoly.
> 
> 
> On Thu, Dec 19, 2013 at 4:29 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> 
> I’m not sure what I should look for.  I ran the program and it completed fine.
> 
>   — Pavan
> 
> On Dec 19, 2013, at 7:16 PM, Anatoly G <anatolyrishon at gmail.com> wrote:
> 
> > Good afternoon.
> > My program frequently enters the functions in the attached stack trace.
> > Can you please explain whether that is OK?
> > Did you manage to run the simulation from my previous mail?
> > Did you see the memory grow when the MPI_Recv block is not commented out?
> >
> > Regards,
> > Anatoly.
> >
> >
> > On Thu, Dec 12, 2013 at 9:36 PM, Anatoly G <anatolyrishon at gmail.com> wrote:
> > Hi.
> > I finally have some additional information.
> > I built a short simulation of my real application.
> >
> > A short description of the real scenario:
> > I have a Master + N slaves. Each slave sends the Master 2 types of messages:
> >       • a constant-length message with predefined fields (one of its fields is the length of the second message).
> >       • a second message, whose length differs each time and is passed in the first message.
> > The Master must use MPI_Irecv in order to tolerate slave failures (a blocking MPI_Recv would block the Master if a slave fails).
> > The Master posts an MPI_Irecv for each slave with a buffer size equal to the constant size of the first message type. After receiving a message of the first type, the Master allocates a buffer of the expected size for the second message and posts a receive for it too. This happens in an endless loop for each slave. I use MPI_Waitany to monitor all receives.
> > To tell the messages apart, the Master and slaves use different tags (as IDs) for the first and second messages.
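
The header-then-payload scheme described above can be sketched as a small per-slave decision function. This is only an illustration under assumptions: the tag values, the `Header` layout, and the names `NextRecv` and `next_recv` are hypothetical and do not come from the attached program. The function only decides which tag and buffer size the Master would pass to its next MPI_Irecv for that slave; it makes no MPI calls itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag values: the thread only says the two message types
 * use different tags. */
enum { TAG_HEADER = 1, TAG_PAYLOAD = 2 };

/* Hypothetical fixed-size first message: one field carries the length
 * of the second message. */
typedef struct {
    uint32_t payload_len;
} Header;

/* Parameters for the next receive the Master posts for one slave. */
typedef struct {
    int    tag;    /* tag for the next MPI_Irecv */
    size_t nbytes; /* buffer size for the next MPI_Irecv */
} NextRecv;

/* Called when a receive for one slave completes (for example after
 * MPI_Waitany reports it). completed_tag is the tag of the receive that
 * just finished; hdr is read only when a header has just arrived. */
NextRecv next_recv(int completed_tag, const Header *hdr)
{
    NextRecv next;
    assert(completed_tag == TAG_HEADER || completed_tag == TAG_PAYLOAD);
    if (completed_tag == TAG_HEADER) {
        assert(hdr != NULL);
        next.tag = TAG_PAYLOAD;          /* payload follows the header */
        next.nbytes = hdr->payload_len;  /* size announced by the slave */
    } else {
        next.tag = TAG_HEADER;           /* back to the fixed-size header */
        next.nbytes = sizeof(Header);
    }
    return next;
}
```

For example, after a header announcing 4096 bytes completes, `next_recv(TAG_HEADER, &hdr)` yields the payload tag and a 4096-byte buffer size; after the payload completes, it yields the header tag and `sizeof(Header)` again.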
> >
> > Simulation description:
> > All passed buffers (first & second) have the same size.
> > The slave (SndSyncSlave) sends messages and alternates between 2 tags (like 2 types of messages, but the second one has constant size too).
> > The Master routine (Rcv_WaitAny function) executes MPI_Irecv for the first message, and after receiving it executes MPI_Irecv for the second one.
> >
> > In this scenario, 5 processes work fine, but if I run 20 processes and uncomment the line "usleep(200000)", I see 800 Mbit/s on the network at the beginning of the test; after 1-2 seconds the network speed drops to 200-300 Kbit/s and never recovers.
> >
> > If I add the MPI_Recv block in the Master (uncomment "MPI_Recv" and the lines around it), I see that the Master's memory starts to grow as in my real application, but again this does not happen with 5 processes. This is the scenario used in my real application.
> >
> > Command line: mpiexec.hydra -genvall -f MpiConfigMachines.txt -launcher=ssh -n 20 mpi_rcv_any_multithread 100000 1000000 out
> >
> > where
> > 100000 - number of sends from each slave
> > 1000000 - scale used to separate the input from each slave (used for debugging only)
> > out - prefix of the output file. Each process produces an out_"rank".txt file.
> >
> > MpiConfigMachines.txt - configuration file for my computers: 2 computers connected back to back over a 1 Gbit/s network.
> >
> >
> > Can you please test this case and give me your suggestions?
> >
> > Thank you,
> > Anatoly.
> >
> >
> >
> > On Mon, Dec 9, 2013 at 9:55 PM, Anatoly G <anatolyrishon at gmail.com> wrote:
> > Yes, I understand that. I'll try to make my standalone test closer to the real application. Thank you.
> >
> >
> > On Mon, Dec 9, 2013 at 9:31 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> >
> > It sounds like MPICH is working correctly.  Without a test case, it’s unfortunately quite hard for us to even know what to look for.  It’s also possible that there’s a bug in your code which might be causing some bad behavior.
> >
> >   — Pavan
> >
> > On Dec 9, 2013, at 1:27 PM, Anatoly G <anatolyrishon at gmail.com> wrote:
> >
> > > Yes, I actually need fault tolerance, and it was the main reason for choosing MPICH2. I use fault tolerance to guard against unpredictable bugs in the future. My system should survive partially. But in the regular case I just need full performance. I suspect that I'm not using MPI correctly, but at a slow rate everything works fine. The failure is triggered by increasing the rate of MPI_Isend calls or increasing the data buffer size. I haven't found any strong dependence yet, only the general trend.
> > >
> > > Unfortunately I have a complex system which has a number of threads in each process. Some of the threads use different communicators.
> > >
> > > I tried to simulate the same MPI behavior in a simple standalone test, but the standalone test works fine. It shows full network performance; when I slow down the master (in the standalone test), all slaves stop too and wait for the master to continue. Can I enable any MPICH log to send you the results?
> > >
> > >
> > > On Mon, Dec 9, 2013 at 8:10 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> > >
> > > Do you actually need Fault Tolerance (one of your previous emails seemed to indicate that)?
> > >
> > > It sounds like there is a bug in either your application or in the MPICH stack and you are trying to track that down, and don’t really care about fault tolerance.  Is that a correct assessment?
> > >
> > > Do you have a simplified program that reproduces this error, that we can try?
> > >
> > >   — Pavan
> > >
> > > On Dec 9, 2013, at 11:44 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> > >
> > > > No. The hardware is OK. The Master process allocates memory (a check with MemoryScape doesn't show any significant memory allocation in my code). Then network traffic becomes low, and then the Master process crashes without saving a core file. The core file size is unlimited. I see the same failure (without a core file) when I call MPI_Abort, but I don't call it.
> > > >
> > > >
> > > > On Mon, Dec 9, 2013 at 7:28 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:
> > > > Are you actually seeing hardware failure or is your code just crashing? It's odd that one specific process would fail so often in the same way if it were a hardware problem.
> > > >
> > > > Thanks,
> > > > Wesley
> > > >
> > > > On Dec 9, 2013, at 11:15 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> > > >
> > > >> One more interesting fact. Each time I have a failure, only the master process fails; the slaves still exist, together with mpiexec.hydra. I thought that the slaves should fail too, but they stay alive.
> > > >>
> > > >>
> > > >> On Mon, Dec 9, 2013 at 10:30 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> > > >> I configured the MPICH-3.1rc2 build without "so" files, but unlike with MPICH2 & MPICH-3.0.4, I still get .so files. What should I change in the configure line to link MPI with my application statically?
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Dec 9, 2013 at 9:47 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> > > >>
> > > >> Can you try mpich-3.1rc2?  There were several fixes for this in this version and it’ll be good to try that out before we go digging too far into this.
> > > >>
> > > >>   — Pavan
> > > >>
> > > >> On Dec 9, 2013, at 1:46 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> > > >>
> > > >> > With MPICH 3.0.4 the situation repeats. It looks like MPI allocates memory for messages.
> > > >> > Can you please advise in which scenarios MPI, or maybe TCP under MPI, allocates memory due to a high transfer rate?
> > > >> >
> > > >> >
> > > >> > On Mon, Dec 9, 2013 at 9:32 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> > > >> > Thank you very much.
> > > >> > Issend is not so good: it doesn't let me keep fault tolerance. If a slave process fails, the master stalls.
> > > >> > I tried mpich-3.0.4 with hydra-3.0.4, but my program, which uses MPI fault tolerance, doesn't recognize the failure of a slave process, although it does recognize the failure with MPICH2. Maybe you can suggest a solution?
> > > >> > I tried to use hydra from MPICH2 but link my program with MPICH3. This combination recognizes failures, but I'm not sure that such a combination is stable enough.
> > > >> > Can you please advise?
> > > >> > Anatoly.
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Sat, Dec 7, 2013 at 5:20 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> > > >> >
> > > >> > As much as I hate saying this — some people find it easier to think of it as “MPICH3”.
> > > >> >
> > > >> >   — Pavan
> > > >> >
> > > >> > On Dec 7, 2013, at 7:37 AM, Wesley Bland <wbland at mcs.anl.gov> wrote:
> > > >> >
> > > >> > > MPICH is just the new version of MPICH2. We renamed it when we went past version 3.0.
> > > >> > >
> > > >> > > On Dec 7, 2013, at 3:55 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> > > >> > >
> > > >> > >> OK. I'll try both: Issend, and as the next step upgrading MPICH to 3.0.4.
> > > >> > >> I previously thought that MPICH & MPICH2 were two different branches, where MPICH2 partially supports fault tolerance but MPICH does not. Now I understand that I was wrong and MPICH2 is just the main version of MPICH.
> > > >> > >>
> > > >> > >> Thank you very much,
> > > >> > >> Anatoly.
> > > >> > >>
> > > >> > >>
> > > >> > >>
> > > >> > >> On Thu, Dec 5, 2013 at 11:01 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> > > >> > >> The master is receiving more incoming messages than it can match quickly enough with Irecvs. Try using MPI_Issend instead of MPI_Isend.
> > > >> > >>
> > > >> > >> Rajeev
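
The difference between the two send modes can be illustrated with a toy queue model in plain C. This is a sketch under simplifying assumptions, not a description of MPICH internals: `peak_unmatched` and all of its parameters are hypothetical. It assumes that a standard-mode send of a small message completes locally right away, so messages can pile up at the master, whereas a synchronous-mode send (as with MPI_Issend) cannot complete until its matching receive has been posted, so a slave that waits on each send has at most one unmatched message outstanding.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: returns the peak number of messages the master has to hold
 * because no matching receive was posted yet.
 *   n_slaves          number of sending processes
 *   sends_per_tick    messages each slave tries to send per time step
 *   matches_per_tick  messages the master can match per time step
 *   ticks             number of simulated time steps
 *   synchronous       0: sends complete immediately (eager-style)
 *                     1: a send completes only once matched, so each slave
 *                        has at most one unmatched message in flight */
size_t peak_unmatched(size_t n_slaves, size_t sends_per_tick,
                      size_t matches_per_tick, size_t ticks, int synchronous)
{
    size_t queued = 0, peak = 0;
    assert(n_slaves > 0 && matches_per_tick > 0);
    for (size_t t = 0; t < ticks; ++t) {
        queued += n_slaves * sends_per_tick;   /* new arrivals this step */
        if (synchronous && queued > n_slaves)
            queued = n_slaves;                 /* blocked senders add no more */
        if (queued > peak)
            peak = queued;
        /* the master matches as many queued messages as it can */
        queued -= (queued < matches_per_tick) ? queued : matches_per_tick;
    }
    return peak;
}
```

With 16 slaves each offering 10 messages per step and a master that matches 100 per step, the eager-style backlog in this model keeps growing with the run length, while the synchronous variant never holds more than 16 messages.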
> > > >> > >>
> > > >> > >> On Dec 5, 2013, at 2:58 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> > > >> > >>
> > > >> > >> > Hello.
> > > >> > >> > I'm using MPICH2 1.5.
> > > >> > >> > My system contains a master and 16 slaves.
> > > >> > >> > The system uses a number of communicators.
> > > >> > >> > A single communicator is used for the scenario below:
> > > >> > >> > Each slave sends a 2 Kbyte data buffer non-stop using MPI_Isend and waits using MPI_Wait.
> > > >> > >> > The master starts with an MPI_Irecv for each slave.
> > > >> > >> > Then, in an endless loop:
> > > >> > >> > MPI_Waitany, then MPI_Irecv on the rank returned by MPI_Waitany.
> > > >> > >> >
> > > >> > >> > Another communicator is used for broadcast communication (commands between master + slaves),
> > > >> > >> > but it's not used in parallel with the previous communicator,
> > > >> > >> > only before or after data transfer.
> > > >> > >> >
> > > >> > >> > The system runs on two computers linked by 1 Gbit/s Ethernet.
> > > >> > >> > The master runs on the first computer, all slaves on the other one.
> > > >> > >> > Network traffic is ~800 Mbit/s.
> > > >> > >> >
> > > >> > >> > After 1-2 minutes of execution, the master process starts to increase its memory allocation and network traffic drops.
> > > >> > >> > This memory growth & network slowdown continues until MPI fails,
> > > >> > >> > without saving a core file.
> > > >> > >> > My program doesn't allocate memory. Can you please explain this behaviour?
> > > >> > >> > How can I make MPI stop the slaves from sending if the Master can't handle such traffic, instead of allocating memory and failing?
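
One application-level way to throttle senders, which is not proposed anywhere in this thread and is shown only as an assumption-laden sketch, is credit-based flow control: each slave may have at most a fixed number of unacknowledged messages, and the master returns a credit with a small acknowledgement whenever it has consumed a message. The `CreditWindow` type and the `credit_*` functions are hypothetical helpers, not MPI or MPICH APIs. In a real program the slave could poll for acknowledgements with a nonblocking receive and MPI_Test, so that it never blocks indefinitely on the master.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sender-side credit counter. */
typedef struct {
    int window;   /* maximum number of unacknowledged messages */
    int credits;  /* sends still allowed before an ack is needed */
} CreditWindow;

void credit_init(CreditWindow *w, int window)
{
    assert(w != NULL && window > 0);
    w->window = window;
    w->credits = window;
}

/* Returns 1 and consumes a credit if the slave may send now, else 0. */
int credit_try_send(CreditWindow *w)
{
    assert(w != NULL);
    if (w->credits == 0)
        return 0;          /* window exhausted: wait for an ack first */
    w->credits--;
    return 1;
}

/* Call when an acknowledgement from the master arrives. */
void credit_on_ack(CreditWindow *w)
{
    assert(w != NULL);
    if (w->credits < w->window)
        w->credits++;      /* never exceed the configured window */
}
```

With a window of W per slave and N slaves, the master in this scheme never has more than N * W unconsumed messages outstanding, regardless of how fast the slaves try to send.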
> > > >> > >> >
> > > >> > >> >
> > > >> > >> > Thank you,
> > > >> > >> > Anatoly.
> > > >> > >> >
> > > >> > >> > P.S.
> > > >> > >> > In my standalone test, I simulate similar behaviour, but with a single thread in each process (master & hosts).
> > > >> > >> > When I start the standalone test, the master stalls the slaves until it finishes processing the accumulated data, and MPI doesn't increase memory allocation.
> > > >> > >> > When the Master is free, the slaves continue to send data.
> > > >> > >> > _______________________________________________
> > > >> > >> > discuss mailing list     discuss at mpich.org
> > > >> > >> > To manage subscription options or unsubscribe:
> > > >> > >> > https://lists.mpich.org/mailman/listinfo/discuss
> > > >> > --
> > > >> > Pavan Balaji
> > > >> > http://www.mcs.anl.gov/~balaji
> > <backtrace2.txt>

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji



