[mpich-discuss] MPI memory allocation.

Anatoly G anatolyrishon at gmail.com
Thu Dec 19 05:16:35 CST 2013


Good afternoon.
My program enters the attached stack functions very often.
Can you please explain whether this is expected?
Did you manage to run the simulation from my previous mail?
Did you see the memory rise when MPI_Recv is not commented out?

Regards,
Anatoly.


On Thu, Dec 12, 2013 at 9:36 PM, Anatoly G <anatolyrishon at gmail.com> wrote:

> Hi.
> Finally, I have some additional info.
> I built a short simulation of my real application.
>
> *The short description of the real scenario.*
> I have a Master + N slaves. Each slave sends the Master 2 types of messages:
>
>    1. a constant-length message with predefined fields (one of its fields
>    is the length of the second message);
>    2. a second message whose length is different each time and is passed in
>    the first message.
>
> The Master has to use MPI_Irecv, in order to be tolerant to slave
> failures (a blocking MPI_Recv would block the Master if a slave fails).
> The Master posts an MPI_Irecv for each slave with a buffer size equal to the
> constant size of the first message type. After receiving a first-type message,
> the Master allocates the expected buffer for the second message and posts a
> receive for it too. This happens in an endless loop for each slave. I use
> MPI_Waitany to monitor all the receives (a rough sketch of this loop is shown
> below).
> To separate the two messages, the Master & slaves use different tags (as ids)
> for the first & second messages.
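>
> To make this concrete, here is a minimal sketch of the receive loop just
> described (my own reconstruction, not the actual Rcv_WaitAny code; it assumes
> the Master is rank 0, the slaves are ranks 1..N, the header is a single int
> carrying the payload length, and HDR_TAG/DATA_TAG are illustrative tag names):
>
> /* Sketch of the Master side: one outstanding MPI_Irecv per slave,
>    monitored with MPI_Waitany; a header announces the payload length. */
> #include <mpi.h>
> #include <stdlib.h>
>
> #define HDR_TAG  1   /* tag of the constant-length (header) message */
> #define DATA_TAG 2   /* tag of the variable-length (payload) message */
>
> int main(int argc, char **argv)
> {
>     int nprocs;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>     int nslaves = nprocs - 1;
>
>     int *hdr = malloc(nslaves * sizeof(int));      /* header = payload length */
>     char **payload = calloc(nslaves, sizeof(char *));
>     int *expect_hdr = malloc(nslaves * sizeof(int));
>     MPI_Request *req = malloc(nslaves * sizeof(MPI_Request));
>
>     for (int s = 0; s < nslaves; s++) {            /* arm a header receive per slave */
>         expect_hdr[s] = 1;
>         MPI_Irecv(&hdr[s], 1, MPI_INT, s + 1, HDR_TAG, MPI_COMM_WORLD, &req[s]);
>     }
>
>     for (;;) {                                     /* endless receive loop */
>         int i;
>         MPI_Waitany(nslaves, req, &i, MPI_STATUS_IGNORE);
>         if (expect_hdr[i]) {
>             /* Header arrived: allocate the announced payload buffer and
>                post the non-blocking receive for it. */
>             payload[i] = malloc(hdr[i]);
>             MPI_Irecv(payload[i], hdr[i], MPI_BYTE, i + 1, DATA_TAG,
>                       MPI_COMM_WORLD, &req[i]);
>             expect_hdr[i] = 0;
>         } else {
>             /* Payload arrived: process it, then re-arm the header receive. */
>             free(payload[i]);
>             MPI_Irecv(&hdr[i], 1, MPI_INT, i + 1, HDR_TAG, MPI_COMM_WORLD, &req[i]);
>             expect_hdr[i] = 1;
>         }
>     }                                              /* MPI_Finalize() never reached */
> }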
>
> *Simulation description:*
> All passed buffers (first & second) have the same size.
> The slave (SndSyncSlave) sends messages, alternating between the 2 tags
> (mimicking the 2 message types, except that here the second one has a constant
> size too).
> The Master routine (the Rcv_WaitAny function) executes MPI_Irecv for the first
> message and, after it is received, executes MPI_Irecv for the second one.
>
> In this scenario 5 processes work fine, but if I execute 20 processes
> and uncomment the line "usleep(200000)", I see 800 Mbit/s on the network
> at the beginning of the test; after 1-2 seconds the network speed drops to
> 200-300 Kbit/s and never increases back.
>
> If I add the blocking MPI_Recv in the Master (uncomment "MPI_Recv" and the
> lines around it), I see that the Master starts increasing its memory just like
> my real application, but again, with 5 processes this does not happen. This is
> the scenario used in my real application.
>
> *Command line:* mpiexec.hydra -genvall -f MpiConfigMachines.txt
> -launcher=ssh -n 20 mpi_rcv_any_multithread 100000 1000000 out
>
> where
> 100000 - the number of sends from each slave
> 1000000 - a scale used to separate the input of each slave (used for debug only)
> out - the prefix of the output files. Each process produces an out_"rank".txt file.
>
> MpiConfigMachines.txt - the machine configuration file for my computers: 2 computers
> connected back to back over a 1 Gbit/s network.
>
>
> Can you please test this case and give me your suggestions?
>
> Thank you,
> Anatoly.
>
>
>
> On Mon, Dec 9, 2013 at 9:55 PM, Anatoly G <anatolyrishon at gmail.com> wrote:
>
>> Yes, I understand that. I'll try to make my stand-alone test closer to the
>> real application. Thank you.
>>
>>
>> On Mon, Dec 9, 2013 at 9:31 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>
>>>
>>> It sounds like MPICH is working correctly.  Without a test case, it’s
>>> unfortunately quite hard for us to even know what to look for.  It’s also
>>> possible that there’s a bug in your code which might be causing some bad
>>> behavior.
>>>
>>>   — Pavan
>>>
>>> On Dec 9, 2013, at 1:27 PM, Anatoly G <anatolyrishon at gmail.com> wrote:
>>>
>>> > Yes, I actually do need fault tolerance, and it was the main reason for
>>> choosing MPICH2. I use fault tolerance against unpredictable bugs in the
>>> future: my system should be able to survive partially. But in the regular case
>>> I just need full performance. I suspect that I don't use MPI correctly, yet at
>>> a slow rate everything works fine. The failure is caused by increasing the rate
>>> of MPI_Isend or increasing the data buffer size. I haven't found any strong
>>> dependence yet, only the main trend.
>>> >
>>> > Unfortunately, I have a complex system with a number of threads in
>>> each process. Some of the threads use different communicators.
>>> >
>>> > I tried to simulate the same MPI behavior in a simple stand-alone test,
>>> but the stand-alone test works fine. It shows full network performance; when
>>> I slow down the master (in the stand-alone test), all slaves stop too and
>>> wait for the master to continue. Can I enable any MPICH log and send you
>>> the results?
>>> >
>>> >
>>> > On Mon, Dec 9, 2013 at 8:10 PM, Pavan Balaji <balaji at mcs.anl.gov>
>>> wrote:
>>> >
>>> > Do you actually need Fault Tolerance (one of your previous emails
>>> seemed to indicate that)?
>>> >
>>> > It sounds like there is a bug in either your application or in the MPICH
>>> stack and you are trying to trace that down, and don’t really care about
>>> fault tolerance.  Is that a correct assessment?
>>> >
>>> > Do you have a simplified program that reproduces this error, that we
>>> can try?
>>> >
>>> >   — Pavan
>>> >
>>> > On Dec 9, 2013, at 11:44 AM, Anatoly G <anatolyrishon at gmail.com>
>>> wrote:
>>> >
>>> > > No. The hardware is OK. The Master process allocates memory (a check with
>>> MemoryScape doesn't show any significant memory allocation in my own code). Then
>>> the network traffic becomes low, and then the Master process crashes w/o saving a
>>> core file. I have an unlimited core file size. I see the same failure (w/o a core)
>>> when I call MPI_Abort, but I don't call it.
>>> > >
>>> > >
>>> > > On Mon, Dec 9, 2013 at 7:28 PM, Wesley Bland <wbland at mcs.anl.gov>
>>> wrote:
>>> > > Are you actually seeing hardware failure or is your code just
>>> crashing? It's odd that one specific process would fail so often in the
>>> same way if it were a hardware problem.
>>> > >
>>> > > Thanks,
>>> > > Wesley
>>> > >
>>> > > On Dec 9, 2013, at 11:15 AM, Anatoly G <anatolyrishon at gmail.com>
>>> wrote:
>>> > >
>>> > >> One more interesting fact. Each time I have a failure, only the
>>> master process fails, but the slaves still exist together with
>>> mpiexec.hydra. I thought that the slaves should fail too, but they stay alive.
>>> > >>
>>> > >>
>>> > >> On Mon, Dec 9, 2013 at 10:30 AM, Anatoly G <anatolyrishon at gmail.com>
>>> wrote:
>>> > >> I configured the MPICH-3.1rc2 build w/o "so" files, but unlike with
>>> MPICH2 & MPICH-3.0.4 I still get .so files. What should I change in the configure
>>> line to link MPI with my application statically?
>>> > >>
>>> > >>
>>> > >>
>>> > >> On Mon, Dec 9, 2013 at 9:47 AM, Pavan Balaji <balaji at mcs.anl.gov>
>>> wrote:
>>> > >>
>>> > >> Can you try mpich-3.1rc2?  There were several fixes for this in
>>> this version and it’ll be good to try that out before we go digging too far
>>> into this.
>>> > >>
>>> > >>   — Pavan
>>> > >>
>>> > >> On Dec 9, 2013, at 1:46 AM, Anatoly G <anatolyrishon at gmail.com>
>>> wrote:
>>> > >>
>>> > >> > With MPICH-3.0.4 the situation repeats. It looks like MPI
>>> allocates memory for messages.
>>> > >> > Can you please advise on the scenarios in which MPI, or maybe TCP under
>>> MPI, allocates memory due to a high transfer rate?
>>> > >> >
>>> > >> >
>>> > >> > On Mon, Dec 9, 2013 at 9:32 AM, Anatoly G <
>>> anatolyrishon at gmail.com> wrote:
>>> > >> > Thank you very much.
>>> > >> > Issend is not so good for me; it doesn't work with my fault tolerance
>>> needs. If a slave process fails, the master stalls.
>>> > >> > I tried mpich-3.0.4 with hydra-3.0.4, but my program, which uses
>>> MPI fault tolerance, doesn't recognize the failure of a slave process, whereas
>>> it recognizes the failure with MPICH2. Maybe you can suggest a solution?
>>> > >> > I tried to use hydra from MPICH2 but link my program with MPICH3.
>>> This combination recognizes failures, but I'm not sure that such a
>>> combination is stable enough.
>>> > >> > Can you please advise?
>>> > >> > Anatoly.
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > On Sat, Dec 7, 2013 at 5:20 PM, Pavan Balaji <balaji at mcs.anl.gov>
>>> wrote:
>>> > >> >
>>> > >> > As much as I hate saying this — some people find it easier to
>>> think of it as “MPICH3”.
>>> > >> >
>>> > >> >   — Pavan
>>> > >> >
>>> > >> > On Dec 7, 2013, at 7:37 AM, Wesley Bland <wbland at mcs.anl.gov>
>>> wrote:
>>> > >> >
>>> > >> > > MPICH is just the new version of MPICH2. We renamed it when we
>>> went past version 3.0.
>>> > >> > >
>>> > >> > > On Dec 7, 2013, at 3:55 AM, Anatoly G <anatolyrishon at gmail.com>
>>> wrote:
>>> > >> > >
>>> > >> > >> OK. I'll try both: Issend, and as a next step an upgrade of MPICH to
>>> 3.0.4.
>>> > >> > >> I thought before that MPICH & MPICH2 were two different
>>> branches, where MPICH2 partially supports fault tolerance but MPICH does not.
>>> Now I understand that I was wrong and that MPICH is just the newer version of MPICH2.
>>> > >> > >>
>>> > >> > >> Thank you very much,
>>> > >> > >> Anatoly.
>>> > >> > >>
>>> > >> > >>
>>> > >> > >>
>>> > >> > >> On Thu, Dec 5, 2013 at 11:01 PM, Rajeev Thakur <
>>> thakur at mcs.anl.gov> wrote:
>>> > >> > >> The master is receiving more incoming messages than it can
>>> match quickly enough with Irecvs. Try using MPI_Issend instead of MPI_Isend.
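>>> > >> > >>
>>> > >> > >> [For illustration, a minimal self-contained sketch of this change,
>>> > >> > >> assuming the 2 Kbyte buffers, a single data tag and rank 0 as the
>>> > >> > >> master from the scenario below; the master side here is just a plain
>>> > >> > >> MPI_Recv loop so that the sketch is complete, not the actual code:]
>>> > >> > >>
>>> > >> > >> #include <mpi.h>
>>> > >> > >> #include <string.h>
>>> > >> > >>
>>> > >> > >> #define DATA_TAG 2
>>> > >> > >> #define BUF_SIZE 2048          /* 2 Kbyte buffer as in the scenario */
>>> > >> > >> #define NSENDS   100000
>>> > >> > >>
>>> > >> > >> int main(int argc, char **argv)
>>> > >> > >> {
>>> > >> > >>     int rank, nprocs;
>>> > >> > >>     MPI_Init(&argc, &argv);
>>> > >> > >>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>> > >> > >>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>> > >> > >>
>>> > >> > >>     if (rank == 0) {           /* master: receive everything */
>>> > >> > >>         char buf[BUF_SIZE];
>>> > >> > >>         for (long i = 0; i < (long)NSENDS * (nprocs - 1); i++)
>>> > >> > >>             MPI_Recv(buf, BUF_SIZE, MPI_BYTE, MPI_ANY_SOURCE, DATA_TAG,
>>> > >> > >>                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>> > >> > >>     } else {                   /* slaves: send non-stop */
>>> > >> > >>         char buf[BUF_SIZE];
>>> > >> > >>         memset(buf, 0, sizeof buf);
>>> > >> > >>         for (int i = 0; i < NSENDS; i++) {
>>> > >> > >>             MPI_Request sreq;
>>> > >> > >>             /* MPI_Issend (synchronous mode) completes only after the
>>> > >> > >>                master has matched the message, so a slow master throttles
>>> > >> > >>                the slaves instead of letting unexpected messages pile up. */
>>> > >> > >>             MPI_Issend(buf, BUF_SIZE, MPI_BYTE, 0, DATA_TAG,
>>> > >> > >>                        MPI_COMM_WORLD, &sreq);
>>> > >> > >>             MPI_Wait(&sreq, MPI_STATUS_IGNORE);
>>> > >> > >>         }
>>> > >> > >>     }
>>> > >> > >>     MPI_Finalize();
>>> > >> > >>     return 0;
>>> > >> > >> }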
>>> > >> > >>
>>> > >> > >> Rajeev
>>> > >> > >>
>>> > >> > >> On Dec 5, 2013, at 2:58 AM, Anatoly G <anatolyrishon at gmail.com>
>>> wrote:
>>> > >> > >>
>>> > >> > >> > Hello.
>>> > >> > >> > I'm using MPICH2 1.5.
>>> > >> > >> > My system contains a master and 16 slaves.
>>> > >> > >> > The system uses a number of communicators.
>>> > >> > >> > A single communicator is used for the scenario below:
>>> > >> > >> > Each slave sends 2 Kbyte data buffers non-stop using MPI_Isend
>>> and waits using MPI_Wait.
>>> > >> > >> > The Master starts with an MPI_Irecv for each slave.
>>> > >> > >> > Then, in an endless loop:
>>> > >> > >> > MPI_Waitany, and an MPI_Irecv on the rank returned by MPI_Waitany.
>>> > >> > >> >
>>> > >> > >> > Another communicator is used for broadcast communication
>>> (commands between the master + slaves),
>>> > >> > >> > but it's not used in parallel with the previous communicator,
>>> > >> > >> > only before or after the data transfer.
>>> > >> > >> >
>>> > >> > >> > The system runs on two computers linked by 1 Gbit/s
>>> Ethernet.
>>> > >> > >> > The Master runs on the first computer, all slaves on the other one.
>>> > >> > >> > Network traffic is ~800 Mbit/s.
>>> > >> > >> >
>>> > >> > >> > After 1-2 minutes of execution, the master process starts to
>>> increase its memory allocation and the network traffic becomes low.
>>> > >> > >> > This memory growth & network slowdown continue
>>> until MPI fails,
>>> > >> > >> > without saving a core file.
>>> > >> > >> > My program doesn't allocate memory itself. Can you please explain
>>> this behaviour?
>>> > >> > >> > How can I make MPI stop the slaves from sending when the Master
>>> can't serve such traffic, instead of allocating memory and failing?
>>> > >> > >> >
>>> > >> > >> >
>>> > >> > >> > Thank you,
>>> > >> > >> > Anatoly.
>>> > >> > >> >
>>> > >> > >> > P.S.
>>> > >> > >> > In my stand-alone test I simulate similar behaviour, but
>>> with a single thread in each process (master & hosts).
>>> > >> > >> > When I start the stand-alone test, the master stops the slaves until
>>> it completes the accumulated data processing, and MPI doesn't increase its
>>> memory allocation.
>>> > >> > >> > When the Master is free, the slaves continue to send data.
>>> > >> >
>>> > >> > --
>>> > >> > Pavan Balaji
>>> > >> > http://www.mcs.anl.gov/~balaji
>>> > >> >
>>> > >>
>>> > >> --
>>> > >> Pavan Balaji
>>> > >> http://www.mcs.anl.gov/~balaji
>>> > >>
>>> > >
>>> >
>>> > --
>>> > Pavan Balaji
>>> > http://www.mcs.anl.gov/~balaji
>>> >
>>>
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>>
>>>
>>
>>
>
-------------- next part --------------

file: src/mpid/ch3/src/ch3u_handle_recv_pkt.c
func: int MPIDI_CH3U_Receive_data_unexpected(MPID_Request * rreq, char *buf, MPIDI_msg_sz_t *buflen, int *complete)
line: rreq->dev.tmpbuf = MPIU_Malloc(rreq->dev.recv_data_sz);

file: src/mpid/ch3/src/ch3u_eager.c
func: int MPIDI_CH3_PktHandler_EagerSend( MPIDI_VC_t *vc, MPIDI_CH3_Pkt_t *pkt,
                    MPIDI_msg_sz_t *buflen, MPID_Request **rreqp )
line: mpi_errno = MPIDI_CH3U_Receive_data_unexpected( rreq, data_buf,
                                                       &data_len, &complete );

file: src/mpid/ch3/channels/nemesis/src/ch3_progress.c
func: int MPID_nem_handle_pkt(MPIDI_VC_t *vc, char *buf, MPIDI_msg_sz_t buflen)
line: if (mpi_errno) MPIU_ERR_POP(mpi_errno);


file: src/mpid/ch3/channels/nemesis/nemesis/netmod/tcp/socksm.c
func: static int MPID_nem_tcp_recv_handler(sockconn_t *const sc)
line: mpi_errno = MPID_nem_handle_pkt(sc_vc, recv_buf, bytes_recvd);

file: src/mpid/ch3/channels/nemesis/nemesis/netmod/tcp/socksm.c
func: int MPID_nem_tcp_connpoll(int in_blocking_poll)
line: if (mpi_errno) MPIU_ERR_POP (mpi_errno);

file: src/mpid/ch3/channels/nemesis/nemesis/include/mpid_nem_inline.h
func: static inline int
      MPID_nem_mpich2_test_recv(MPID_nem_cell_ptr_t *cell, int *in_fbox, int in_blocking_progress)
line: if (mpi_errno) MPIU_ERR_POP (mpi_errno);

file: src/mpi/pt2pt/waitany.c
func: int MPI_Waitany(int count, MPI_Request array_of_requests[], int *index,
        MPI_Status *status)
line: if (mpi_errno != MPI_SUCCESS) goto fn_progress_end_fail;

