[mpich-discuss] Failed to allocate memory for an unexpected message

Luiz Carlos da Costa Junior lcjunior at ufrj.br
Thu Jul 2 13:22:55 CDT 2015


Hello all,

In 2013 I had problems with memory allocation for unexpected messages in
MPI. After your kind assistance, I implemented a "buffer" matrix in the
receiver process, using the MPI_IRECV, MPI_WAITANY and MPI_TESTANY
functions (the code snippet is attached).

It had been working nicely since then until recently, when I hit the same
problem again:

Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186)......................: MPI_Recv(buf=0x7fffe8dd5974, count=1,
> MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0xd213d0) failed
> MPIDI_CH3I_Progress(402)...........:
> MPID_nem_mpich2_blocking_recv(905).:
> MPID_nem_tcp_connpoll(1838)........:
> state_commrdy_handler(1676)........:
> MPID_nem_tcp_recv_handler(1564)....:
> MPID_nem_handle_pkt(636)...........:
> MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an
> unexpected message. 261895 unexpected messages queued.
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186).............: MPI_Recv(buf=0x7fffd052b9f4, count=1,
> MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0xd213d0) failed
> dequeue_and_set_error(596): Communication error with rank 0
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186).............: MPI_Recv(buf=0x7fff58fe5b74, count=1,
> MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0xd213d0) failed
> dequeue_and_set_error(596): Communication error with rank 0
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186).............: MPI_Recv(buf=0x7fff6fae19f4, count=1,
> MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0xd213d0) failed
> dequeue_and_set_error(596): Communication error with rank 0
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186).............: MPI_Recv(buf=0x7fff55bc8e74, count=1,
> MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0xd213d0) failed
> dequeue_and_set_error(596): Communication error with rank 0
>

I'm using MPICH2 1.4.1p1 on a Linux x64 machine (an AWS EC2 instance). The
last execution that hit this error had 63 worker processes sending all of
their output to a single receiver/writer process.

The program and the number of messages sent/received are pretty much the
same. The only explanation I can think of is that today's AWS EC2 instances
have processors that are proportionally faster, relative to network/IO
speed, than the 2013 instances, so the writer process gets "flooded" with
messages sooner. Does that make sense?
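
One idea I am considering on the sender side (assuming the workers currently
use plain MPI_SEND, which MPICH delivers eagerly below a size threshold) is
to switch the output sends to synchronous sends with MPI_SSEND, so that a
worker blocks until the writer has actually matched the message and the
unexpected-message queue cannot grow without bound. A rough sketch of that
idea follows; the subroutine name, argument list and writer rank are only
illustrative, while the character record, the tag and the communicator are
meant to mirror the attached snippet:

c     Illustrative worker-side send of one output record.  MPI_SSEND
c     completes only after the matching receive has started, so a slow
c     writer throttles the workers instead of letting unexpected
c     messages pile up.  The writer rank (iwrit = 0) is an assumption.
      subroutine send_record( crec, nbytes, m_comm_all, m_reccsv )
      implicit none
      include 'mpif.h'
      character crec*(*)                ! output record (at most zbuf bytes)
      integer*4 nbytes                  ! number of bytes to send
      integer*4 m_comm_all              ! communicator, as in the snippet
      integer*4 m_reccsv                ! output-record tag, as in the snippet
      integer*4 iwrit, m_ierr
      parameter( iwrit = 0 )            ! receiver/writer rank (assumed)

      call MPI_SSEND( crec, nbytes, MPI_CHARACTER, iwrit,
     .                m_reccsv, m_comm_all, m_ierr )
      if( m_ierr .ne. 0 ) stop 'MPI_SSEND failed'

      return
      end

With synchronous sends each worker has at most one output record in flight,
so the writer's unexpected queue stays bounded by the number of workers; the
trade-off is that workers may sit idle whenever the writer falls behind.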

Could you please give some advice on how to solve this issue?

Best regards,
Luiz

On 13 March 2014 at 16:01, Luiz Carlos da Costa Junior <lcjunior at ufrj.br>
wrote:

> Thanks again Kenneth, I was able to solve it using MPI_TESTANY.
> Regards, Luiz
>
>
> On 13 March 2014 15:35, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
>
>> On 03/13/2014 12:35 PM, Luiz Carlos da Costa Junior wrote:
>>
>>> Does anyone have any clue about this?
>>>
>>> Thanks in advance.
>>>
>>>
>>> On 12 March 2014 14:40, Luiz Carlos da Costa Junior <lcjunior at ufrj.br
>>> <mailto:lcjunior at ufrj.br>> wrote:
>>>
>>>     Dear Kenneth,
>>>
>>>     Thanks for your quick reply.
>>>     I tested your suggestion and, unfortunately, this approach didn't
>>> work.
>>>
>>>     Question: when I call MPI_IPROBE, does it also account for messages
>>>     that were already received asynchronously?
>>>
>>
>> That should not be the case. If a message has been matched by a
>> recv/irecv, MPI_Probe should not match it again.
>>
>>
>>
>>>     Is there any way to know, for my list of MPI requests (from my
>>>     MPI_IRECV calls), which ones are still open and which ones already
>>>     have messages?
>>>
>>
>> MPI_Test will take a request as an argument and tell you whether or not
>> that requested operation has been completed.
>>
>> Ken
>>
>>
>
>
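
For reference, a minimal sketch of the MPI_TEST usage Ken describes above,
applied to one of the pre-posted requests from the attached snippet (the
index i and the surrounding declarations are only illustrative):

c     Illustrative only: non-blocking check of a single pre-posted receive.
c     m_flag comes back .true. once a message has arrived; m_stat then
c     holds its source, tag and size.  A request that is still open simply
c     returns m_flag = .false. and can be tested again later.
      integer*4 m_stat(MPI_STATUS_SIZE)
      integer*4 i, m_ierr
      logical*4 m_flag

      call MPI_TEST( m_request(i), m_flag, m_stat, m_ierr )
      if( m_flag ) then
c       request i completed: cbuff(i) holds the message; process it and
c       re-post the MPI_IRECV for slot i
      else
c       request i is still open: no message has arrived for it yet
      end if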
-------------- next part --------------
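c     Note: this snippet assumes an "include 'mpif.h'" statement and
c     declarations of zbuf, npro, kont, irecv, isize, lu, irec, idpy,
c     iprn, M_COMM_ALL, M_RECCSV, M_FIMCSV, clu_irec and ilu_irec
c     elsewhere in the routine.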

c     Parameter
c     ---------
      integer*4 zrecv                   ! number of simultaneous receive messages
      parameter(zrecv = 512)

c     Local
c     -----
      integer*4 m_stat(MPI_STATUS_SIZE)
      integer*4 m_slave
      logical*4 m_flag

      integer*4 m_request(zrecv)        ! request identifiers for asynchronous receives
      character cbuff(zrecv)*(zbuf)     ! character buffer for receiving messages
      logical*4 continua

      continua = .true.
      kont = 0

c     Pre-post RECVs
c     --------------
      do irecv = 1, zrecv
        call MPI_IRECV(cbuff(irecv), zbuf, MPI_CHARACTER,
     .                 MPI_ANY_SOURCE, MPI_ANY_TAG, M_COMM_ALL,
     .                 m_request(irecv), m_ierr )
        if( m_ierr .ne. 0 ) then
          write(idpy,1001) 'MPI_IRECV'
          write(iprn,1001) 'MPI_IRECV'
          call fexit(m_ierr)
        end if
      end do !irecv

      do while( continua )

c       Wait for any of the pre-posted requests to arrive
c       -------------------------------------------------
        if( kont .lt. npro-2 ) then
          call MPI_WAITANY(zrecv, m_request, irecv, m_stat, m_ierr)
          if( m_ierr .ne. 0 ) then
            write(idpy,1001) 'MPI_WAITANY'
            write(iprn,1001) 'MPI_WAITANY'
            call fexit(m_ierr)
          end if
        end if

        if( m_stat(MPI_TAG) .eq. M_RECCSV ) then

c         Get message size, unpack information and call output
c         ----------------------------------------------------
          call MPI_GET_COUNT(m_stat,MPI_CHARACTER,isize,m_ierr)
          isize = isize - 8
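c         (the last 8 bytes of the record carry lu and irec; clu_irec is
c          presumably a character*8 buffer EQUIVALENCEd onto the two
c          integer*4 words of ilu_irec)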
          clu_irec(1:8) = cbuff(irecv)(isize+1:isize+8)
          lu   = ilu_irec(1)
          irec = ilu_irec(2)
          call output( lu, irec, isize, cbuff(irecv) )

        else if( m_stat(MPI_TAG) .eq. M_FIMCSV ) then
          kont = kont + 1

        else
          write(iprn,*) ' Unexpected message received.'
          write(idpy,*) ' Unexpected message received.'
          call sfexit( rotina, 001 )

        end if

c       Re-post RECV
c       ------------
        call MPI_IRECV(cbuff(irecv), zbuf, MPI_CHARACTER,
     .                 MPI_ANY_SOURCE, MPI_ANY_TAG, M_COMM_ALL,
     .                 m_request(irecv), m_ierr )
        if( m_ierr .ne. 0 ) then
          write(idpy,1001) 'MPI_IRECV'
          write(iprn,1001) 'MPI_IRECV'
          call fexit(m_ierr)
        end if

c       Check end of process
c       --------------------
        if( kont .eq. npro-2 ) then

c         Check if there is a pending RECCSV message to be received
c         ---------------------------------------------------------
          call MPI_TESTANY(zrecv, m_request, irecv, m_flag,
     .                     m_stat, m_ierr)
          if( m_ierr .ne. 0 ) then
            write(idpy,1001) 'MPI_TESTANY'
            write(iprn,1001) 'MPI_TESTANY'
            call fexit(m_ierr)
          end if
          continua = m_flag ! if there is some pending message, the process continues

        end if

      end do

c     Cancel unused RECVs
c     -------------------
      do irecv = 1, zrecv
        call MPI_CANCEL( m_request(irecv), m_ierr )
        if( m_ierr .ne. 0 ) then
          write(idpy,1001) 'MPI_CANCEL'
          write(iprn,1001) 'MPI_CANCEL'
          call fexit(m_ierr)
        end if
c       a cancelled request must still be completed so MPI can release it
        call MPI_WAIT( m_request(irecv), m_stat, m_ierr )
      end do !irecv

 1001 format(1x,'Error calling MPI function: ',a,' routine: ',a)