[mpich-discuss] Failed to allocate memory for an unexpected message
Luiz Carlos da Costa Junior
lcjunior at ufrj.br
Thu Jul 2 13:22:55 CDT 2015
Hello all,
In 2013 I had problems with the allocation of unexpected messages in MPI.
With your kind assistance back then, I implemented a "buffer" matrix in the
receiver process using the MPI_IRECV, MPI_WAITANY and MPI_TESTANY functions
(the code snippet is attached).
It had been working nicely since then, until recently, when I faced the same
problem again:
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186)......................: MPI_Recv(buf=0x7fffe8dd5974, count=1,
> MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0xd213d0) failed
> MPIDI_CH3I_Progress(402)...........:
> MPID_nem_mpich2_blocking_recv(905).:
> MPID_nem_tcp_connpoll(1838)........:
> state_commrdy_handler(1676)........:
> MPID_nem_tcp_recv_handler(1564)....:
> MPID_nem_handle_pkt(636)...........:
> MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an
> unexpected message. 261895 unexpected messages queued.
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186).............: MPI_Recv(buf=0x7fffd052b9f4, count=1,
> MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0xd213d0) failed
> dequeue_and_set_error(596): Communication error with rank 0
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186).............: MPI_Recv(buf=0x7fff58fe5b74, count=1,
> MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0xd213d0) failed
> dequeue_and_set_error(596): Communication error with rank 0
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186).............: MPI_Recv(buf=0x7fff6fae19f4, count=1,
> MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0xd213d0) failed
> dequeue_and_set_error(596): Communication error with rank 0
> Fatal error in MPI_Recv: Other MPI error, error stack:
> MPI_Recv(186).............: MPI_Recv(buf=0x7fff55bc8e74, count=1,
> MPI_INTEGER, src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0xd213d0) failed
> dequeue_and_set_error(596): Communication error with rank 0
>
I'm using MPICH2 1.4.1p1 on a Linux x64 machine (an AWS EC2 instance). The
last execution that hit this error had 63 worker processes sending all of
their output to a single receiver/writer process.
The program and the number of messages sent/received are pretty much the
same as before. The only explanation I can think of is that today's AWS EC2
instances have processors that are proportionally faster, relative to
network/IO speed, than the 2013 instances. If so, the writer process
probably gets "flooded" with messages sooner than it used to. Does that
make sense?
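For reference, the sending side in each worker boils down to something like
the pattern below (a simplified sketch, not the actual code: m_writer, nchar
and crec are placeholder names, while zbuf, M_RECCSV, M_FIMCSV and
M_COMM_ALL are the ones used in the attached receiver snippet):

c     Worker-side pattern (simplified sketch, placeholder names)
      include 'mpif.h'
      integer*4 m_ierr, m_writer, nchar
      character crec*(zbuf)
c     ... build one output record in crec(1:nchar), including the
c     8-character trailer, then ship it to the single writer ...
      call MPI_SEND(crec, nchar, MPI_CHARACTER,
     .              m_writer, M_RECCSV, M_COMM_ALL, m_ierr)
c     ... after the last record, tell the writer this worker is done ...
      call MPI_SEND(crec, 1, MPI_CHARACTER,
     .              m_writer, M_FIMCSV, M_COMM_ALL, m_ierr)

Since these records are small, MPICH delivers them eagerly, and whenever the
receiving side falls behind they pile up in its unexpected-message queue,
which would explain the "Failed to allocate memory for an unexpected
message" error above.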
Could you please give some advice on how to solve this issue?
Best regards,
Luiz
On 13 March 2014 at 16:01, Luiz Carlos da Costa Junior <lcjunior at ufrj.br>
wrote:
> Thanks again, Kenneth. I was able to solve it using MPI_TESTANY.
> Regards, Luiz
>
>
> On 13 March 2014 15:35, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
>
>> On 03/13/2014 12:35 PM, Luiz Carlos da Costa Junior wrote:
>>
>>> Does anyone have any clue about this?
>>>
>>> Thanks in advance.
>>>
>>>
>>> On 12 March 2014 14:40, Luiz Carlos da Costa Junior <lcjunior at ufrj.br>
>>> wrote:
>>>
>>> Dear Kenneth,
>>>
>>> Thanks for your quick reply.
>>> I tested your suggestion and, unfortunately, this approach didn't
>>> work.
>>>
>>> Question: when I call MPI_IPROBE, does it also account for the messages
>>> that were already received asynchronously?
>>>
>>
>> That should not be the case. If a message has been matched by a
>> recv/irecv, MPI_Probe should not match it again.
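For reference, a minimal MPI_IPROBE call in Fortran looks like this (a
sketch, not taken from the code under discussion; M_COMM_ALL is the
communicator from the attached snippet): m_flag comes back .true. only if a
pending message has not yet been matched by a posted receive.

      include 'mpif.h'
      logical*4 m_flag
      integer*4 m_stat(MPI_STATUS_SIZE), m_ierr
      call MPI_IPROBE(MPI_ANY_SOURCE, MPI_ANY_TAG, M_COMM_ALL,
     .                m_flag, m_stat, m_ierr)
      if( m_flag ) then
c        an unmatched message is waiting; m_stat holds its source and tag
      end if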
>>
>>
>>
>>> Is there any way to know, for my list of MPI requests (from my
>>> MPI_IRECV calls), which ones are still open and which ones have messages?
>>>
>>
>> MPI_Test will take a request as an argument and tell you whether or not
>> that requested operation has been completed.
>>
>> Ken
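For reference, Kenneth's suggestion looks roughly like this in Fortran (a
sketch only; m_request, cbuff and m_stat follow the naming of the attached
snippet, and i is the index of the request being queried):

      logical*4 m_flag
      integer*4 m_stat(MPI_STATUS_SIZE), m_ierr
c     Non-blocking test of the i-th pre-posted receive
      call MPI_TEST(m_request(i), m_flag, m_stat, m_ierr)
      if( m_flag ) then
c        request i has completed: its data is in cbuff(i) and m_stat
c        holds the source and tag of the received message
      end if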
-------------- next part --------------
c     Parameter
c     ---------
      integer*4 zrecv         ! number of simultaneous receive messages
      parameter(zrecv = 512)
c     Local
c     -----
      integer*4 m_stat(MPI_STATUS_SIZE)
      integer*4 m_slave
      logical*4 m_flag
      integer*4 m_request(zrecv)    ! asynchronous receive request ids
      character cbuff(zrecv)*(zbuf) ! buffers for receiving messages
      logical*4 continua
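c     Note: this excerpt assumes the enclosing routine provides
c     include 'mpif.h', the record length zbuf, the output units idpy
c     and iprn, the process count npro, the communicator M_COMM_ALL,
c     the tags M_RECCSV and M_FIMCSV, the variables m_ierr and rotina,
c     and the clu_irec/ilu_irec pair (apparently an EQUIVALENCEd
c     character/integer view used to unpack the 8-character trailer
c     of each record).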
      continua = .true.
      kont     = 0

c     Pre-post RECVs
c     --------------
      do irecv = 1, zrecv
        call MPI_IRECV(cbuff(irecv), zbuf, MPI_CHARACTER,
     .                 MPI_ANY_SOURCE, MPI_ANY_TAG, M_COMM_ALL,
     .                 m_request(irecv), m_ierr )
        if( m_ierr .ne. 0 ) then
          write(idpy,1001) 'MPI_IRECV'
          write(iprn,1001) 'MPI_IRECV'
          call fexit(m_ierr)
        end if
      end do !irecv

      do while( continua )

c       Wait for any of the pre-posted requests to arrive
c       -------------------------------------------------
        if( kont .lt. npro-2 ) then
          call MPI_WAITANY(zrecv, m_request, irecv, m_stat, m_ierr)
          if( m_ierr .ne. 0 ) then
            write(idpy,1001) 'MPI_WAITANY'
            write(iprn,1001) 'MPI_WAITANY'
            call fexit(m_ierr)
          end if
        end if

        if( m_stat(MPI_TAG) .eq. M_RECCSV ) then

c         Get message size, unpack information and call output
c         ----------------------------------------------------
          call MPI_GET_COUNT(m_stat,MPI_CHARACTER,isize,m_ierr)
          isize = isize - 8
          clu_irec(1:8) = cbuff(irecv)(isize+1:isize+8)
          lu   = ilu_irec(1)
          irec = ilu_irec(2)
          call output( lu, irec, isize, cbuff(irecv) )

        else if( m_stat(MPI_TAG) .eq. M_FIMCSV ) then
          kont = kont + 1
        else
          write(iprn,*) ' Unexpected message received.'
          write(idpy,*) ' Unexpected message received.'
          call sfexit( rotina, 001 )
        end if

c       Re-post RECV
c       ------------
        call MPI_IRECV(cbuff(irecv), zbuf, MPI_CHARACTER,
     .                 MPI_ANY_SOURCE, MPI_ANY_TAG, M_COMM_ALL,
     .                 m_request(irecv), m_ierr )
        if( m_ierr .ne. 0 ) then
          write(idpy,1001) 'MPI_IRECV'
          write(iprn,1001) 'MPI_IRECV'
          call fexit(m_ierr)
        end if

c       Check end of process
c       --------------------
        if( kont .eq. npro-2 ) then

c         Check if there is a pending RECCSV message to be received
c         ---------------------------------------------------------
          call MPI_TESTANY(zrecv, m_request, irecv, m_flag,
     .                     m_stat, m_ierr)
          if( m_ierr .ne. 0 ) then
            write(idpy,1001) 'MPI_TESTANY'
            write(iprn,1001) 'MPI_TESTANY'
            call fexit(m_ierr)
          end if
c         if there is some pending message, the process continues
          continua = m_flag
        end if

      end do

c     Cancel unused RECVs
c     -------------------
      do irecv = 1, zrecv
        call MPI_CANCEL( m_request(irecv), m_ierr )
        if( m_ierr .ne. 0 ) then
          write(idpy,1001) 'MPI_CANCEL'
          write(iprn,1001) 'MPI_CANCEL'
          call fexit(m_ierr)
        end if
      end do !irecv

 1001 format(1x,'Error calling MPI function: ',a,' routine: ',a)