[mpich-discuss] Error: failed to allocate memory for an unexpected message

XingFENG xingfeng at cse.unsw.edu.au
Fri Oct 3 04:51:07 CDT 2014


Hi Wesley Bland,

While searching the Internet, I realized that the cause could be that the
MPICH installed on these machines is too old. I found the FAQ entry below:
https://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_I_don.27t_like_.3CWHATEVER.3E_about_mpd.2C_or_I.27m_having_a_problem_with_mpdboot.2C_can_you_fix_it.3F

I installed an up-to-date version of Open MPI on these two machines, and the
problem is gone. I guess a recent MPICH would help, too.
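
For anyone who finds this thread later: below is a toy model of what Wesley
described in his reply quoted further down. It is plain Python, not MPI code,
and it is not taken from my program. The names peak_unexpected, window and
recv_every are made up for illustration. It only assumes that a sender can
push eager messages faster than the receiver matches them, and shows how the
queue of unmatched messages grows unless the sender is throttled.

```python
from collections import deque


def peak_unexpected(num_messages, window=None, recv_every=4):
    """Toy model (not MPI): peak length of a receiver's unexpected-message queue.

    The sender tries to emit one message per step. The receiver matches one
    queued message every `recv_every` steps. If `window` is set, the sender
    stalls while that many messages are still unmatched.
    """
    queue = deque()   # messages that arrived before a matching receive
    sent = 0
    peak = 0
    step = 0
    while sent < num_messages or queue:
        step += 1
        # Sender side: send unless the (hypothetical) window is full.
        if sent < num_messages and (window is None or len(queue) < window):
            queue.append(sent)
            sent += 1
            peak = max(peak, len(queue))
        # Receiver side: it is slower than the sender.
        if step % recv_every == 0 and queue:
            queue.popleft()
    return peak


if __name__ == "__main__":
    print("no flow control:", peak_unexpected(1000))
    print("window of 8:    ", peak_unexpected(1000, window=8))
```

Without a window the peak grows with the number of messages sent. With a
window it stays at the window size. In a real MPI program, I would guess the
equivalent is to post receives early or to limit how many sends are
outstanding, as Wesley suggested.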

Thanks very much for your time.

On Thu, Oct 2, 2014 at 8:57 PM, XingFENG <xingfeng at cse.unsw.edu.au> wrote:

> Hi Wesley Bland,
>
> Thanks for your reply.
>
> My code is relatively big (around 2000 lines). I will try to make and
> post a small example later.
>
> On Thu, Oct 2, 2014 at 8:50 PM, Wesley Bland <wbland at anl.gov> wrote:
>
>> Can you provide a minimal code example that reproduces the problem?
>>
>>
>>
>> On Oct 2, 2014, at 2:13 AM, XingFENG <xingfeng at cse.unsw.edu.au> wrote:
>>
>> Hi Wesley Bland,
>>
>> Thanks for your reply.
>>
>> I have modified my code. Each process now first receives and then sends
>> messages from/to the others. However, the same error still appears.
>>
>> I also noticed that the code works fine on a single-node machine. It
>> crashes with this error on a multi-node cluster.
>>
>>
>> On Sun, Sep 28, 2014 at 10:44 PM, Wesley Bland <wbland at anl.gov> wrote:
>>
>>> The problem in this situation usually is that you're not posting enough
>>> receives and too many of your messages are getting buffered by the MPI
>>> library. Make sure you match up your sends and receives and whenever
>>> possible you post your receives early.
>>>
>>> Wesley
>>>
>>>
>>>
>>> > On Sep 28, 2014, at 7:13 AM, XingFENG <xingfeng at cse.unsw.edu.au> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I am running an MPI program on two machines. I got the following errors:
>>> >
>>> >
>>> > ====================================================================
>>> > Fatal error in MPI_Test: Other MPI error, error stack:
>>> > MPI_Test(153)......................: MPI_Test(request=0xa0a088, flag=0x7fff470e86fc, status=0x7fff470e86e0) failed
>>> > MPIDI_CH3I_Progress(150)...........:
>>> > MPID_nem_mpich2_test_recv(800).....:
>>> > MPID_nem_tcp_connpoll(1720)........:
>>> > state_commrdy_handler(1556)........:
>>> > MPID_nem_tcp_recv_handler(1459)....:
>>> > MPID_nem_handle_pkt(493)...........:
>>> > MPIDI_CH3_PktHandler_EagerSend(589): Failed to allocate memory for an unexpected message. 261892 unexpected messages queued.
>>> > Fatal error in MPI_Test: Other MPI error, error stack:
>>> > MPI_Test(153)......................: MPI_Test(request=0xadb128, flag=0x7fff33cba448, status=0x7fff33cba430) failed
>>> > MPIDI_CH3I_Progress(150)...........:
>>> > MPID_nem_mpich2_test_recv(800).....:
>>> > MPID_nem_tcp_connpoll(1720)........:
>>> > state_commrdy_handler(1556)........:
>>> > MPID_nem_tcp_recv_handler(1459)....:
>>> > MPID_nem_handle_pkt(493)...........:
>>> > MPIDI_CH3_PktHandler_EagerSend(589): Failed to allocate memory for an unexpected message. 261890 unexpected messages queued.
>>> > rank 1 in job 11  slave_36134   caused collective abort of all ranks
>>> >   exit status of rank 1: killed by signal 9
>>> >
>>> > ====================================================================
>>> >
>>> >
>>> > I have never seen such errors before. What is the cause of this error?
>>> > Is it an out-of-memory error? (There is 20% of memory remaining on the machines.)
>>> >
>>> > Any help would be greatly appreciated. Thanks in advance!
>>> >
>>> >
>>> > --
>>> > Best Regards.
>>> > ---
>>> > Xing FENG
>>> > PhD Candidate
>>> > Database Research Group
>>> >
>>> > School of Computer Science and Engineering
>>> > University of New South Wales
>>> > NSW 2052, Sydney
>>> >
>>> > Phone: (+61) 413 857 288
>>> > _______________________________________________
>>> > discuss mailing list     discuss at mpich.org
>>> > To manage subscription options or unsubscribe:
>>> > https://lists.mpich.org/mailman/listinfo/discuss



-- 
Best Regards.
---
Xing FENG
PhD Candidate
Database Research Group

School of Computer Science and Engineering
University of New South Wales
NSW 2052, Sydney

Phone: (+61) 413 857 288