[mpich-discuss] running parallel job issue

Alexandra Betouni alexandra_99 at windowslive.com
Wed Oct 23 15:33:42 CDT 2013


OK Antonio, I know what you're talking about.
I will try this and I hope to solve it!
Thanks anyway to all of you :)

Sent from my BlackBerry 10 smartphone.
From: discuss-request at mpich.org
Sent: Wednesday, 23 October 2013 23:22
To: discuss at mpich.org
Reply To: discuss at mpich.org
Subject: discuss Digest, Vol 12, Issue 15


Send discuss mailing list submissions to
        discuss at mpich.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.mpich.org/mailman/listinfo/discuss
or, via email, send a message with subject or body 'help' to
        discuss-request at mpich.org

You can reach the person managing the list at
        discuss-owner at mpich.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of discuss digest..."


Today's Topics:

   1. Re: running parallel job issue (alexandra) (Antonio J. Peña)


----------------------------------------------------------------------

Message: 1
Date: Wed, 23 Oct 2013 15:21:57 -0500
From: Antonio J. Peña <apenya at mcs.anl.gov>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] running parallel job issue (alexandra)
Message-ID: <2206013.AaZ6q0B8ti at localhost.localdomain>
Content-Type: text/plain; charset="utf-8"


The address you got is just a loopback address. That is, it's not the real
IP address of your network interfaces; it is only used for a machine to
communicate with itself through the sockets interface. On Linux, you can
determine the IP address of your network interface with something like:

/sbin/ifconfig eth0

(you may need to replace eth0 with the identifier of your network interface,
but in most cases this should work).

You need to assign a different IP address to each of your computers on the
same network before they can communicate with each other. You will most
likely want to use private addresses, such as 192.168.0.1, 192.168.0.2, etc.
You can easily find out how to do this with a quick web search.
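
A minimal sketch of what that can look like, assuming the two machines are
the host1 and host2 from this thread and that they resolve each other's
names through /etc/hosts (the addresses below are made up for illustration):

    # /etc/hosts on every node -- hypothetical private addresses
    127.0.0.1     localhost
    192.168.0.1   host1
    192.168.0.2   host2

    # verify what each hostname resolves to
    getent hosts host1 host2

The key point is that host1 and host2 should not resolve to 127.0.1.1 (or to
the same address) on any machine; otherwise processes on one node may try to
reach the other node over the loopback interface.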

  Antonio


On Wednesday, October 23, 2013 11:10:48 PM Alexandra Betouni wrote:


Well yes, the job runs on host2 locally, but parallel execution does the
same thing as on host1.
Someone here said that if all the computers have the same IP address it
won't work...
Well, every node has 127.0.1.1 as its IP, and all of them had the same host
name until I changed two of them. Hydra is the default launcher.
I also forgot to mention that ping host2, and likewise ping host1, works
fine...


Sent from my BlackBerry 10 smartphone.
From: discuss-request at mpich.org
Sent: Wednesday, 23 October 2013 22:42
To: discuss at mpich.org
Reply To: discuss at mpich.org
Subject: discuss Digest, Vol 12, Issue 13



Send discuss mailing list submissions to
        discuss at mpich.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.mpich.org/mailman/listinfo/discuss

------------------------------

Message: 3
Date: Wed, 23 Oct 2013 17:27:27 -0200
From: Luiz Carlos da Costa Junior <lcjunior at ufrj.br>
To: MPICH Discuss <mpich-discuss at mcs.anl.gov>
Subject: [mpich-discuss] Failed to allocate memory for an unexpected message
Message-ID: <CAOv4ofRY4ajVZecZcDN3d3tdENV=XBMd=5i1TjX3310ZnEFUdg at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

I am getting the following error when running my parallel application:

MPI_Recv(186)......................: MPI_Recv(buf=0x125bd840, count=2060, MPI_CHARACTER, src=24, tag=94, comm=0x84000002, status=0x125fcff0) failed
MPIDI_CH3I_Progress(402)...........:
MPID_nem_mpich2_blocking_recv(905).:
MPID_nem_tcp_connpoll(1838)........:
state_commrdy_handler(1676)........:
MPID_nem_tcp_recv_handler(1564)....:
MPID_nem_handle_pkt(636)...........:
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..............: MPI_Send(buf=0x765d2e60, count=2060, MPI_CHARACTER, dest=0, tag=94, comm=0x84000004) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection reset by peer


I went to MPICH's FAQ
(http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F).
It says that most likely the receiver process can't keep up with the high
number of messages it is receiving.

------------------------------

Message: 4
Date: Wed, 23 Oct 2013 14:42:15 -0500
From: Antonio J. Peña <apenya at mcs.anl.gov>
To: discuss at mpich.org
Cc: MPICH Discuss <mpich-discuss at mcs.anl.gov>
Subject: Re: [mpich-discuss] Failed to allocate memory for an unexpected message
Message-ID: <1965559.SsluspJNke at localhost.localdomain>
Content-Type: text/plain; charset="iso-8859-1"


Hi Luiz,

Your error trace indicates that the receiver ran out of memory because of the
very large number (261,895) of eager unexpected messages it received, i.e.,
small messages that arrived without a matching receive having been posted.
Whenever this happens, the receiver allocates a temporary buffer to hold the
message, and in your case this exhausted the available memory on the computer
where the receiver was running.

To avoid this, try to pre-post receives before the messages arrive; this is
also far more efficient. For example, you could post an MPI_Irecv per worker
in your writer process and handle completed receives after an MPI_Waitany,
as sketched below. You may also consider having multiple writer processes,
if your use case permits it, when the volume of received messages is too
high to be processed by a single writer.

  Antonio
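
A minimal sketch of that pre-posted receive pattern, in C for illustration
(the original application appears to use Fortran, given MPI_CHARACTER in the
trace); the worker ranks, message size, tag, and termination count below are
all assumptions made for the example:

    #include <mpi.h>
    #include <stdlib.h>

    #define MSG_LEN 2060   /* hypothetical message size, mirrors the count in the trace */
    #define TAG     94     /* hypothetical tag */

    /* Writer loop: keep one receive pre-posted per worker so incoming
     * messages always find a matching receive and are never queued as
     * "unexpected". Workers are assumed to be ranks 1..nworkers. */
    void writer_loop(MPI_Comm comm, int nworkers, long nmessages)
    {
        char        *bufs = malloc((size_t)nworkers * MSG_LEN);
        MPI_Request *reqs = malloc((size_t)nworkers * sizeof(MPI_Request));
        long         received = 0;

        /* Pre-post one receive per worker. */
        for (int w = 0; w < nworkers; w++)
            MPI_Irecv(bufs + (size_t)w * MSG_LEN, MSG_LEN, MPI_CHAR,
                      w + 1, TAG, comm, &reqs[w]);

        while (received < nmessages) {
            int        idx;
            MPI_Status status;

            /* Wait for any one of the pre-posted receives to complete. */
            MPI_Waitany(nworkers, reqs, &idx, &status);
            received++;

            /* ... process / write the data in bufs + idx * MSG_LEN here ... */

            /* Immediately re-post the receive for that worker. */
            MPI_Irecv(bufs + (size_t)idx * MSG_LEN, MSG_LEN, MPI_CHAR,
                      status.MPI_SOURCE, TAG, comm, &reqs[idx]);
        }

        free(bufs);
        free(reqs);
    }

Re-posting the receive for a worker immediately after handling its message
keeps one receive outstanding per worker, so incoming messages match right
away instead of piling up in the unexpected-message queue.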


On Wednesday, October 23, 2013 05:27:27 PM Luiz Carlos da Costa Junior
wrote:


Hi,


I am getting the following error when running my parallel application:


MPI_Recv(186)......................: MPI_Recv(buf=0x125bd840, count=2060, MPI_CHARACTER, src=24, tag=94, comm=0x84000002, status=0x125fcff0) failed
MPIDI_CH3I_Progress(402)...........:
MPID_nem_mpich2_blocking_recv(905).:
MPID_nem_tcp_connpoll(1838)........:
state_commrdy_handler(1676)........:
MPID_nem_tcp_recv_handler(1564)....:
MPID_nem_handle_pkt(636)...........:
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..............: MPI_Send(buf=0x765d2e60, count=2060, MPI_CHARACTER, dest=0, tag=94, comm=0x84000004) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection reset by peer


I went to MPICH's FAQ
(http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F).
It says that most likely the receiver process can't keep up with the high
number of messages it is receiving.

------------------------------

_______________________________________________
discuss mailing list
discuss at mpich.org
https://lists.mpich.org/mailman/listinfo/discuss

End of discuss Digest, Vol 12, Issue 15
***************************************

