[mpich-discuss] running parallel job issue (alexandra)
Antonio J. Peña
apenya at mcs.anl.gov
Wed Oct 23 15:21:57 CDT 2013
The address you got is just a loopback address: it is not the real IP
address of your network interfaces, and it is only used for
self-communication through the sockets interface. On Linux, you can
determine the IP address of your network interface with something like:
/sbin/ifconfig eth0
(you may need to replace eth0 with the identifier of your network interface,
but in most cases this should work).
You should assign a different IP address to each of your computers on the
same network so they can communicate with each other. You will most likely
want to assign private addresses, such as 192.168.0.1, 192.168.0.2, etc.
You can easily find out how to do this by googling a little bit; a quick
sketch follows.
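For example, on Linux you could set them temporarily with something like
the following (run as root; eth0 and the 192.168.0.x range are just
assumptions here, adjust them to your setup):

ifconfig eth0 192.168.0.1 netmask 255.255.255.0    # on host1
ifconfig eth0 192.168.0.2 netmask 255.255.255.0    # on host2

Then make sure /etc/hosts on every node maps host1 and host2 to those
addresses rather than to 127.0.1.1, e.g.:

192.168.0.1    host1
192.168.0.2    host2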
Antonio
On Wednesday, October 23, 2013 11:10:48 PM Alexandra Betouni wrote:
Well yes, the job runs on host2 locally, but parallel execution does the
same thing as on host1.
Someone here said that if all the computers have the same IP address it
won't work..
Well, every node has 127.0.1.1 as its IP, and all of them had the same
host name until I changed the two of them. Hydra is the default launcher.
I also forgot to mention that ping host2 and, likewise, ping host1 work
fine...
Sent from my BlackBerry 10 smartphone.
From: discuss-request at mpich.org
Sent: Wednesday, 23 October 2013 22:42
To: discuss at mpich.org
Reply To: discuss at mpich.org
Subject: discuss Digest, Vol 12, Issue 13
Send discuss mailing list submissions to discuss at mpich.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.mpich.org/mailman/listinfo/discuss
------------------------------
Message: 3
Date: Wed, 23 Oct 2013 17:27:27 -0200
From: Luiz Carlos da Costa Junior <lcjunior at ufrj.br>
To: MPICH Discuss <mpich-discuss at mcs.anl.gov>
Subject: [mpich-discuss] Failed to allocate memory for an unexpected message
Message-ID: <CAOv4ofRY4ajVZecZcDN3d3tdENV=XBMd=5i1TjX3310ZnEFUdg at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Hi,
I am getting the following error when running my parallel application:
MPI_Recv(186)......................: MPI_Recv(buf=0x125bd840, count=2060,
MPI_CHARACTER, src=24, tag=94, comm=0x84000002, status=0x125fcff0) failed
MPIDI_CH3I_Progress(402)...........:
MPID_nem_mpich2_blocking_recv(905).:
MPID_nem_tcp_connpoll(1838)........:
state_commrdy_handler(1676)........:
MPID_nem_tcp_recv_handler(1564)....:
MPID_nem_handle_pkt(636)...........:
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an
unexpected message. 261895 unexpected messages queued.
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..............: MPI_Send(buf=0x765d2e60, count=2060,
MPI_CHARACTER, dest=0, tag=94, comm=0x84000004) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection
reset by peer
I went to MPICH's FAQ
(http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F).
It says that most likely the receiver process can't cope with the high
number of messages it is receiving.
------------------------------
Message: 4
Date: Wed, 23 Oct 2013 14:42:15 -0500
From: Antonio J. Peña <apenya at mcs.anl.gov>
To: discuss at mpich.org
Cc: MPICH Discuss <mpich-discuss at mcs.anl.gov>
Subject: Re: [mpich-discuss] Failed to allocate memory for an unexpected message
Message-ID: <1965559.SsluspJNke at localhost.localdomain>
Content-Type: text/plain; charset="iso-8859-1"
Hi Luiz,
Your error trace indicates that the receiver ran out of memory because of
the very large number (261,895) of eager unexpected messages it received,
i.e., small messages that arrived before a matching receive operation was
posted. Whenever this happens, the receiver allocates a temporary buffer
to hold the received message; queuing that many messages exhausted the
available memory on the computer where the receiver was executing.
To avoid this, try to pre-post your receives before the messages arrive;
this is also far more efficient. For example, you could post one MPI_Irecv
per worker in your writer process and handle completions after an
MPI_Waitany, as in the sketch below. You may also consider having multiple
writer processes, if your use case permits, when the volume of received
messages is too high for a single writer to process.
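Here is a minimal sketch of that pattern in C (your trace shows
MPI_CHARACTER, which suggests Fortran, but the idea carries over directly).
The worker count, buffer length, tag, and rank layout are assumptions for
illustration, not taken from your application:

#include <mpi.h>
#include <stdlib.h>

#define NWORKERS 32   /* assumed number of worker ranks */
#define BUFLEN 2060   /* message size, matching the count in your trace */
#define TAG 94

/* Writer loop: keep one receive pre-posted per worker, so incoming
   messages find a matching buffer instead of being queued as
   unexpected. */
void writer_loop(MPI_Comm comm, long messages_expected)
{
    char (*buf)[BUFLEN] = malloc(NWORKERS * sizeof *buf);
    MPI_Request req[NWORKERS];

    /* Pre-post one receive per worker (workers assumed on ranks
       1..NWORKERS). */
    for (int w = 0; w < NWORKERS; w++)
        MPI_Irecv(buf[w], BUFLEN, MPI_CHAR, w + 1, TAG, comm, &req[w]);

    for (long n = 0; n < messages_expected; n++) {
        int w;
        MPI_Status st;
        MPI_Waitany(NWORKERS, req, &w, &st); /* any completed receive */

        /* ... process/write buf[w] here ... */

        /* Immediately re-post the receive for that worker. */
        MPI_Irecv(buf[w], BUFLEN, MPI_CHAR, st.MPI_SOURCE, TAG,
                  comm, &req[w]);
    }

    /* Cancel the receives still pre-posted after the last message. */
    for (int w = 0; w < NWORKERS; w++) {
        MPI_Cancel(&req[w]);
        MPI_Wait(&req[w], MPI_STATUS_IGNORE);
    }
    free(buf);
}

If a worker can send several messages back to back, you can pre-post a
small window of receives per worker instead of just one, at the cost of
some extra buffer memory.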
Antonio
On Wednesday, October 23, 2013 05:27:27 PM Luiz Carlos da Costa Junior
wrote:
Hi,
I am getting the following error when running my parallel application:
MPI_Recv(186)......................: MPI_Recv(buf=0x125bd840, count=2060,
MPI_CHARACTER, src=24, tag=94, comm=0x84000002, status=0x125fcff0) failed
MPIDI_CH3I_Progress(402)...........:
MPID_nem_mpich2_blocking_recv(905).:
MPID_nem_tcp_connpoll(1838)........:
state_commrdy_handler(1676)........:
MPID_nem_tcp_recv_handler(1564)....:
MPID_nem_handle_pkt(636)...........:
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an
unexpected message. 261895 unexpected messages queued.
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..............: MPI_Send(buf=0x765d2e60, count=2060,
MPI_CHARACTER, dest=0, tag=94, comm=0x84000004) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection
reset by peer
I went to MPICH's FAQ
(http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F).
It says that most likely the receiver process can't cope with the high
number of messages it is receiving.