[mpich-discuss] running parallel job issue (alexandra)
Antonio J. Peña
apenya at mcs.anl.gov
Wed Oct 23 15:21:57 CDT 2013
The address you got is just a loopback address: it is not the real IP
address of your network interfaces, and it is only used for
self-communication through the sockets interface. On Linux, you can
determine the IP address of your network interface with something like:
/sbin/ifconfig eth0
(you may need to replace eth0 with the identifier of your network interface,
but in most cases this should work).
You should assign a different IP address to each of your computers on the
same network so they can communicate with each other. You will most likely
want to assign private addresses, such as 192.168.0.1, 192.168.0.2, etc.
You can easily find out how to do this by googling a little bit; a quick
sketch follows.
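For example, on Linux you could set them temporarily with something like
the following (run as root; eth0 and the 192.168.0.x range are just
assumptions here, adjust them to your setup):

ifconfig eth0 192.168.0.1 netmask 255.255.255.0    # on host1
ifconfig eth0 192.168.0.2 netmask 255.255.255.0    # on host2

Then make sure /etc/hosts on every node maps host1 and host2 to those
addresses rather than to 127.0.1.1, e.g.:

192.168.0.1    host1
192.168.0.2    host2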
Antonio
On Wednesday, October 23, 2013 11:10:48 PM Alexandra Betouni wrote:
Well yes, the job runs on host2 locally, but parallel execution does the
same thing as on host1.
Someone here said that if all the computers have the same IP address it
won't work..
Well, every node has 127.0.1.1 as its IP, and all of them had the same
host name until I changed the two of them. Hydra is the default launcher.
I also forgot to mention that ping host2 and, likewise, ping host1 work
fine...
Sent from my BlackBerry 10 smartphone.
From: discuss-request at mpich.org
Sent: Wednesday, 23 October 2013 22:42
To: discuss at mpich.org
Reply To: discuss at mpich.org
Subject: discuss Digest, Vol 12, Issue 13
Send discuss mailing list submissions to discuss at mpich.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.mpich.org/mailman/listinfo/discuss
------------------------------
Message: 3
Date: Wed, 23 Oct 2013 17:27:27 -0200
From: Luiz Carlos da Costa Junior <lcjunior at ufrj.br>
To: MPICH Discuss <mpich-discuss at mcs.anl.gov>
Subject: [mpich-discuss] Failed to allocate memory for an unexpected message
Message-ID: <CAOv4ofRY4ajVZecZcDN3d3tdENV=XBMd=5i1TjX3310ZnEFUdg at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
Hi,
I am getting the following error when running my parallel application:
MPI_Recv(186)......................: MPI_Recv(buf=0x125bd840, count=2060,
MPI_CHARACTER, src=24, tag=94, comm=0x84000002, status=0x125fcff0) failed
MPIDI_CH3I_Progress(402)...........:
MPID_nem_mpich2_blocking_recv(905).:
MPID_nem_tcp_connpoll(1838)........:
state_commrdy_handler(1676)........:
MPID_nem_tcp_recv_handler(1564)....:
MPID_nem_handle_pkt(636)...........:
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an
unexpected message. 261895 unexpected messages queued.
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..............: MPI_Send(buf=0x765d2e60, count=2060,
MPI_CHARACTER, dest=0, tag=94, comm=0x84000004) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection
reset by peer
I went to MPICH's FAQ
(http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F).
It says that most likely the receiver process can't cope with the high
number of messages it is receiving.
------------------------------
Message: 4
Date: Wed, 23 Oct 2013 14:42:15 -0500
From: Antonio J. Peña <apenya at mcs.anl.gov>
To: discuss at mpich.org
Cc: MPICH Discuss <mpich-discuss at mcs.anl.gov>
Subject: Re: [mpich-discuss] Failed to allocate memory for an unexpected message
Message-ID: <1965559.SsluspJNke at localhost.localdomain>
Content-Type: text/plain; charset="iso-8859-1"
Hi Luiz,
Your error trace indicates that the receiver ran out of memory because of
the very large number (261,895) of eager unexpected messages it received,
i.e., small messages that arrived before a matching receive operation was
posted. Whenever this happens, the receiver allocates a temporary buffer
to hold the received message; queuing that many messages exhausted the
available memory on the computer where the receiver was executing.
To avoid this, try to pre-post your receives before the messages arrive;
this is also far more efficient. For example, you could post one MPI_Irecv
per worker in your writer process and handle completions after an
MPI_Waitany, as in the sketch below. You may also consider having multiple
writer processes, if your use case permits, when the volume of received
messages is too high for a single writer to process.
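Here is a minimal sketch of that pattern in C (your trace shows
MPI_CHARACTER, which suggests Fortran, but the idea carries over directly).
The worker count, buffer length, tag, and rank layout are assumptions for
illustration, not taken from your application:

#include <mpi.h>
#include <stdlib.h>

#define NWORKERS 32   /* assumed number of worker ranks */
#define BUFLEN 2060   /* message size, matching the count in your trace */
#define TAG 94

/* Writer loop: keep one receive pre-posted per worker, so incoming
   messages find a matching buffer instead of being queued as
   unexpected. */
void writer_loop(MPI_Comm comm, long messages_expected)
{
    char (*buf)[BUFLEN] = malloc(NWORKERS * sizeof *buf);
    MPI_Request req[NWORKERS];

    /* Pre-post one receive per worker (workers assumed on ranks
       1..NWORKERS). */
    for (int w = 0; w < NWORKERS; w++)
        MPI_Irecv(buf[w], BUFLEN, MPI_CHAR, w + 1, TAG, comm, &req[w]);

    for (long n = 0; n < messages_expected; n++) {
        int w;
        MPI_Status st;
        MPI_Waitany(NWORKERS, req, &w, &st); /* any completed receive */

        /* ... process/write buf[w] here ... */

        /* Immediately re-post the receive for that worker. */
        MPI_Irecv(buf[w], BUFLEN, MPI_CHAR, st.MPI_SOURCE, TAG,
                  comm, &req[w]);
    }

    /* Cancel the receives still pre-posted after the last message. */
    for (int w = 0; w < NWORKERS; w++) {
        MPI_Cancel(&req[w]);
        MPI_Wait(&req[w], MPI_STATUS_IGNORE);
    }
    free(buf);
}

If a worker can send several messages back to back, you can pre-post a
small window of receives per worker instead of just one, at the cost of
some extra buffer memory.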
Antonio
On Wednesday, October 23, 2013 05:27:27 PM Luiz Carlos da Costa Junior
wrote:
Hi,
I am getting the following error when running my parallel application:
MPI_Recv(186)......................: MPI_Recv(buf=0x125bd840, count=2060,
MPI_CHARACTER, src=24, tag=94, comm=0x84000002, status=0x125fcff0) failed
MPIDI_CH3I_Progress(402)...........:
MPID_nem_mpich2_blocking_recv(905).:
MPID_nem_tcp_connpoll(1838)........:
state_commrdy_handler(1676)........:
MPID_nem_tcp_recv_handler(1564)....:
MPID_nem_handle_pkt(636)...........:
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an
unexpected message. 261895 unexpected messages queued.
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..............: MPI_Send(buf=0x765d2e60, count=2060,
MPI_CHARACTER, dest=0, tag=94, comm=0x84000004) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection
reset by peer
I went to MPICH's FAQ
(http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F).
It says that most likely the receiver process can't cope with the high
number of messages it is receiving.