[mpich-discuss] running parallel job issue (alexandra)

Alexandra Betouni alexandra_99 at windowslive.com
Wed Oct 23 15:10:48 CDT 2013


Well yes, the job runs on host2 locally, but parallel execution does the same thing as on host1.
Someone here said that it won't work if all the computers have the same IP address...
Well, every node has 127.0.1.1 as its IP, and all of them had the same hostname until I changed two of them. Hydra is the default launcher.
I also forgot to mention that ping host2 and, likewise, ping host1 work fine...
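For what it's worth: Debian/Ubuntu installs normally map the machine's own hostname to 127.0.1.1 in /etc/hosts, and if the names used in the machinefile resolve to that loopback address on the launching node, the processes can all end up on the local machine. A hedged sketch of what /etc/hosts on each node could look like instead, with the 192.168.1.x addresses as placeholders for the real ethernet addresses:

    127.0.0.1     localhost
    192.168.1.11  host1
    192.168.1.12  host2

i.e. each node's own hostname should resolve to the address the other machines can reach, not to 127.0.1.1.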

Sent from my BlackBerry 10 smartphone.
From: discuss-request at mpich.org
Sent: Wednesday, 23 October 2013 22:42
To: discuss at mpich.org
Reply To: discuss at mpich.org
Subject: discuss Digest, Vol 12, Issue 13


Send discuss mailing list submissions to
        discuss at mpich.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.mpich.org/mailman/listinfo/discuss
or, via email, send a message with subject or body 'help' to
        discuss-request at mpich.org

You can reach the person managing the list at
        discuss-owner at mpich.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of discuss digest..."


Today's Topics:

   1. Re:  running parallel job issue! (Reuti)
   2. Re:  running parallel job issue! (Ricardo Román Brenes)
   3.  Failed to allocate memory for an unexpected message
      (Luiz Carlos da Costa Junior)
   4. Re:  Failed to allocate memory for an unexpected message
      (Antonio J. Peña)


----------------------------------------------------------------------

Message: 1
Date: Wed, 23 Oct 2013 17:00:53 +0200
From: Reuti <reuti at staff.uni-marburg.de>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] running parallel job issue!
Message-ID:
        <071A0111-ECEF-437E-9D3C-7412FDCC84D5 at staff.uni-marburg.de>
Content-Type: text/plain; charset=iso-8859-1

On 23.10.2013 at 16:51, Ricardo Román Brenes wrote:

> Hi. I'm not sure how it's done in that version of MPICH, but you need to make sure that the mpd daemon is running on both hosts (mpdboot, or maybe mpiexec, starts it).

No. `mpdboot` has been gone for some time; nowadays Hydra is used (since MPICH2 v1.3).

-- Reuti
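As a quick sanity check that Hydra is in use and what it is doing, the following Hydra options should work with an MPICH 3.0.x install (flags quoted from memory, so treat them as a hint rather than gospel):

    mpiexec -info                                                # print Hydra build details
    mpiexec -verbose -f machinefile -n 4 ./examples/cpi          # show where Hydra launches each proxy
    mpiexec -launcher ssh -f machinefile -n 4 ./examples/cpi     # force the ssh launcher explicitly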


> Does cpi run on host2 locally?
>
> On Oct 23, 2013 8:47 AM, "Alexandra Betouni" <alexandra_99 at windowslive.com> wrote:
>
>
>
>
> Hey there, I am trying to set up a parallel environment with 14 machines running Linux Xubuntu, connected via ethernet.
> They all have the same IPs and the same hostnames. Well, I started by installing mpich-3.0.4 on a single machine; I ran the cpi example on localhost by giving mpiexec -host localhost -n 4 ./examples/cpi and everything worked fine!
> So I continued by changing the hostnames of 2 PCs for a start and setting up ssh between these two, and I also installed mpich-3.0.4 on the other machine.
> By giving the ssh <othermachine> date command, I get the date of the other host without giving a password, so I think I passed that step too.
> The next step was to check whether mpich-3.0.4 runs in parallel, so I created a machine file (a text file giving the hostnames of the two computers, host1 and host2) and saved it in my mpich-3.0.4 build directory. Though when I try to run the cpi code in parallel by giving mpiexec -n 4 -f machinefile ./examples/cpi in my working directory, I get NO errors but no parallel job either...
> All processes are still running on host1, which is my workstation.
> What am I doing wrong?
> Thanks
>
>
> _______________________________________________
> discuss mailing list     discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
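For reference, a minimal version of the two-host setup described above, using Hydra's machinefile syntax (the hostnames come from the message; the :2 per-host process counts are just illustrative):

    host1:2
    host2:2

and then, from the working directory:

    mpiexec -f machinefile -n 4 ./examples/cpi
    mpiexec -f machinefile -n 4 hostname    # should print host1 and host2, not only host1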



------------------------------

Message: 2
Date: Wed, 23 Oct 2013 09:04:29 -0600
From: Ricardo Román Brenes <roman.ricardo at gmail.com>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] running parallel job issue!
Message-ID:
        <CAG-vK_yyAQ_CNFhYBxqoMniE01HVMA2YxJH=wC79m9KkBDVWFg at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Then mpiexec should handle it... sorry, I use an outdated version that
is in the CentOS repos.

------------------------------

Message: 3
Date: Wed, 23 Oct 2013 17:27:27 -0200
From: Luiz Carlos da Costa Junior <lcjunior at ufrj.br>
To: MPICH Discuss <mpich-discuss at mcs.anl.gov>
Subject: [mpich-discuss] Failed to allocate memory for an unexpected
        message
Message-ID:
        <CAOv4ofRY4ajVZecZcDN3d3tdENV=XBMd=5i1TjX3310ZnEFUdg at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi,

I am getting the following error when running my parallel application:

MPI_Recv(186)......................: MPI_Recv(buf=0x125bd840, count=2060,
MPI_CHARACTER, src=24, tag=94, comm=0x84000002, status=0x125fcff0) failed
MPIDI_CH3I_Progress(402)...........:
MPID_nem_mpich2_blocking_recv(905).:
MPID_nem_tcp_connpoll(1838)........:
state_commrdy_handler(1676)........:
MPID_nem_tcp_recv_handler(1564)....:
MPID_nem_handle_pkt(636)...........:
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an
unexpected message. 261895 unexpected messages queued.
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(173)..............: MPI_Send(buf=0x765d2e60, count=2060,
MPI_CHARACTER, dest=0, tag=94, comm=0x84000004) failed
MPID_nem_tcp_connpoll(1826): Communication error with rank 1: Connection
reset by peer


I went to MPICH's FAQ (
http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_am_I_getting_so_many_unexpected_messages.3F
).
It says that most likely the receiver process cannot keep up with the high
number of messages it is receiving.

In my application, the worker processes perform a very large number of
small computations and, after some computation is complete, they send the
data to a special "writer" process that is responsible for writing the output
to disk.
This scheme used to work in a very reasonable fashion, until we faced some
new data with larger parameters that triggered the problem above.

Even though we could redesign the application, for example by creating a
pool of writer processes, we would still have only one hard disk, so the
bottleneck would remain. So this doesn't seem to be a good approach.

As far as I understand, MPICH saves the content of every MPI_SEND in an
internal buffer (I don't know where the buffer is located, on the sender or
the receiver?) to allow the sender to keep computing asynchronously while
the messages are being received.
The problem is that this buffer has been exhausted due to some resource
limitation.

It is very useful to have a buffer, but if the buffer in the writer
process is close to its limit, the worker processes should stop and wait
until some space is freed before sending new data to be written to disk.

Is it possible to check this buffer in MPICH? Or is it possible to check
the number of messages to be received?
Can anyone suggest a better (easy to implement) solution?
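For illustration, one cheap way to get exactly that stop-and-wait behaviour, assuming the workers currently use plain MPI_Send: switch the data sends to MPI_Ssend, which does not complete until the receiver has posted the matching receive, so a busy writer throttles the workers instead of queuing unexpected messages. A minimal C sketch (the function, buffer and tag names are made up for the example; the trace above suggests the real code may be Fortran, where the change is the same):

    #include <mpi.h>

    /* Worker-side sketch: a synchronous send only completes once the
       writer has matched it, so nothing piles up in the unexpected
       message queue on the writer's side. */
    void send_result_to_writer(char *outbuf, int outcount,
                               int writer_rank, int data_tag)
    {
        MPI_Ssend(outbuf, outcount, MPI_CHAR, writer_rank, data_tag,
                  MPI_COMM_WORLD);
    }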

Thanks in advance.

Regards,
Luiz

------------------------------

Message: 4
Date: Wed, 23 Oct 2013 14:42:15 -0500
From: Antonio J. Peña <apenya at mcs.anl.gov>
To: discuss at mpich.org
Cc: MPICH Discuss <mpich-discuss at mcs.anl.gov>
Subject: Re: [mpich-discuss] Failed to allocate memory for an
        unexpected message
Message-ID: <1965559.SsluspJNke at localhost.localdomain>
Content-Type: text/plain; charset="iso-8859-1"


Hi Luiz,

Your error trace indicates that the receiver ran out of memory because of
a very large number (261,895) of eager unexpected messages, i.e., small
messages received without a matching receive operation posted. Whenever
this happens, the receiver allocates a temporary buffer to hold the
received message. This exhausted the available memory on the computer
where the receiver was executing.

To avoid this, try to pre-post receives before the messages arrive; indeed,
this is far more efficient. Maybe you could post an MPI_Irecv per worker in
your writer process and process them after an MPI_Waitany. You may also
consider having multiple writer processes if your use case permits it and
the volume of received messages is too high to be processed by a single
writer.
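A minimal C sketch of that pattern, with the message length and tag taken from the error trace above and everything else (function name, rank layout, termination condition) made up for illustration:

    #include <mpi.h>
    #include <stdlib.h>

    #define MSG_LEN  2060   /* matches count=2060 in the trace above */
    #define DATA_TAG 94     /* matches tag=94 in the trace above     */

    /* Writer process: pre-post one receive per worker, service whichever
       completes first, then re-post it.  Workers are assumed to be ranks
       1..nworkers and the writer rank 0, purely for the example. */
    void writer_loop(int nworkers, long msgs_expected)
    {
        char        (*bufs)[MSG_LEN] = malloc((size_t)nworkers * MSG_LEN);
        MPI_Request *reqs = malloc((size_t)nworkers * sizeof *reqs);
        MPI_Status   status;
        int          idx;

        for (int i = 0; i < nworkers; i++)
            MPI_Irecv(bufs[i], MSG_LEN, MPI_CHAR, i + 1, DATA_TAG,
                      MPI_COMM_WORLD, &reqs[i]);

        for (long done = 0; done < msgs_expected; done++) {
            MPI_Waitany(nworkers, reqs, &idx, &status); /* one message is in  */
            /* ... write bufs[idx] to disk here ...                           */
            MPI_Irecv(bufs[idx], MSG_LEN, MPI_CHAR, idx + 1, DATA_TAG,
                      MPI_COMM_WORLD, &reqs[idx]);      /* re-arm this worker */
        }

        free(bufs);
        free(reqs);
    }

With pre-posted receives the incoming data lands directly in bufs[], so the unexpected-message queue stays empty as long as the writer keeps up.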

  Antonio



------------------------------

_______________________________________________
discuss mailing list
discuss at mpich.org
https://lists.mpich.org/mailman/listinfo/discuss

End of discuss Digest, Vol 12, Issue 13
***************************************

