[mpich-discuss] using MPICH_FAILED_PROCESSES

Ma, Zhaoming zhaoming.ma at citi.com
Fri Nov 30 16:50:32 CST 2012


Hi Darius,

Here is the error message when the default error handler was used,

Fatal error in MPIR_CommGetAttr: Invalid argument, error stack:
MPIR_CommGetAttr(261): MPIR_Comm_get_attr(MPI_COMM_WORLD, comm_keyval=-1539309568, attribute_val=(nil), flag=0x7fff7f5dc048) failed
MPIR_CommGetAttr(85).: Null pointer in parameter attr_val
/Users/zm35101:$ mpiexec -n 2 ./MPI_test
Fatal error in MPIR_CommGetAttr: Invalid argument, error stack:
MPIR_CommGetAttr(261): MPIR_Comm_get_attr(MPI_COMM_WORLD, comm_keyval=-1539309568, attribute_val=(nil), flag=0x7fffb364fdf8) failed
MPIR_CommGetAttr(85).: Null pointer in parameter attr_val
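
For reference, the "Null pointer in parameter attr_val" line above means the call passed NULL as the attribute-value argument. A minimal C sketch of the intended usage (relying on the MPICH2 1.4 MPICH_ATTR_FAILED_PROCESSES extension discussed in this thread; what the returned value points to is covered by the C example linked earlier) passes the address of a pointer instead, and the same applies to attribute_val in the C++ binding:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    void *failed = NULL;   /* MPICH stores the attribute value here */
    int   flag   = 0;

    MPI_Init(&argc, &argv);

    /* Pass the address of a pointer, not NULL, as attribute_val. */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPICH_ATTR_FAILED_PROCESSES,
                      &failed, &flag);
    if (flag)
        printf("failed-process attribute is set\n");

    MPI_Finalize();
    return 0;
}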

Zhaoming

-----Original Message-----
From: discuss-bounces at mpich.org On Behalf Of discuss-request at mpich.org
Sent: Friday, November 30, 2012 4:37 PM
To: discuss at mpich.org
Subject: discuss Digest, Vol 1, Issue 18

Send discuss mailing list submissions to
        discuss at mpich.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://lists.mpich.org/mailman/listinfo/discuss
or, via email, send a message with subject or body 'help' to
        discuss-request at mpich.org

You can reach the person managing the list at
        discuss-owner at mpich.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of discuss digest..."


Today's Topics:

   1. Re:  How to specify number of cores for each process
      (Pavan Balaji)
   2. Re:  How to specify number of cores for each process
      (Pavan Balaji)
   3. Re:  How to specify number of cores for each process
      (Pavan Balaji)
   4.  MPI and signal handlers (Matthieu Dorier)
   5. Re:  using MPICH_FAILED_PROCESSES (Darius Buntinas)
   6. Re:  MPI and signal handlers (Darius Buntinas)
   7. Re:  MPI and signal handlers (Matthieu Dorier)
   8.  Hanging code in MPI_Comm_Spawn (Tim Gallagher)
   9. Re:  Hanging code in MPI_Comm_Spawn (Reuti)
  10. Re:  Support for MIC in mpich2-1.5 (John Fettig)


----------------------------------------------------------------------

Message: 1
Date: Wed, 28 Nov 2012 07:41:46 -0600
From: Pavan Balaji <balaji at mcs.anl.gov>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] How to specify number of cores for each
        process
Message-ID: <50B6149A.6030903 at mcs.anl.gov>
Content-Type: text/plain; charset=ISO-8859-1


There are two problems in your usage below.

1. I'm guessing that your host file specifies only one core for each host.  Something like:

host1
host2

Did you look through the mpiexec usage document that tells you how to specify multiple cores on each host?

http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager

You should do something like:

host1:16
host2:16

(please read the documentation on the above link for more information).
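
For example, a rough sketch (the file name is a placeholder): with those two lines saved in a file called "hosts", launching with

mpiexec -f hosts -configfile <filename>

lets Hydra start up to 16 processes on each host while keeping the per-process working directories from your config file.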

2. There's an option called -binding auto.  Did you mean -binding none?

 -- Pavan

On 11/27/2012 09:37 PM US Central Time, Zachary Stanko wrote:
> Hello,
>
> I am running an MPI program on a machine with two 16-core processors
> yet, no matter what configuration I use with mpiexec, I am only
> receiving 4 simultaneous processes.
> I would like to maximize this machine's potential and run 8 or 16
> processes. I have tried all of the channel selections and tried the
> -binding option.
> Since I need to specify a different working directory for each process
> (due to many output files with naming conflicts), I have a config file
> with 8 lines of the form:
>
> -n 1 -binding auto -dir <mydir01> <myprog> <inpfile>
> -n 1 -binding auto -dir <mydir02> <myprog> <inpfile>
> ...etc
>
> and I run:
>
> mpiexec -configfile <filename>
>
> Am I doing something wrong, or does MPICH just know best and cannot
> run more than 4 jobs at a time on this system?
>
> Thanks,
>
> Zak
>
>

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji


------------------------------

Message: 2
Date: Wed, 28 Nov 2012 07:43:01 -0600
From: Pavan Balaji <balaji at mcs.anl.gov>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] How to specify number of cores for each
        process
Message-ID: <50B614E5.7020801 at mcs.anl.gov>
Content-Type: text/plain; charset=ISO-8859-1


As Zachary explained below, each process is in a different directory, so
he'll need to use an MPMD launch and can't do -n 16 directly.

 -- Pavan

On 11/27/2012 09:42 PM US Central Time, Jeff Hammond wrote:
> Did you try "mpiexec -n 16 ..."?
>
> Sent from my iPhone
>
> On Nov 27, 2012, at 10:37 PM, Zachary Stanko <zstanko at usgs.gov
> <mailto:zstanko at usgs.gov>> wrote:
>
>> Hello,
>>
>> I am running an MPI program on a machine with two 16-core processors
>> yet, no matter what configuration I use with mpiexec, I am only
>> receiving 4 simultaneous processes.
>> I would like to maximize this machine's potential and run 8 or 16
>> processes. I have tried all of the channel selections and tried the
>> -binding option.
>> Since I need to specify a different working directory for each process
>> (due to many output files with naming conflicts), I have a config file
>> with 8 lines of the form:
>>
>> -n 1 -binding auto -dir <mydir01> <myprog> <inpfile>
>> -n 1 -binding auto -dir <mydir02> <myprog> <inpfile>
>> ...etc
>>
>> and I run:
>>
>> mpiexec -configfile <filename>
>>
>> Am I doing something wrong, or does MPICH just know best and cannot
>> run more than 4 jobs at a time on this system?
>>
>> Thanks,
>>
>> Zak

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji


------------------------------

Message: 3
Date: Wed, 28 Nov 2012 07:44:52 -0600
From: Pavan Balaji <balaji at mcs.anl.gov>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] How to specify number of cores for each
        process
Message-ID: <50B61554.6090209 at mcs.anl.gov>
Content-Type: text/plain; charset=ISO-8859-1


On 11/28/2012 07:41 AM US Central Time, Pavan Balaji wrote:
> 2. There's option called -binding auto.  Did you mean -binding none?

Gah.  I meant, there's *no* option called -binding auto.

 -- Pavan

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji


------------------------------

Message: 4
Date: Wed, 28 Nov 2012 15:54:59 +0100 (CET)
From: Matthieu Dorier <matthieu.dorier at irisa.fr>
To: discuss at mpich.org
Subject: [mpich-discuss] MPI and signal handlers
Message-ID: <665045738.7499664.1354114498957.JavaMail.root at irisa.fr>
Content-Type: text/plain; charset="iso-8859-1"

Hello,


I read in the MPI-3 standard that allowing MPI calls within signal handlers is implementation-dependent (I guess it's the same in previous standards).
Does mpich allow MPI calls within signal handlers?


Also, which signals are used by mpich? Does mpich use SIGSEGV?


Thank you,


Matthieu Dorier
PhD student at ENS Cachan Brittany and IRISA
http://people.irisa.fr/Matthieu.Dorier

------------------------------

Message: 5
Date: Wed, 28 Nov 2012 10:00:49 -0600
From: Darius Buntinas <buntinas at mcs.anl.gov>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] using MPICH_FAILED_PROCESSES
Message-ID: <C7076D99-DD62-4AE8-BCA1-E78238CAD17F at mcs.anl.gov>
Content-Type: text/plain; charset=us-ascii

Hi Zhaoming,

Leave the default error handler set so that an error gets printed, then try it again and send us the output.

-d


On Nov 27, 2012, at 3:27 PM, Ma, Zhaoming wrote:

> I am trying to use the following to catch failed processes,
>
>       MPI::COMM_WORLD.Get_attr(MPICH_ATTR_FAILED_PROCESSES, void*)
>
> I am using MPICH2 1.4 and g++. Someone posted a C program that does this successfully. The link is http://hi.baidu.com/ejoywx/item/74233ccb9dd20815515058ae. However, I am having trouble getting my C++ test program (attached) to work. The code compiles fine but produces the following runtime error,
>
>       terminate called after throwing an instance of 'MPI::Exception'
>       APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
>
> If I comment out the line "MPI::COMM_WORLD.Get_attr(MPICH_ATTR_FAILED_PROCESSES, void*)" or put it in a try/catch block, the test program runs fine.
>
> Thank you for your help.
>
> Zhaoming
> <MPI_test.cpp.txt>



------------------------------

Message: 6
Date: Wed, 28 Nov 2012 10:22:46 -0600
From: Darius Buntinas <buntinas at mcs.anl.gov>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] MPI and signal handlers
Message-ID: <05CC506C-EBC3-40E2-8D65-F88220A36561 at mcs.anl.gov>
Content-Type: text/plain; charset=iso-8859-1

Sorry, MPICH doesn't allow MPI calls from signal handlers.
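
The usual workaround (a generic POSIX/MPI sketch, not an MPICH-specific interface) is to have the handler do nothing but set a flag, and to make any MPI calls later from normal program context:

#include <mpi.h>
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_signal = 0;

/* The handler only records that the signal arrived; no MPI calls here. */
static void handler(int sig)
{
    (void)sig;
    got_signal = 1;
}

int main(int argc, char **argv)
{
    int local, global;

    MPI_Init(&argc, &argv);
    signal(SIGUSR1, handler);

    /* ... application work ... */

    /* Back in normal program flow: now it is fine to make MPI calls. */
    local = got_signal ? 1 : 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    if (global)
        printf("at least one rank received the signal\n");

    MPI_Finalize();
    return 0;
}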

-d


On Nov 28, 2012, at 8:54 AM, Matthieu Dorier wrote:

> Hello,
>
> I read in the MPI-3 standard that allowing MPI calls within signal handlers is implementation dependent  (I guess it's the same in previous standards).
> Does mpich allow MPI calls within signal handlers?
>
> Also which signals are used by mpich? Does mpich use SIGSEGV?
>
> Thank you,
>
> Matthieu Dorier
> PhD student at ENS Cachan Brittany and IRISA
> http://people.irisa.fr/Matthieu.Dorier



------------------------------

Message: 7
Date: Wed, 28 Nov 2012 18:05:46 +0100 (CET)
From: Matthieu Dorier <matthieu.dorier at irisa.fr>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] MPI and signal handlers
Message-ID: <2009043130.7564091.1354122346261.JavaMail.root at irisa.fr>
Content-Type: text/plain; charset=ISO-8859-1

Alright, thanks.

Matthieu

----- Original Message -----
> From: "Darius Buntinas" <buntinas at mcs.anl.gov>
> To: discuss at mpich.org
> Sent: Wednesday, 28 November 2012 17:22:46
> Subject: Re: [mpich-discuss] MPI and signal handlers
>
> Sorry, MPICH doesn't allow MPI calls from signal handlers.
>
> -d
>
>
> On Nov 28, 2012, at 8:54 AM, Matthieu Dorier wrote:
>
> > Hello,
> >
> > I read in the MPI-3 standard that allowing MPI calls within signal
> > handlers is implementation dependent  (I guess it's the same in
> > previous standards).
> > Does mpich allow MPI calls within signal handlers?
> >
> > Also which signals are used by mpich? Does mpich use SIGSEGV?
> >
> > Thank you,
> >
> > Matthieu Dorier
> > PhD student at ENS Cachan Brittany and IRISA
> > http://people.irisa.fr/Matthieu.Dorier


------------------------------

Message: 8
Date: Thu, 29 Nov 2012 14:18:10 -0500 (EST)
From: Tim Gallagher <tim.gallagher at gatech.edu>
To: discuss at mpich.org
Subject: [mpich-discuss] Hanging code in MPI_Comm_Spawn
Message-ID:
        <1418759320.4762388.1354216690625.JavaMail.root at mail.gatech.edu>
Content-Type: text/plain; charset=utf-8

Hi,

I have a Fortran application that uses MPI_Comm_Spawn. When I run it compiled with gfortran/gcc (and mpich compiled with gfortran/gcc), it just hangs forever. When I run it compiled with ifort/icc (and mpich compiled with ifort/icc), it runs correctly.

When I run it in GDB with the GNU compiler suite and interrupt it once it's stuck, it tells me:

Program received signal SIGINT, Interrupt.
0x00007ffff7800fc9 in pmpi_comm_spawn__ () from /opt/mpi/mpich/gnu/system/lib64/libmpich.so.8

and this is with both mpich 1.4 and 1.5.

Does anybody have any suggestions for what could be going on? It really just sits there, doing absolutely nothing, forever. No timeouts, no errors or warnings. I'm not ruling out a bug in my code, but I don't even know where to begin.

Thanks,

Tim
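
One way to narrow this down is to take the application out of the picture. The rough C sketch below ("./child" stands for any small MPI program) does nothing but spawn; if it also hangs with the gfortran/gcc build, the problem is more likely in the MPICH build or environment than in the application code.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);

    /* Spawn two copies of a trivial MPI program; "./child" is a placeholder. */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    printf("spawn returned\n");

    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
}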


------------------------------

Message: 9
Date: Thu, 29 Nov 2012 22:24:39 +0100
From: Reuti <reuti at staff.uni-marburg.de>
To: tim.gallagher at gatech.edu, discuss at mpich.org
Subject: Re: [mpich-discuss] Hanging code in MPI_Comm_Spawn
Message-ID:
        <7BA788BF-4718-4D5A-8B34-A6C3222381CB at staff.uni-marburg.de>
Content-Type: text/plain; charset=us-ascii

Am 29.11.2012 um 20:18 schrieb Tim Gallagher:

> I have a Fortran application that uses MPI_Comm_Spawn. When I run with it compiled using gfortran/gcc (and mpich compiled with gfortran/gcc), it just hangs forever. When I run with it compiled using ifort/icc (and mpich compiled with ifort/icc), it runs correctly.
>
> When I run it in GDB with the GNU compiler suite and interrupt it once it's stuck, it tells me:
>
> Program received signal SIGINT, Interrupt.
> 0x00007ffff7800fc9 in pmpi_comm_spawn__ () from /opt/mpi/mpich/gnu/system/lib64/libmpich.so.8
>
> and this is with both mpich 1.4 and 1.5.

Did you compile MPICH2 on your own and set LD_LIBRARY_PATH correctly? In 1.4.1p1 I have only libmpich.so.3.3, so I assume libmpich.so.8 is one from your own installation? IIRC, by default MPICH2 generates static libs.

-- Reuti


> Does anybody have any suggestions for what could be going on? It really just sits there, doing absolutely nothing, forever. No timeouts, no errors or warnings. I'm not ruling out a bug in my codes, but I don't even know where to begin.
>
> Thanks,
>
> Tim



------------------------------

Message: 10
Date: Fri, 30 Nov 2012 17:36:17 -0500
From: John Fettig <john.fettig at gmail.com>
To: Pavan Balaji <balaji at mcs.anl.gov>
Cc: mpich-discuss <mpich-discuss at mcs.anl.gov>
Subject: Re: [mpich-discuss] Support for MIC in mpich2-1.5
Message-ID:
        <CAD8Bu=p4FsorbJA6yZ4=O9CnBq=qt5t9AGfPEnFr5fVO0C-ANw at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Any thoughts about this?

Regards,
John


On Tue, Nov 13, 2012 at 5:07 PM, John Fettig <john.fettig at gmail.com> wrote:

> On Mon, Nov 5, 2012 at 9:37 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
>>
>> On 11/05/12 13:12, John Fettig wrote:
>>
>>> I believe I have a working build, I'll append my cross file to the end
>>> of this email if anybody else wants to try it.
>>>
>>
>> Thanks!
>>
>>
>>  I have a followup question:  is there any support for launching jobs
>>> that use both the MIC and the host CPU?
>>>
>>
>> Yes.  Once you have set up MPICH on both the host and MIC, you can launch
>> jobs across them.
>>
>> If you didn't pass any configure option, it'll use TCP/IP, which is very
>> slow.  If you configure with --with-device=ch3:nemesis:scif, it'll use the
>> SCIF protocol, which is much faster.
>>
>
> I compiled examples/hellow.c for both the MIC and the host CPU, and copied
> it to the card.  This seems to work:
>
> $ mpiexec -hosts 172.31.1.1:1,172.31.1.254:1 -n 1 ./hellow.mic : -n 1 ./hellow
> Hello world from process 1 of 2
> Hello world from process 0 of 2
>
> However, if I try to run more processes it crashes:
>
> $ mpiexec -hosts 172.31.1.1:3,172.31.1.254:3 -n 3 ./hellow.mic : -n 3 ./hellow
> Hello world from process 4 of 6
> Hello world from process 0 of 6
> Hello world from process 3 of 6
> Hello world from process 1 of 6
>  0:  3: 00000033: 00000042: readv err 0
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was
> waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(175)........:
> state_commrdy_handler(138)........:
> MPID_nem_scif_recv_handler(115)...: Communication error with rank 3
> MPID_nem_scif_recv_handler(35)....: scif_scif_read failed (scif_scif_read
> failed with error 'Success')
>  1:  3: 00000033: 00000042: readv err 0
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was
> waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(175)........:
> state_commrdy_handler(138)........:
> MPID_nem_scif_recv_handler(115)...: Communication error with rank 3
> MPID_nem_scif_recv_handler(35)....: scif_scif_read failed (scif_scif_read
> failed with error 'Success')
> Hello world from process 5 of 6
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was
> waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(184)........: poll of socket fds failed
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(293).................: MPI_Finalize failed
> MPI_Finalize(213).................:
> MPID_Finalize(117)................:
> MPIDI_CH3U_VC_WaitForClose(385)...: an error occurred while the device was
> waiting for all open connections to close
> MPIDI_CH3I_Progress(367)..........:
> MPID_nem_mpich2_blocking_recv(904):
> state_commrdy_handler(184)........: poll of socket fds failed
>
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 1
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> ===================================================================================
> [proxy:0:0 at mic0.local] HYD_pmcd_pmip_control_cmd_cb
> (./pm/pmiserv/pmip_cb.c:883): assert (!closed) failed
> [proxy:0:0 at mic0.local] HYDT_dmxu_poll_wait_for_event
> (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at mic0.local] main (./pm/pmiserv/pmip.c:210): demux engine error
> waiting for event
> [mpiexec at host] HYDT_bscu_wait_for_completion
> (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
> badly; aborting
> [mpiexec at host] HYDT_bsci_wait_for_completion
> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
> [mpiexec at host] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for
> completion
> [mpiexec at host] main (./ui/mpich/mpiexec.c:325): process manager error
> waiting for completion
>
> Any ideas?
>
> John
>

------------------------------

_______________________________________________
discuss mailing list
discuss at mpich.org
https://lists.mpich.org/mailman/listinfo/discuss

End of discuss Digest, Vol 1, Issue 18
**************************************


