[mpich-discuss] Error Running MPICH for Photochemical Modeling

Gus Correa gus at ldeo.columbia.edu
Mon Sep 15 13:27:47 CDT 2014


Hi Abhishek

1) Is your run script using the mpd (mpdboot, etc.) launcher?
I think mpd has been mostly phased out; hydra is the launcher used now:
https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
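
For reference, with hydra the mpdboot/mpdallexit steps go away and you
point mpiexec directly at a host file.  A minimal sketch, assuming your
"nodes" file keeps the same host:nprocs format and that $NUMPROCS and
$EXEC are set as in your job script:

mpiexec -f nodes -np $NUMPROCS $EXEC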

2) Many programs require a large stack, although I don't know if
yours does.
The default stack size limit on Linux is usually too small.
In any case, your "error stack" message is suggestive of that.
You could add to your script, before mpiexec:

limit stacksize unlimited  (if using csh/tcsh)

or

ulimit -s unlimited (sh/bash syntax)

The maximum number of open files (1024 now) may also need to be increased,
depending on how much IO the program does.
If you were using InfiniBand (probably not, since your error stack shows TCP),
the locked memory limit should also be set to unlimited.
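
If you want to raise those as well, these are the kind of lines you could
put in the job script before mpiexec (a sketch; the 4096 value is just a
guess, and the hard limits on your nodes cap what you can actually set):

limit descriptors 4096        (csh/tcsh)
limit memorylocked unlimited  (csh/tcsh)

or

ulimit -n 4096       (sh/bash)
ulimit -l unlimited  (sh/bash)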

If the system doesn't let you change the stacksize, you may need to ask 
the system administrator to change /etc/security/limits.conf to allow 
the change.
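
The entries the admin would add look roughly like this (a sketch; "*"
applies to all users, and the numeric values are placeholders):

*    soft    stack      unlimited
*    hard    stack      unlimited
*    soft    nofile     4096
*    hard    nofile     4096
*    soft    memlock    unlimited
*    hard    memlock    unlimited

A new login is usually needed before the new limits show up in ulimit -a.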

My two cents,
Gus Correa


On 09/15/2014 11:58 AM, Abhishek Bhat wrote:
> Sangmin,
>
> Please see attached the logs from dmesg.  I apologize, but I am not a
> computer expert, so it's all Greek to me.  Can you please see if you can
> find any error or reason for the failure?
>
> Thank You
>
> Abhishek
>
> ………………………………………………………………………………………………….
>
> Abhishek Bhat, PhD, EPI
> Senior Consultant
>
> From: Seo, Sangmin [mailto:sseo at anl.gov]
> Sent: Monday, September 15, 2014 9:21 AM
> To: Abhishek Bhat
> Cc: <discuss at mpich.org>
> Subject: Re: [mpich-discuss] Error Running MPICH for Photochemical
> Modeling
>
> Can you run dmesg on the node of rank 1, which was killed by signal 9,
> after you execute your application? You can find the reason the process
> was killed at the end of the dmesg output, e.g., out of memory.
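>
> For example (a sketch, assuming rank 1 landed on dfw-camx-n1, the second
> entry in your machine file), something like
>
>     ssh dfw-camx-n1 "dmesg | tail -n 40"
>     ssh dfw-camx-n1 "dmesg | grep -i 'out of memory'"
>
> run right after the failure should show whether the kernel OOM killer
> terminated the rank.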
>
> — Sangmin
>
> On Sep 14, 2014, at 12:37 PM, Abhishek Bhat
> <abhat at trinityconsultants.com <mailto:abhat at trinityconsultants.com>> wrote:
>
>
>
>     Because the application works for less intensive runs and fails for
>     more intensive runs, it is likely that the application is requesting
>     too many resources.  When/where should I run ulimit -a and dmesg,
>     after I get the error?  If that is true, is there any way to change
>     the environment in MPI to increase the capacity so that the
>     increased resource demand can be accommodated?
>
>     If I run it in a new terminal, here is what I get:
>
>     core file size          (blocks, -c) 0
>     data seg size           (kbytes, -d) unlimited
>     scheduling priority             (-e) 0
>     file size               (blocks, -f) unlimited
>     pending signals                 (-i) 250598
>     max locked memory       (kbytes, -l) 64
>     max memory size         (kbytes, -m) unlimited
>     open files                      (-n) 1024
>     pipe size            (512 bytes, -p) 8
>     POSIX message queues     (bytes, -q) 819200
>     real-time priority              (-r) 0
>     stack size              (kbytes, -s) 10240
>     cpu time               (seconds, -t) unlimited
>     max user processes              (-u) 1024
>     virtual memory          (kbytes, -v) unlimited
>     file locks                      (-x) unlimited
>
>     In my job script, I try to set the stack size to unlimited, but I
>     guess it is not working.
>
>     Let me know.  Thank you for all the help.
>
>     Abhishek
>
>     ………………………………………………………………………………………………….
>
>     Abhishek Bhat, PhD, EPI
>     Senior Consultant
>
>     From: Seo, Sangmin [mailto:sseo at anl.gov]
>     Sent: Sunday, September 14, 2014 11:16 AM
>     To: <discuss at mpich.org <mailto:discuss at mpich.org>>
>     Subject: Re: [mpich-discuss] Error Running MPICH for Photochemical
>     Modeling
>
>     Abhishek,
>
>     Signal 9 can have many causes, e.g., exceeding the CPU time limit or
>     running out of memory, but it is mostly because the application
>     requests too many resources. You can check the environment settings
>     with ulimit -a, and you may find some information about your error
>     in the dmesg output.
>
>     Thanks,
>
>     Sangmin
>
>     On Sep 12, 2014, at 5:51 PM, Abhishek Bhat
>     <abhat at trinityconsultants.com <mailto:abhat at trinityconsultants.com>>
>     wrote:
>
>
>
>
>         Sangmin.
>
>         I updated to MPICH 3 and am getting the following error:
>
>         Fatal error in MPI_Recv: A process has failed, error stack:
>         MPI_Recv(187).............: MPI_Recv(buf=0x7fff93840c30,
>         count=644490, MPI_REAL, src=1, tag=14131, MPI_COMM_WORLD,
>         status=0x7fff94444f20) failed
>         dequeue_and_set_error(865): Communication error with rank 1
>         rank 1 in job 1  dfw-camx_55000   caused collective abort of all
>         ranks
>            exit status of rank 1: killed by signal 9
>
>         Same situation: successful runs for less resource-intensive cases
>         and for up to 7 processes, errors at more than 7.  Here is the
>         mpich command I am using to run from my job file…
>
>         cat << ieof > nodes
>         dfw-camx:1
>         dfw-camx-n1:1
>         dfw-camx-n2:1
>         dfw-camx-n3:1
>         dfw-camx-n4:1
>         dfw-camx-n5:1
>         dfw-camx-n6:1
>         dfw-camx-n7:1
>         ieof
>         set NUMPROCS = 8
>         set RING = `wc -l nodes | awk '{print $1}'`
>         mpdboot -n $RING -f nodes --verbose
>         if ( ! { mpiexec -machinefile nodes -np $NUMPROCS $EXEC } ) then
>             mpdallexit
>             exit
>         endif
>
>         For a successful run, NUMPROCS has to be <= 7.
>
>         Any help is much appreciated.
>
>         Thank You
>
>         Abhishek
>
>         ………………………………………………………………………………………………….
>
>         Abhishek Bhat, PhD, EPI
>         Senior Consultant
>
>         From: Seo, Sangmin [mailto:sseo at anl.gov]
>         Sent: Friday, September 12, 2014 1:11 PM
>         To: <discuss at mpich.org <mailto:discuss at mpich.org>>
>         Subject: Re: [mpich-discuss] Error Running MPICH for
>         Photochemical Modeling
>
>         Hi Abhishek,
>
>         Can you try with the most recent MPICH release to see if the same
>         error happens? You can download the latest release, 3.1.2, from
>         http://www.mpich.org/downloads/.
>
>         Thanks,
>
>         Sangmin
>
>         On Sep 12, 2014, at 12:59 PM, Abhishek Bhat
>         <abhat at trinityconsultants.com
>         <mailto:abhat at trinityconsultants.com>> wrote:
>
>
>
>
>
>             I am running a photochemical model on a Linux cluster
>             (CentOS, 64-bit) with 1 master and 8 slave nodes, each with
>             a quad-core Intel i7.  I have two scenarios: in the first, I
>             run a less data-intensive case on all 8 nodes (NUMPROCS = 9)
>             and the run goes fine.  When I run the same configuration
>             for a more intensive case, I get the following error.
>
>             Fatal error in MPI_Recv: Other MPI error, error stack:
>             MPI_Recv(187).....................:
>             MPI_Recv(buf=0x7fff989d53b0, count=644490, MPI_REAL, src=1,
>             tag=14131, MPI_COMM_WORLD, status=0x7fff995d96a0) failed
>             MPIDI_CH3I_Progress(150)..........:
>             MPID_nem_mpich2_blocking_recv(948):
>             MPID_nem_tcp_connpoll(1720).......:
>             state_commrdy_handler(1556).......:
>             MPID_nem_tcp_recv_handler(1446)...: socket closed
>             rank 1 in job 1  dfw-camx_55000   caused collective abort of
>             all ranks
>                exit status of rank 1: killed by signal 9
>
>             If I run the program with fewer nodes (NUMPROCS smaller
>             than 7), the run goes fine.
>
>             It appears that rank 1 (my first node) is causing the
>             collective abort of all ranks, but I could not identify why.
>             I tried the following solutions:
>
>             1. Increased master memory to 32 GB
>             2. Increased all nodes' memory to 32 GB
>             3. Exchanged rank 1 to a different node in the parallel run.
>
>             In all cases, I am getting this error.  Surprisingly, when I
>             run smaller (less data-intensive) cases, I do not get this
>             error even if I increase NUMPROCS to 32 processes.
>
>             Any help will be highly appreciated.
>
>             I am running mpich 1.4
>
>             Thank You
>             Abhishek
>
>             ………………………………………………………………………………………………….
>
>             Abhishek Bhat, PhD, EPI
>             Senior Consultant
>
>             Trinity Consultants
>
>             12770 Merit Drive, Suite 900  |  Dallas, Texas 75251
>
>             Office: 972-661-8100  |  Mobile: 806-281-7617
>
>             Email: abhat at trinityconsultants.com
>             <mailto:abhat at trinityconsultants.com> | LinkedIn:
>             www.linkedin.com/in/abhattrinityconsultants
>             <http://www.linkedin.com/in/abhattrinityconsultants>
>



