[mpich-discuss] Error Running MPICH for Photochemical Modeling

Seo, Sangmin sseo at anl.gov
Wed Sep 17 12:56:32 CDT 2014


I have looked into your log file, but it seems that the real error happened in another process, not rank 0. The socket of rank 0 was closed because the other process died. Can you send us the log files of the other processes?

Can you also send us the execution result with the "-print-all-exitcodes" option? For example:
mpiexec -machinefile nodes -np $NUMPROCS -print-all-exitcodes $EXEC -mpich-dbg=file -mpich-dbg-class=all -mpich-dbg-level=verbose
The -print-all-exitcodes option will show the exit code of each process.

Regards,
Sangmin


On Sep 17, 2014, at 12:33 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:

Any luck finding a solution to the error in the log file I provided?


Thank You
Abhishek

………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant


From: Abhishek Bhat
Sent: Tuesday, September 16, 2014 11:36 AM
To: <discuss at mpich.org>
Cc: 'sseo at anl.gov'
Subject: RE: [mpich-discuss] Error Running MPICH for Photochemical Modeling

Sangmin,

Sorry for the confusion, I was looking in the wrong location.  Please find attached the excerpt of debug file 0 where the error occurs.


Thank You
Abhishek

………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant


From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Tuesday, September 16, 2014 11:12 AM
To: Abhishek Bhat
Cc: <discuss at mpich.org>
Subject: Re: [mpich-discuss] Error Running MPICH for Photochemical Modeling

The log files should be in the directory where you executed the application. Did you run the application with the debug options, -mpich-dbg=file -mpich-dbg-class=all -mpich-dbg-level=verbose?

— Sangmin


On Sep 16, 2014, at 11:05 AM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:

Sangmin,

I configured MPICH with the revised code and included the additional options as described.  Where should I be looking for the log files? I cannot see any files generated.

Thank You
Abhishek

………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant


From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Monday, September 15, 2014 12:43 PM
To: Abhishek Bhat
Cc: <discuss at mpich.org>
Subject: Re: [mpich-discuss] Error Running MPICH for Photochemical Modeling

Sorry, "./make" should have been "make".

— Sangmin

On Sep 15, 2014, at 12:40 PM, Seo, Sangmin <sseo at anl.gov> wrote:


Did you run dmesg on the node that executed the failed rank? I don't see any error message in your log file.

Anyway, let's try with MPICH debug messages. As described in https://wiki.mpich.org/mpich/index.php/Debug_Event_Logging, first configure MPICH with "--enable-g=dbg,log" to enable event logging and build it. For example,
$ ./configure --prefix=$INSTALL_DIR --enable-g=dbg,log
$ ./make -j8
$ ./make install

Then, execute your application with the additional options "-mpich-dbg=file -mpich-dbg-class=all -mpich-dbg-level=verbose". With a recent MPICH, you don't need to use mpdboot. Just execute your application with mpiexec, like:
mpiexec -machinefile nodes -np $NUMPROCS $EXEC -mpich-dbg=file -mpich-dbg-class=all -mpich-dbg-level=verbose

This will generate log files, such as dbg0-xxxx.log, dbg1-xxxx.log, etc. Since the log files may be large, it may be difficult to send them to us. If so, please take a look at each file, find the error messages, and send those messages to us.
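If the logs are too large to read end to end, a quick scan for error-looking lines (a sketch, assuming the files follow the dbgN-xxxx.log naming above and sit in the working directory) could look like:

```shell
# Print file:line for anything that looks like an error in the MPICH debug logs.
grep -n -i -E 'error|fail|abort|socket closed' dbg*.log
```

Using -l instead of -n would list only the files that contain matches, which helps identify which rank hit the error first.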

Also, if you can reproduce the same error with a small program, that will really help us figure out your problem. Could you write a simplified code and send it to us?

Regards,
Sangmin


On Sep 15, 2014, at 10:58 AM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:


Sangmin,

Please see attached the logs from dmesg.  My apologies, but I am not a computer expert, so it's all Greek to me. Can you please see if you can find any error or reason for the failure?


Thank You
Abhishek

………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant


From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Monday, September 15, 2014 9:21 AM
To: Abhishek Bhat
Cc: <discuss at mpich.org>
Subject: Re: [mpich-discuss] Error Running MPICH for Photochemical Modeling

After you execute your application, can you run dmesg on the node of rank 1, which was killed by signal 9? You can find the reason the process was killed at the end of the dmesg output, e.g., out of memory.
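One way to do this (a sketch; the capture file name dmesg.txt is just illustrative) is to save the kernel log on the failing node and grep it for OOM-killer messages:

```shell
# Save the kernel ring buffer on the node that ran rank 1, then search it
# for out-of-memory evidence ("|| true" because dmesg may need privileges).
dmesg > dmesg.txt 2>/dev/null || true
grep -i -E 'out of memory|oom|killed process' dmesg.txt
```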

— Sangmin


On Sep 14, 2014, at 12:37 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:



Because the application works for less intensive runs and fails for more intensive runs, it is likely that the application is requesting too many resources.  When/where should I run ulimit -a and dmesg, after I get the error?  If that is true, is there any way to change the environment in MPI to increase the capacity so that the increased resources can be accommodated?

If I run it in a new terminal, here is what I get:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 250598
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

In my job script, I try to set the stack size to unlimited, but I guess it is not working.
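One thing worth checking (an assumption about the setup, since the job script quoted later in this thread uses csh syntax): ulimit is a bash/sh builtin and silently does nothing inside a csh script, where the equivalent command is limit. Also, the limit must be raised on every node, not just the master, because mpiexec-launched ranks inherit the limits of the remote login shell. A minimal bash check:

```shell
# Try to raise the soft stack limit; this can fail if the hard limit is finite.
ulimit -s unlimited 2>/dev/null || true
# Print the resulting soft stack limit to verify the change took effect.
ulimit -s
# The csh equivalent, for use inside the csh job script itself, would be:
#   limit stacksize unlimited
```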

Let me know.  Thank you for all the help.
Abhishek
………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant


From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Sunday, September 14, 2014 11:16 AM
To: <discuss at mpich.org>
Subject: Re: [mpich-discuss] Error Running MPICH for Photochemical Modeling

Abhishek,

Signal 9 can be caused by many things, e.g., exceeding CPU time, running out of memory, etc., but it is mostly because the application requests too many resources. You can check the environment settings with ulimit -a. And you may find some information about your error in the dmesg output.

Thanks,
Sangmin


On Sep 12, 2014, at 5:51 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:




Sangmin,

I updated to MPICH 3 and am getting the following error:

Fatal error in MPI_Recv: A process has failed, error stack:
MPI_Recv(187).............: MPI_Recv(buf=0x7fff93840c30, count=644490, MPI_REAL, src=1, tag=14131, MPI_COMM_WORLD, status=0x7fff94444f20) failed
dequeue_and_set_error(865): Communication error with rank 1
rank 1 in job 1  dfw-camx_55000   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

Same situation: runs succeed for smaller-resource jobs and for up to 7 processes, and fail with more than 7.  Here is the mpich command I am using in my job file:

cat << ieof > nodes
dfw-camx:1
dfw-camx-n1:1
dfw-camx-n2:1
dfw-camx-n3:1
dfw-camx-n4:1
dfw-camx-n5:1
dfw-camx-n6:1
dfw-camx-n7:1
ieof
set NUMPROCS = 8
set RING = `wc -l nodes | awk '{print $1}'`
mpdboot -n $RING -f nodes --verbose

if( ! { mpiexec -machinefile nodes -np $NUMPROCS $EXEC } ) then
   mpdallexit
   exit
endif


For a successful run, NUMPROCS has to be <= 7.

Any help is much appreciated.

Thank You
Abhishek
………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant


From: Seo, Sangmin [mailto:sseo at anl.gov]
Sent: Friday, September 12, 2014 1:11 PM
To: <discuss at mpich.org>
Subject: Re: [mpich-discuss] Error Running MPICH for Photochemical Modeling

Hi Abhishek,

Can you try with the recent MPICH release to see if the same error happens? You can download the recent release, 3.1.2, from http://www.mpich.org/downloads/.

Thanks,
Sangmin


On Sep 12, 2014, at 12:59 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:





I am running a photochemical model on a Linux cluster (CentOS, 64-bit) with 1 master and 8 slave nodes, each with a quad-core Intel i7.  I have two scenarios: in the first, I run a less data-intensive job on all 8 nodes (NUMPROCS = 9), and the run goes fine.  When I run the same configuration for a more intensive job, I get the following error:

Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(187).....................: MPI_Recv(buf=0x7fff989d53b0, count=644490, MPI_REAL, src=1, tag=14131, MPI_COMM_WORLD, status=0x7fff995d96a0) failed
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
rank 1 in job 1  dfw-camx_55000   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

If I run the program with fewer processes (NUMPROCS less than 7), the run goes fine.

It appears that rank 1 (my first node) is causing the collective abort of all ranks, but I could not identify why.  I tried the following solutions:

1. Increased master memory to 32 GB
2. Increased all nodes' memory to 32 GB
3. Moved rank 1 to a different node in the parallel run

In all cases, I get this error.  Surprisingly, for smaller (less data-intensive) runs, I do not get this error even if I increase NUMPROCS to 32 processes.

Any help will be highly appreciated.

I am running MPICH 1.4.

Thank You
Abhishek
………………………………………………………………………………………………….
Abhishek Bhat, PhD, EPI,
Senior Consultant

Trinity Consultants
12770 Merit Drive, Suite 900  |  Dallas, Texas 75251
Office:  972-661-8100|  Mobile:  806-281-7617
Email:  abhat at trinityconsultants.com  |  LinkedIn: www.linkedin.com/in/abhattrinityconsultants

Stay current on environmental issues.  Subscribe today (http://www.trinityconsultants.com/Subscribe/) to receive Trinity's free Environmental Quarterly (http://www.trinityconsultants.com/EnvironmentalQuarterly/).
Learn about Trinity's courses (http://www.trinityconsultants.com/Training/) for environmental professionals.



_________________________________________________________________________

The information transmitted is intended only for the person or entity to
which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipient is prohibited. If you received
this in error, please contact the sender and delete the material from any
computer.
_________________________________________________________________________
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss






<dmesg3.txt><messages.txt>







