[mpich-discuss] Error Running MPICH for Photochemical Modeling

Abhishek Bhat abhat at trinityconsultants.com
Fri Sep 12 12:59:53 CDT 2014


I am running a photochemical model on a Linux cluster (CentOS, 64-bit) with 1 master and 8 slave nodes, each with a quad-core Intel i7.  I have two scenarios.  In the first, I run a less data-intensive case on all 8 nodes (NUMPROCS = 9) and the run completes fine.  When I run the same configuration for a more data-intensive case, I get the following error.

Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(187).....................: MPI_Recv(buf=0x7fff989d53b0, count=644490, MPI_REAL, src=1, tag=14131, MPI_COMM_WORLD, status=0x7fff995d96a0) failed
MPIDI_CH3I_Progress(150)..........:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720).......:
state_commrdy_handler(1556).......:
MPID_nem_tcp_recv_handler(1446)...: socket closed
rank 1 in job 1  dfw-camx_55000   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
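
For reference, the two numbers in the stack above can be decoded with a short script. This is only a sketch: the two strings are copied from the error output, the helper names `recv_size_bytes` and `signal_name` are made up for illustration, and it assumes MPI_REAL is a 4-byte Fortran REAL (a build that promotes REAL to 8 bytes would double the size).

```python
import re
import signal

# Lines copied verbatim from the error output above.
RECV_LINE = (
    "MPI_Recv(187).....................: MPI_Recv(buf=0x7fff989d53b0, "
    "count=644490, MPI_REAL, src=1, tag=14131, MPI_COMM_WORLD, "
    "status=0x7fff995d96a0) failed"
)
EXIT_LINE = "  exit status of rank 1: killed by signal 9"

# Assumption: MPI_REAL corresponds to a 4-byte Fortran REAL.
BYTES_PER_REAL = 4


def recv_size_bytes(line, bytes_per_elem=BYTES_PER_REAL):
    """Return the payload size of the failed receive, in bytes."""
    count = int(re.search(r"count=(\d+)", line).group(1))
    return count * bytes_per_elem


def signal_name(line):
    """Return the symbolic name of the signal reported in the exit line."""
    number = int(re.search(r"killed by signal (\d+)", line).group(1))
    return signal.Signals(number).name


if __name__ == "__main__":
    size = recv_size_bytes(RECV_LINE)
    print(f"receive payload: {size} bytes ({size / 2**20:.2f} MiB)")
    print(f"rank 1 was terminated by: {signal_name(EXIT_LINE)}")
```

Under that assumption the failed receive is only about 2.5 MiB, and signal 9 is SIGKILL, which a process cannot catch or handle, so rank 1 was terminated from outside the program rather than exiting on its own. The "socket closed" seen in MPI_Recv on the other side is then likely a consequence of rank 1 disappearing, not the original fault.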

If I run the program with fewer processes (NUMPROCS less than 7), the run goes fine.

It appears that rank 1 (my first node) is causing the collective abort of all ranks, but I could not identify why.  I tried the following solutions:

1. Increased the master node's memory to 32 GB
2. Increased the memory on all nodes to 32 GB
3. Moved rank 1 to a different node in the parallel run

In all of these situations, I get the same error.  Surprisingly, when I run smaller (less data-intensive) cases, I do not get this error even if I increase NUMPROCS to 32 processes.

Any help will be highly appreciated.

I am running MPICH 1.4.

Thank You
Abhishek
................................................................................................................
Abhishek Bhat, PhD, EPI,
Senior Consultant

Trinity Consultants
12770 Merit Drive, Suite 900  |  Dallas, Texas 75251
Office:  972-661-8100|  Mobile:  806-281-7617
Email:  abhat at trinityconsultants.com  |  LinkedIn:  www.linkedin.com/in/abhattrinityconsultants




