<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix"><br>
Hi Ron,<br>
<br>
It looks like at least one of your processes is receiving too many
unexpected messages and running out of memory as a result. Unexpected
messages are those that do not match a posted receive on the receiver
side. You may want to ask the application developers to
review the algorithm or look for a possible bug.<br>
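<br>
To make "unexpected" a bit more concrete, here is a minimal toy model in
Python of how a receiver might end up buffering messages. It is only a
sketch: the names ToyReceiver, post_receive, deliver and buffered_bytes
are made up for illustration and are not MPICH APIs, and real MPI
matching also looks at the source and communicator, which this ignores.
The point is just that every message arriving before its receive is
posted has to be held in memory by the library.<br>
<br>
<pre wrap="">
```python
class ToyReceiver:
    """Toy model of one receiving process (hypothetical, not MPICH internals)."""

    def __init__(self):
        self.posted = []      # tags of receives posted but not yet matched
        self.unexpected = []  # (tag, payload) pairs that arrived with no posted receive

    def post_receive(self, tag):
        # A new receive first checks whether a matching message already arrived.
        for i, (t, payload) in enumerate(self.unexpected):
            if t == tag:
                self.unexpected.pop(i)
                return payload
        # Nothing arrived yet: remember the receive so a later arrival can match it.
        self.posted.append(tag)
        return None

    def deliver(self, tag, payload):
        # An arriving message that matches a posted receive is consumed directly.
        if tag in self.posted:
            self.posted.remove(tag)
            return
        # Otherwise it is "unexpected" and must be buffered until a receive is posted.
        self.unexpected.append((tag, payload))

    def buffered_bytes(self):
        return sum(len(payload) for _, payload in self.unexpected)


rx = ToyReceiver()

# Senders run ahead of the receiver: 1000 messages of 1 KiB, no receives posted yet.
for _ in range(1000):
    rx.deliver(tag=7, payload=bytes(1024))
print(rx.buffered_bytes())  # 1024000

# Once the receiver posts matching receives, the buffered messages drain.
for _ in range(1000):
    rx.post_receive(tag=7)
print(rx.buffered_bytes())  # 0
```
</pre>
If the receives were posted before the messages arrived, nothing would
be buffered in this model, which is why reviewing the order of sends and
receives in the application is a good place to start.<br>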
<br>
Antonio<br>
<br>
<br>
On 05/22/2014 08:51 PM, Ron Palmer wrote:<br>
</div>
<blockquote cite="mid:537EA9B5.9030408@pgcgroup.com.au" type="cite">All,
<br>
I have been trying to run an inversion process (magsen3d) that
calculates a large matrix, perhaps on the order of 5-10GB. It
takes about 3 days on my cluster; however, it fails towards the
end, when it is trying to bring in the bits and pieces calculated
by the three computers that share the load (3 rack computers,
each with 2 CPUs). I have attached a screenshot of the output at
failure, including the command that starts the process. All output
after "0 5 10 ... 95 100" consists of error messages of some sort or
other.
<br>
<br>
This problem may be with the actual application (magsen3d) or the
hardware on these computers and not with MPICH2. I have no idea,
but I will ask the programmers as well (and they may also be on
this list). I have cut back on the number of nodes (threads/cores) I
use, as there were previously out-of-memory errors in /var/log.
However, this time there are no messages in any of those
logs on any of these three computers.
<br>
<br>
If any of you could help me assess whether this is a problem with
MPICH2 or something else, that would be a huge help to me. I
am running CentOS on two of the computers, including turner, and RH
on the third. Let me know if there is any additional info you may
need.
<br>
<br>
Thanks,
<br>
Ron
<br>
<br>
<br>
<br>
</blockquote>
<br>
<br>
<pre class="moz-signature" cols="72">--
Antonio J. Peña
Postdoctoral Appointee
Mathematics and Computer Science Division
Argonne National Laboratory
9700 South Cass Avenue, Bldg. 240, Of. 3148
Argonne, IL 60439-4847
<a class="moz-txt-link-abbreviated" href="mailto:apenya@mcs.anl.gov">apenya@mcs.anl.gov</a>
<a class="moz-txt-link-abbreviated" href="http://www.mcs.anl.gov/~apenya">www.mcs.anl.gov/~apenya</a></pre>
</body>
</html>