[mpich-discuss] process failing...

"Antonio J. Peña" apenya at mcs.anl.gov
Fri May 23 12:05:45 CDT 2014


Hi Ron,

It looks like at least one of your processes is receiving too many 
unexpected messages and running out of memory as a result. Unexpected 
messages are those that arrive before the receiver has posted a matching 
receive, so the MPI library has to buffer them until one is posted. You 
may want to ask the application developers to review the algorithm or 
look for a possible bug.
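
To illustrate the effect, here is a minimal sketch in Python. It is a toy 
model of one rank's matching logic, not MPICH's actual implementation; the 
names ToyReceiver, post_recv, arrive and run are hypothetical and made up 
for this example. It only shows that messages arriving before a matching 
receive is posted pile up in a buffer, while pre-posted receives leave 
nothing to buffer.

```python
class ToyReceiver:
    """Toy model of one rank's message matching (assumption: not MPICH internals)."""

    def __init__(self):
        self.posted = []            # tags of receives posted but not yet matched
        self.unexpected = []        # (tag, payload) pairs with no matching receive yet
        self.buffered_bytes = 0     # bytes currently held in the unexpected queue
        self.peak_buffered_bytes = 0

    def post_recv(self, tag):
        """Post a receive; consume an already-buffered message if one matches."""
        for i, (t, payload) in enumerate(self.unexpected):
            if t == tag:
                self.unexpected.pop(i)
                self.buffered_bytes -= len(payload)
                return payload
        self.posted.append(tag)
        return None

    def arrive(self, tag, payload):
        """A message arrives from the network."""
        if tag in self.posted:
            # A receive is waiting: the data goes straight to the user buffer.
            self.posted.remove(tag)
            return
        # No receive posted yet: the library must hold on to the data.
        self.unexpected.append((tag, payload))
        self.buffered_bytes += len(payload)
        self.peak_buffered_bytes = max(self.peak_buffered_bytes,
                                       self.buffered_bytes)


def run(n_messages, msg_bytes, prepost):
    """Return the peak bytes buffered as unexpected messages."""
    rx = ToyReceiver()
    if prepost:
        for tag in range(n_messages):
            rx.post_recv(tag)
    for tag in range(n_messages):
        rx.arrive(tag, bytes(msg_bytes))
    if not prepost:
        for tag in range(n_messages):
            rx.post_recv(tag)
    return rx.peak_buffered_bytes


if __name__ == "__main__":
    print("receives posted late: ", run(1000, 1024, prepost=False), "bytes buffered")
    print("receives posted early:", run(1000, 1024, prepost=True), "bytes buffered")
```

In a real MPI code, the usual ways to get the second behaviour are to post 
the receives (for example with MPI_Irecv) before the senders start, or to 
throttle the senders (for example with MPI_Ssend), but whether either 
applies to magsen3d is something only its developers can tell.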

   Antonio


On 05/22/2014 08:51 PM, Ron Palmer wrote:
> All,
> I have been trying to run an inversion process (magsen3d) that 
> calculates a large matrix, perhaps on the order of 5-10GB. It takes 
> about 3 days on my cluster; however, it fails towards the end, when it 
> is trying to bring in the bits and pieces calculated by the three 
> computers that share the load (3 rack computers, each with 2 CPUs). I 
> have attached a screenshot of the output at failure, including the 
> command that starts the process. All output after "0 5 10 ... 95 100" 
> is error messages of some sort or other.
>
> This problem may be with the actual application (magsen3d) or the 
> hardware on these computers and not with MPICH2. I have no idea, but I 
> will ask the programmers as well (and they may also be on this list). 
> I have cut back on the number of nodes (threads/cores) I use, as there 
> were previously out-of-memory errors in /var/log. However, this time 
> there are no messages in any of those logs on any of these three 
> computers.
>
> If any of you could help me assess whether this is a problem with 
> MPICH2 or something else, that would be a huge help to me. I am 
> running CentOS on two of the computers, including turner, and RH on the 
> third. Let me know if there is any additional info you need.
>
> Thanks,
> Ron


-- 
Antonio J. Peña
Postdoctoral Appointee
Mathematics and Computer Science Division
Argonne National Laboratory
9700 South Cass Avenue, Bldg. 240, Of. 3148
Argonne, IL 60439-4847
apenya at mcs.anl.gov
www.mcs.anl.gov/~apenya
