[mpich-discuss] process failing...

Ron Palmer ron.palmer at pgcgroup.com.au
Thu May 22 20:51:49 CDT 2014

I have been trying to run an inversion process (magsen3d) that 
calculates a large matrix, perhaps on the order of 5-10 GB. It takes 
about 3 days on my cluster; however, it fails towards the end, when it is 
trying to bring in the bits and pieces calculated on the three 
computers that share the load (3 rack computers, each with 2 CPUs). I have 
attached a screenshot of the output at failure, including the command 
that starts the process. All output after "0 5 10 ... 95 100" consists of 
error messages of some sort.

This problem may be with the actual application (magsen3d) or the hardware 
on these computers and not with MPICH2. I have no idea, but I will ask 
the programmers as well (they may also be on this list). I have cut 
back on the number of nodes (threads/cores) I use, as there were previously 
out-of-memory errors in the logs under /var/log. However, this time there 
are no messages in any of those logs on any of the three computers.
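
For anyone who wants to repeat the log check on each machine, here is a 
minimal Python sketch. It is only an illustration: the function names are 
made up for this example, the search phrases are the ones the Linux kernel 
typically prints when it runs out of memory, and /var/log/messages is 
assumed to be where the kernel log ends up on CentOS/RH.

```python
import re

# Assumption: kernel out-of-memory reports contain one of these phrases.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|oom_kill", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the lines that look like kernel out-of-memory reports."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path):
    """Scan one log file; a missing or unreadable file yields an empty list."""
    try:
        with open(path, errors="replace") as handle:
            return find_oom_lines(handle)
    except OSError:
        return []


if __name__ == "__main__":
    # Assumption: adjust the path if the system logs somewhere else.
    for hit in scan_log("/var/log/messages"):
        print(hit)
```

An empty result only means that no matching line was found in that one 
file. It does not rule out a memory problem, since rotated logs would need 
to be scanned as well.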

If any of you could help me assess whether this is a problem with MPICH2 
or something else, that would be a huge help to me. I am running 
CentOS on two of the computers, including turner, and RH on the third. 
Let me know if there is any additional information you need.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: possibleMPICH2error.jpg
Type: image/jpeg
Size: 236127 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140523/f9e06ad9/attachment.jpg>
