[mpich-discuss] process failing...
ron.palmer at pgcgroup.com.au
Thu May 22 20:51:49 CDT 2014
I have been trying to run an inversion process (magsen3d) that
calculates a large matrix, perhaps on the order of 5-10 GB. It takes
about 3 days on my cluster; however, it fails towards the end, when it
is trying to bring in the bits and pieces calculated by the three
computers that share the load (3 rack computers, each with 2 CPUs). I
have attached a screenshot of the output at failure, including the
command that starts the process. All output after "0 5 10 ... 95 100"
consists of error messages of some sort or other.
This problem may be with the actual application (magsen3d) or with the
hardware on these computers and not with MPICH2. I have no idea, but I
will ask the programmers as well (and they may also be on this list). I
have cut back on the number of nodes (threads/cores) I use, as there
were previously out-of-memory errors in the logs under /var/log.
However, this time there are no messages in any of those logs on any of
these three computers.
If any of you could help me assess whether this is a problem with MPICH2
or something else, that would be a huge help to me. I am running CentOS
on two of the computers, including turner, and Red Hat on the third.
Let me know if there is any additional info you may need.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 236127 bytes
Desc: not available