[mpich-discuss] Need help troubleshooting

mark dimitsas.markos at gmail.com
Mon Jun 30 05:20:00 CDT 2014


  Hello.
I need some help troubleshooting a program I wrote. For some
combinations of data size and node count the program runs fine, but for
others it fails. I use 2D arrays for my data and divide their rows among
the nodes. If the array has 320 rows and there are 16 nodes (8 physical
nodes with multi-threading), the program runs fine. If the array has 50
rows and there are 16 nodes, the program fails, yet with 50 rows and 2
or 4 nodes it runs OK.
  Is there a way to find the exact spot where the code is failing?
Also, some examples would do wonders. Thanks

PS-1: The errors that the program returns look like this:
rank 25 in job 29  Calliope_50394   caused collective abort of all ranks 
- exit status of rank 25: killed by signal 11

PS-2: I have written other MPI programs that worked. The only difference
is that in this program I use loops like this to split the data among
the nodes:

   for(i=id*n/p; i< (id+1)*n/p; i++){.....

where id is the id of the node, n is the size of the data collection,
and p is the number of nodes.
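
To make the loop bounds concrete, here is a minimal sketch in plain C
(no MPI calls) that evaluates the same start and end expressions for
every id. The helper names block_low, block_high, block_size and
print_partition are hypothetical and are not taken from the real
program. The sketch only shows which rows each id would get. It does not
identify the cause of the crash.

```c
#include <assert.h>
#include <stdio.h>

/* First row owned by `id`: same expression as the loop's start bound. */
static long block_low(long id, long p, long n)  { return id * n / p; }

/* One past the last row owned by `id`: the loop's end bound. */
static long block_high(long id, long p, long n) { return (id + 1) * n / p; }

/* Number of rows the loop visits for `id`. */
static long block_size(long id, long p, long n) {
    return block_high(id, p, n) - block_low(id, p, n);
}

/* Print the row range of every id and return how many ids own a
   row count different from n/p (integer division). */
static int print_partition(long n, long p) {
    assert(p > 0 && n >= 0);
    int uneven = 0;
    for (long id = 0; id < p; id++) {
        long rows = block_size(id, p, n);
        if (rows != n / p)
            uneven++;
        printf("id %2ld: rows [%ld, %ld)  count %ld\n",
               id, block_low(id, p, n), block_high(id, p, n), rows);
    }
    return uneven;
}
```

Calling print_partition(320, 16) from a main function shows 20 rows for
every id, and print_partition(50, 2) shows 25 rows for each. With
print_partition(50, 16) most ids get 3 rows, but ids 7 and 15 get 4, so
any buffer or message count sized as n/p would be one row short for
those two. If p is larger than n, some ids get an empty range.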


