[mpich-discuss] crash mpiexec

NARDI Luigi Luigi.NARDI at murex.com
Fri Nov 16 07:59:51 CST 2012


Hello, 

Thanks for your response. 

For the master node on Windows, I have pasted the core dump analysis below.
It seems there is something wrong with the function PMPI_Wtime.

I had some problems generating the core dump file on the ARM side: I can do it on a normal Linux x86 system but not on the CARMA boards.
This concerns the 4 slave nodes.
In any case, I suspect that the crash comes from the master node.
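
For reference, below is a minimal diagnostic sketch (not part of the test case, and it assumes a standard Linux /proc layout) that I could run on one of the carma nodes to check whether the core-file limit or the kernel core pattern is the reason no core file appears there:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
   /* A soft RLIMIT_CORE of 0 disables core dumps for the process. */
   struct rlimit rl;
   if (getrlimit(RLIMIT_CORE, &rl) == 0)
      printf("core limit: soft=%llu hard=%llu\n",
             (unsigned long long)rl.rlim_cur,
             (unsigned long long)rl.rlim_max);

   /* core_pattern controls where and under which name the kernel writes cores. */
   FILE *f = fopen("/proc/sys/kernel/core_pattern", "r");
   if (f != NULL)
   {
      char pattern[256];
      if (fgets(pattern, sizeof pattern, f) != NULL)
         printf("core_pattern: %s", pattern);
      fclose(f);
   }
   return 0;
}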

As I said in my previous email, the example I run is a trivial test case.
Maybe this crash comes from something going wrong in mpiexec when it runs on heterogeneous clusters (Windows/Linux and x86/ARM)?

Thanks
Luigi



Core:
Windows XP Version 2600 (Service Pack 3) MP (4 procs) Free x86 compatible
Product: WinNt, suite: SingleUserTS
Machine Name:
Debug session time: Fri Nov  9 12:08:23.000 2012 (UTC - 5:00)
System Uptime: 0 days 0:05:01.858
Process Uptime: 0 days 0:00:01.000
  Kernel time: 0 days 0:00:00.000
  User time: 0 days 0:00:00.000
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\oca.ini, error 2
TRIAGER: Could not open triage file : e:\dump_analysis\program\winxp\triage.ini, error 2
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\user.ini, error 2
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for mpich2.dll - 
*** ERROR: Module load completed but symbols could not be loaded for irsmsample.exe
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************

TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\guids.ini, error 2
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\modclass.ini, error 2

FAULTING_IP: 
mpich2!PMPI_Wtime+d9a6
0049a1c6 8b4804          mov     ecx,dword ptr [eax+4]

EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 0049a1c6 (mpich2!PMPI_Wtime+0x0000d9a6)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: 00000004
Attempt to read from address 00000004

DEFAULT_BUCKET_ID:  NULL_CLASS_PTR_READ

PROCESS_NAME:  irsmsample.exe

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".

EXCEPTION_PARAMETER1:  00000000

EXCEPTION_PARAMETER2:  00000004

READ_ADDRESS:  00000004 

FOLLOWUP_IP: 
mpich2!PMPI_Wtime+d9a6
0049a1c6 8b4804          mov     ecx,dword ptr [eax+4]

NTGLOBALFLAG:  0

APPLICATION_VERIFIER_FLAGS:  0

FAULTING_THREAD:  00000c0c

PRIMARY_PROBLEM_CLASS:  NULL_CLASS_PTR_READ

BUGCHECK_STR:  APPLICATION_FAULT_NULL_CLASS_PTR_READ

LAST_CONTROL_TRANSFER:  from 004013ae to 0049a1c6

STACK_TEXT:  
WARNING: Stack unwind information not available. Following frames may be wrong.
0012ff7c 004013ae 00000003 00384e20 00383a40 mpich2!PMPI_Wtime+0xd9a6
0012ffc0 7c817067 719c71e0 ffffffff 7ffd7000 irsmsample+0x13ae
0012fff0 00000000 00401555 00000000 00905a4d kernel32!BaseProcessStart+0x23


STACK_COMMAND:  ~0s; .ecxr ; kb

SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  mpich2!PMPI_Wtime+d9a6

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: mpich2

IMAGE_NAME:  mpich2.dll

DEBUG_FLR_IMAGE_TIMESTAMP:  4e5fe70e

FAILURE_BUCKET_ID:  NULL_CLASS_PTR_READ_c0000005_mpich2.dll!PMPI_Wtime

BUCKET_ID:  APPLICATION_FAULT_NULL_CLASS_PTR_READ_mpich2!PMPI_Wtime+d9a6

WATSON_STAGEONE_URL:  http://watson.microsoft.com/StageOne/irsmsample_exe/0_0_0_0/509d3866/mpich2_dll/0_0_0_0/4e5fe70e/c0000005/0008a1c6.htm?Retriage=1

Followup: MachineOwner
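
If I read the faulting instruction correctly, "mov ecx, dword ptr [eax+4]" together with READ_ADDRESS 00000004 looks like a dereference of the second 32-bit field of a NULL struct pointer, i.e. the same pattern as this illustration (my reading of the dump, not actual MPICH code):

struct pair
{
   int first;
   int second;
};

int read_second(struct pair *p)
{
   /* With p == NULL this reads address 0x00000004 on a 32-bit build,
      matching Parameter[1]/READ_ADDRESS 00000004 in the dump above. */
   return p->second;
}

Also, since only export symbols were available for mpich2.dll, I suppose the PMPI_Wtime+0xd9a6 frame may simply be the nearest exported symbol rather than the real faulting function.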


-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Calin Iaru
Sent: Thursday, November 8, 2012 11:05 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] crash mpiexec

This error code indicates an access violation inside MPI_Finalize(). I 
suggest you look at the core file.

/* requires #include <sys/resource.h> */
if(myid == 0) {
    struct rlimit rl;
    /* Raise the soft core-file size limit to the hard maximum so the
       kernel is allowed to write a core dump when the process faults. */
    if(getrlimit(RLIMIT_CORE, &rl) == 0) {
        if(rl.rlim_cur == 0) {
            rl.rlim_cur = rl.rlim_max;
            setrlimit(RLIMIT_CORE, &rl);
        }
    }
}
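
If it runs early enough (for example right after MPI_Comm_rank in the test case below, before anything that can fault), this raises the soft core-file size limit to the hard maximum so that the kernel is allowed to write a core file when the process crashes. On Linux, where the core file ends up and what it is called is governed by the kernel's core_pattern setting.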



From: NARDI Luigi
Sent: Wednesday, November 07, 2012 2:08 PM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] crash mpiexec


Hello,



I have an error using mpiexec (MPICH2 1.4.1p). I hope somebody can help.

The crash is random, i.e. the same executable sometimes crashes and sometimes does not.



Context:

5-node heterogeneous cluster:

4 nodes with CARMA (CUDA on ARM) on Ubuntu 11.4: the carrier board basically
consists of an ARM Cortex-A9 processor and an NVIDIA Quadro 1000M GPU card.

1 node with one Xeon E5620 processor on Windows XP + cygwin.

Standard Ethernet network.

Names of the 5 nodes: lnardi, carma1, carma2, carma3, carma4


The command line on the master node lnardi (the Windows node) is:

mpiexec -channel sock -n 1 -host lnardi a.out :
        -n 1 -host carma1 -path /home/lnardi/ a.out :
        -n 1 -host carma2 -path /home/lnardi/ a.out :
        -n 1 -host carma3 -path /home/lnardi/ a.out :
        -n 1 -host carma4 -path /home/lnardi/ a.out



Note that the same sample runs on a full Linux cluster with the following
characteristics: MVAPICH2-1.8a1p1 (mpirun) + Mellanox InfiniBand + Xeon
X5675 + NVIDIA M2090 GPUs + Red Hat Enterprise Linux Server release 6.2.



I was originally running more complicated code, but I have reproduced the error
with a trivial program:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
   char idstr[32];
   char buff[BUFSIZE];
   int numprocs;
   int myid;
   int i;

   MPI_Status stat;
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   if(myid == 0)
   {
      printf("%d: We have %d processors\n", myid, numprocs);
      for(i = 1; i < numprocs; i++)
      {
         sprintf(buff, "Hello %d! ", i);
         MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
      }
      for(i = 1; i < numprocs; i++)
      {
         MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
         printf("%d: %s\n", myid, buff);
      }
   }
   else
   {
      MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
      sprintf(idstr, "Processor %d ", myid);
      strncat(buff, idstr, BUFSIZE-1);
      strncat(buff, "reporting for duty\n", BUFSIZE-1);
      MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
   }

   MPI_Finalize();
   return 0;
}



The error:

0: We have 5 processors
0: Hello 1! Processor 1 reporting for duty
0: Hello 2! Processor 2 reporting for duty
0: Hello 3! Processor 3 reporting for duty
0: Hello 4! Processor 4 reporting for duty

job aborted:
rank: node: exit code[: error message]
0: lnardi: -1073741819: process 0 exited without calling finalize
1: carma1: -2
2: carma2: -2
3: carma3: -2
4: carma4: -2



I guess the problem comes from either the sock channel, mpiexec, or ARM.

What do you think?



Thanks

Dr Luigi Nardi










_______________________________________________
mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
To manage subscription options or unsubscribe:
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss 
