[mpich-discuss] crash mpiexec
NARDI Luigi
Luigi.NARDI at murex.com
Fri Nov 16 07:59:51 CST 2012
Hello,
Thanks for your response.
For the master node on Windows, I copy-paste the core dump analysis below.
It seems there is something wrong with the function PMPI_Wtime.
I had some problems generating the core dump file on the ARM side:
I can produce one on a normal Linux x86 system but not on the CARMA.
This concerns the 4 slave nodes.
Anyway, I guess that the crash comes from the master node.
As I said in my previous email, the example I run is a trivial test case.
Maybe this crash comes from something going wrong in mpiexec on heterogeneous clusters (Windows/Linux and x86/ARM)?
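Note that the debugger reports "Defaulted to export symbols for mpich2.dll" and "Stack unwind information not available", so PMPI_Wtime+0xd9a6 may just be the nearest exported symbol to the faulting address rather than the real culprit. A minimal test that only exercises MPI_Wtime, a barrier and MPI_Finalize might help narrow it down (just a sketch one could try on the same 5 nodes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();               /* the function the dump points at */
    MPI_Barrier(MPI_COMM_WORLD);    /* force some cross-node traffic */
    t1 = MPI_Wtime();

    printf("rank %d: barrier took %f s\n", rank, t1 - t0);

    MPI_Finalize();
    return 0;
}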
Thanks
Luigi
Core:
Windows XP Version 2600 (Service Pack 3) MP (4 procs) Free x86 compatible
Product: WinNt, suite: SingleUserTS
Machine Name:
Debug session time: Fri Nov 9 12:08:23.000 2012 (UTC - 5:00)
System Uptime: 0 days 0:05:01.858
Process Uptime: 0 days 0:00:01.000
Kernel time: 0 days 0:00:00.000
User time: 0 days 0:00:00.000
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\oca.ini, error 2
TRIAGER: Could not open triage file : e:\dump_analysis\program\winxp\triage.ini, error 2
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\user.ini, error 2
*** ERROR: Symbol file could not be found. Defaulted to export symbols for mpich2.dll -
*** ERROR: Module load completed but symbols could not be loaded for irsmsample.exe
*******************************************************************************
* *
* Exception Analysis *
* *
*******************************************************************************
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\guids.ini, error 2
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\modclass.ini, error 2
FAULTING_IP:
mpich2!PMPI_Wtime+d9a6
0049a1c6 8b4804 mov ecx,dword ptr [eax+4]
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 0049a1c6 (mpich2!PMPI_Wtime+0x0000d9a6)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: 00000004
Attempt to read from address 00000004
DEFAULT_BUCKET_ID: NULL_CLASS_PTR_READ
PROCESS_NAME: irsmsample.exe
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
EXCEPTION_PARAMETER1: 00000000
EXCEPTION_PARAMETER2: 00000004
READ_ADDRESS: 00000004
FOLLOWUP_IP:
mpich2!PMPI_Wtime+d9a6
0049a1c6 8b4804 mov ecx,dword ptr [eax+4]
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
FAULTING_THREAD: 00000c0c
PRIMARY_PROBLEM_CLASS: NULL_CLASS_PTR_READ
BUGCHECK_STR: APPLICATION_FAULT_NULL_CLASS_PTR_READ
LAST_CONTROL_TRANSFER: from 004013ae to 0049a1c6
STACK_TEXT:
WARNING: Stack unwind information not available. Following frames may be wrong.
0012ff7c 004013ae 00000003 00384e20 00383a40 mpich2!PMPI_Wtime+0xd9a6
0012ffc0 7c817067 719c71e0 ffffffff 7ffd7000 irsmsample+0x13ae
0012fff0 00000000 00401555 00000000 00905a4d kernel32!BaseProcessStart+0x23
STACK_COMMAND: ~0s; .ecxr ; kb
SYMBOL_STACK_INDEX: 0
SYMBOL_NAME: mpich2!PMPI_Wtime+d9a6
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: mpich2
IMAGE_NAME: mpich2.dll
DEBUG_FLR_IMAGE_TIMESTAMP: 4e5fe70e
FAILURE_BUCKET_ID: NULL_CLASS_PTR_READ_c0000005_mpich2.dll!PMPI_Wtime
BUCKET_ID: APPLICATION_FAULT_NULL_CLASS_PTR_READ_mpich2!PMPI_Wtime+d9a6
WATSON_STAGEONE_URL: http://watson.microsoft.com/StageOne/irsmsample_exe/0_0_0_0/509d3866/mpich2_dll/0_0_0_0/4e5fe70e/c0000005/0008a1c6.htm?Retriage=1
Followup: MachineOwner
-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Calin Iaru
Sent: Thursday, November 8, 2012 11:05 AM
To: mpich-discuss at mcs.anl.gov
Subject: Re: [mpich-discuss] crash mpiexec
This error code indicates an access violation inside MPI_Finalize(). I suggest you look at the core file.

/* Raise the core-file size limit so a crash actually produces a core dump.
   Needs <sys/resource.h>; place it after MPI_Comm_rank so that myid is set. */
if (myid == 0) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) == 0) {
        if (rl.rlim_cur == 0) {
            rl.rlim_cur = rl.rlim_max;
            setrlimit(RLIMIT_CORE, &rl);
        }
    }
}
From: NARDI Luigi
Sent: Wednesday, November 07, 2012 2:08 PM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] crash mpiexec
Hello,
I get an error when using mpiexec (MPICH2 1.4.1p). I hope somebody can help.
The crash is random, i.e. the same executable may or may not crash.
Context:
A heterogeneous cluster of 5 nodes:
4 nodes with CARMA (CUDA on ARM) on Ubuntu 11.4: the carrier board basically
consists of an ARM Cortex A9 processor and a Quadro 1000M NVIDIA GPU card.
1 node with one XEON E5620 processor on Windows XP + Cygwin.
Standard Ethernet network.
Names of the 5 nodes:
lnardi
carma1
carma2
carma3
carma4
The command line on the master node lnardi (Windows node) is:
mpiexec -channel sock -n 1 -host lnardi a.out :
-n 1 -host carma1 -path /home/lnardi/ a.out :
-n 1 -host carma2 -path /home/lnardi/ a.out :
-n 1 -host carma3 -path /home/lnardi/ a.out :
-n 1 -host carma4 -path /home/lnardi/ a.out
Note that the same sample runs fine on a full Linux cluster with the following
characteristics: MVAPICH2-1.8a1p1 (mpirun) + Mellanox InfiniBand + XEON
X5675 + NVIDIA M2090 GPUs + Red Hat Enterprise Linux Server release 6.2.
I was originally running a more complicated code, but I have reproduced the
error with this trivial program:
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
    char idstr[32];
    char buff[BUFSIZE];
    int numprocs;
    int myid;
    int i;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0)
    {
        printf("%d: We have %d processors\n", myid, numprocs);
        for (i = 1; i < numprocs; i++)
        {
            sprintf(buff, "Hello %d! ", i);
            MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
        }
        for (i = 1; i < numprocs; i++)
        {
            MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
            printf("%d: %s\n", myid, buff);
        }
    }
    else
    {
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
        sprintf(idstr, "Processor %d ", myid);
        strncat(buff, idstr, BUFSIZE-1);
        strncat(buff, "reporting for duty\n", BUFSIZE-1);
        MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
The error:
0: We have 5 processors
0: Hello 1! Processor 1 reporting for duty
0: Hello 2! Processor 2 reporting for duty
0: Hello 3! Processor 3 reporting for duty
0: Hello 4! Processor 4 reporting for duty
job aborted:
rank: node: exit code[: error message]
0: lnardi: -1073741819: process 0 exited without calling finalize
1: carma1: -2
2: carma2: -2
3: carma3: -2
4: carma4: -2
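For reference, rank 0's exit code looks like a Windows NTSTATUS value cast to a signed 32-bit integer; a tiny decode (just a sketch, assuming that interpretation) shows it is an access violation:

#include <stdio.h>

int main(void)
{
    int exit_code = -1073741819;                  /* value reported for rank 0 */
    printf("0x%08X\n", (unsigned int)exit_code);  /* prints 0xC0000005, STATUS_ACCESS_VIOLATION */
    return 0;
}

So the master process on the Windows node apparently died with an access violation, which matches the "process 0 exited without calling finalize" line.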
I guess the problem comes from either the sock channel, mpiexec, or the ARM port.
What do you think?
Thanks
Dr Luigi Nardi
*******************************
This e-mail contains information for the intended recipient only. It may
contain proprietary material or confidential information. If you are not the
intended recipient you are not authorised to distribute, copy or use this
e-mail or any attachment to it. Murex cannot guarantee that it is virus free
and accepts no responsibility for any loss or damage arising from its use.
If you have received this e-mail in error please notify immediately the
sender and delete the original email received, any attachments and all
copies from your system.
_______________________________________________
mpich-discuss mailing list mpich-discuss at mcs.anl.gov
To manage subscription options or unsubscribe:
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss