From nazanin.mirshokraei at gmail.com Sat Nov 29 21:00:06 2014 From: nazanin.mirshokraei at gmail.com (=?UTF-8?B?2YbYp9iy2YbbjNmG?=) Date: Sat, 29 Nov 2014 19:00:06 -0800 Subject: [mpich-discuss] mpich run error In-Reply-To: References: Message-ID: please give me more detail . you mean to compile mpich-3.0.4 with what flags? ./configure --prefix=/home/nazanin/program_install/mpich-3.0.4 make make install do u mean to add another flag to the command ? thanks nazi On Sat, Nov 29, 2014 at 8:53 AM, Junchao Zhang wrote: > Try to compile your program with mpicc or mpicxx > On Nov 29, 2014 8:15 AM, "??????" wrote: > >> thank u for your reply .... can you please tell me what is the reason? it >> is my first time running a mpi .... >> i compiled a program with cpp which tells to use : >> export USE_MPI=on # distributed-memory parallelism >> export USE_MPIF90=on # compile with mpif90 script >> export which_MPI=mpich # compile with MPICH library >> #export which_MPI=mpich2 # compile with MPICH2 library >> # export which_MPI=openmpi # compile with OpenMPI library >> >> #export USE_OpenMP=on # shared-memory parallelism >> >> # export FORT=ifort >> export FORT=gfortran >> #export FORT=pgi >> >> #export USE_DEBUG= # use Fortran debugging flags >> export USE_LARGE=on # activate 64-bit compilation >> #export USE_NETCDF4=on # compile with NetCDF-4 library >> #export USE_PARALLEL_IO=on # Parallel I/O with Netcdf-4/HDF5 >> >> #export USE_MY_LIBS=on # use my library paths below >> >> >> >> and then it gives me a file(coawstM) and the manual tells me to run that >> file in this way : >> mpirun -np 2 ./coawstM swan_only.in >> >> what will be the reason? is it because mpich was not compiled truely or >> the coawstM is not working for a reason? >> thank you >> >> On Sat, Nov 29, 2014 at 6:07 AM, Junchao Zhang >> wrote: >> >>> As indicated by the message, you passed an invalid communicator to MPI_Comm_rank() >>> in your program. >>> >>> --Junchao Zhang >>> >>> On Sat, Nov 29, 2014 at 4:32 AM, ?????? 
>>> wrote: >>> >>>> hi does any one know about this error : >>>> Fatal error in PMPI_Comm_rank: Invalid communicator, error stack: >>>> PMPI_Comm_rank(108): MPI_Comm_rank(comm=0x0, rank=0x7fff24a405bc) failed >>>> PMPI_Comm_rank(66).: Invalid communicator >>>> Fatal error in PMPI_Comm_rank: Invalid communicator, error stack: >>>> PMPI_Comm_rank(108): MPI_Comm_rank(comm=0x0, rank=0x7fff63e1598c) failed >>>> PMPI_Comm_rank(66).: Invalid communicator >>>> >>>> it is when i run mpich like : >>>> mpirun -np 2 ./coawstM swan_only.in >>>> >>>> i don't know if this is a problem in my executable file (coawstM) or >>>> this is from mpich or any other hardware issues or anything else >>>> thank you for your help >>>> nazi >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Sat Nov 29 10:53:58 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Sat, 29 Nov 2014 10:53:58 -0600 Subject: [mpich-discuss] mpich run error In-Reply-To: References: Message-ID: Try to compile your program with mpicc or mpicxx On Nov 29, 2014 8:15 AM, "??????" wrote: > thank u for your reply .... can you please tell me what is the reason? it > is my first time running a mpi .... > i compiled a program with cpp which tells to use : > export USE_MPI=on # distributed-memory parallelism > export USE_MPIF90=on # compile with mpif90 script > export which_MPI=mpich # compile with MPICH library > #export which_MPI=mpich2 # compile with MPICH2 library > # export which_MPI=openmpi # compile with OpenMPI library > > #export USE_OpenMP=on # shared-memory parallelism > > # export FORT=ifort > export FORT=gfortran > #export FORT=pgi > > #export USE_DEBUG= # use Fortran debugging flags > export USE_LARGE=on # activate 64-bit compilation > #export USE_NETCDF4=on # compile with NetCDF-4 library > #export USE_PARALLEL_IO=on # Parallel I/O with Netcdf-4/HDF5 > > #export USE_MY_LIBS=on # use my library paths below > > > > and then it gives me a file(coawstM) and the manual tells me to run that > file in this way : > mpirun -np 2 ./coawstM swan_only.in > > what will be the reason? is it because mpich was not compiled truely or > the coawstM is not working for a reason? > thank you > > On Sat, Nov 29, 2014 at 6:07 AM, Junchao Zhang > wrote: > >> As indicated by the message, you passed an invalid communicator to MPI_Comm_rank() >> in your program. >> >> --Junchao Zhang >> >> On Sat, Nov 29, 2014 at 4:32 AM, ?????? 
>> wrote: >> >>> hi does any one know about this error : >>> Fatal error in PMPI_Comm_rank: Invalid communicator, error stack: >>> PMPI_Comm_rank(108): MPI_Comm_rank(comm=0x0, rank=0x7fff24a405bc) failed >>> PMPI_Comm_rank(66).: Invalid communicator >>> Fatal error in PMPI_Comm_rank: Invalid communicator, error stack: >>> PMPI_Comm_rank(108): MPI_Comm_rank(comm=0x0, rank=0x7fff63e1598c) failed >>> PMPI_Comm_rank(66).: Invalid communicator >>> >>> it is when i run mpich like : >>> mpirun -np 2 ./coawstM swan_only.in >>> >>> i don't know if this is a problem in my executable file (coawstM) or >>> this is from mpich or any other hardware issues or anything else >>> thank you for your help >>> nazi >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From rndfax at yandex.ru Sat Nov 29 14:14:10 2014 From: rndfax at yandex.ru (Kuleshov Aleksey) Date: Sat, 29 Nov 2014 23:14:10 +0300 Subject: [mpich-discuss] [patch] fix bug in NEWMAD and MXM netmods Message-ID: <996211417292050@web23h.yandex.ru> diff --git a/src/mpid/ch3/channels/nemesis/netmod/mxm/mxm_poll.c b/src/mpid/ch3/channels/nemesis/netmod/mxm/mxm_poll.c index e8bddc3..752a1f9 100644 --- a/src/mpid/ch3/channels/nemesis/netmod/mxm/mxm_poll.c +++ b/src/mpid/ch3/channels/nemesis/netmod/mxm/mxm_poll.c @@ -482,8 +482,8 @@ static int _mxm_process_rdtype(MPID_Request ** rreq_p, MPI_Datatype datatype, *iov_count = n_iov; } else { - int packsize = 0; - MPIR_Pack_size_impl(rreq->dev.user_count, rreq->dev.datatype, (MPI_Aint *) & packsize); + MPI_Aint packsize = 0; + MPIR_Pack_size_impl(rreq->dev.user_count, rreq->dev.datatype, &packsize); rreq->dev.tmpbuf = MPIU_Malloc((size_t) packsize); MPIU_Assert(rreq->dev.tmpbuf); rreq->dev.tmpbuf_sz = packsize; diff --git a/src/mpid/ch3/channels/nemesis/netmod/newmad/newmad_poll.c b/src/mpid/ch3/channels/nemesis/netmod/newmad/newmad_poll.c index 2dba872..5a32515 100644 --- a/src/mpid/ch3/channels/nemesis/netmod/newmad/newmad_poll.c +++ b/src/mpid/ch3/channels/nemesis/netmod/newmad/newmad_poll.c @@ -575,7 +575,7 @@ int MPID_nem_newmad_process_rdtype(MPID_Request **rreq_p, MPID_Datatype * dt_ptr } else { - int packsize = 0; + MPI_Aint packsize = 0; MPIR_Pack_size_impl(rreq->dev.user_count, rreq->dev.datatype, &packsize); rreq->dev.tmpbuf = MPIU_Malloc((size_t) packsize); MPIU_Assert(rreq->dev.tmpbuf); _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Sat Nov 29 08:07:02 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Sat, 29 Nov 2014 08:07:02 -0600 Subject: [mpich-discuss] mpich run error 
In-Reply-To: References: Message-ID: As indicated by the message, you passed an invalid communicator to MPI_Comm_rank() in your program. --Junchao Zhang On Sat, Nov 29, 2014 at 4:32 AM, ?????? wrote: > hi does any one know about this error : > Fatal error in PMPI_Comm_rank: Invalid communicator, error stack: > PMPI_Comm_rank(108): MPI_Comm_rank(comm=0x0, rank=0x7fff24a405bc) failed > PMPI_Comm_rank(66).: Invalid communicator > Fatal error in PMPI_Comm_rank: Invalid communicator, error stack: > PMPI_Comm_rank(108): MPI_Comm_rank(comm=0x0, rank=0x7fff63e1598c) failed > PMPI_Comm_rank(66).: Invalid communicator > > it is when i run mpich like : > mpirun -np 2 ./coawstM swan_only.in > > i don't know if this is a problem in my executable file (coawstM) or this > is from mpich or any other hardware issues or anything else > thank you for your help > nazi > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From nazanin.mirshokraei at gmail.com Sat Nov 29 08:15:10 2014 From: nazanin.mirshokraei at gmail.com (=?UTF-8?B?2YbYp9iy2YbbjNmG?=) Date: Sat, 29 Nov 2014 06:15:10 -0800 Subject: [mpich-discuss] mpich run error In-Reply-To: References: Message-ID: thank u for your reply .... can you please tell me what is the reason? it is my first time running a mpi .... i compiled a program with cpp which tells to use : export USE_MPI=on # distributed-memory parallelism export USE_MPIF90=on # compile with mpif90 script export which_MPI=mpich # compile with MPICH library #export which_MPI=mpich2 # compile with MPICH2 library # export which_MPI=openmpi # compile with OpenMPI library #export USE_OpenMP=on # shared-memory parallelism # export FORT=ifort export FORT=gfortran #export FORT=pgi #export USE_DEBUG= # use Fortran debugging flags export USE_LARGE=on # activate 64-bit compilation #export USE_NETCDF4=on # compile with NetCDF-4 library #export USE_PARALLEL_IO=on # Parallel I/O with Netcdf-4/HDF5 #export USE_MY_LIBS=on # use my library paths below and then it gives me a file(coawstM) and the manual tells me to run that file in this way : mpirun -np 2 ./coawstM swan_only.in what will be the reason? is it because mpich was not compiled truely or the coawstM is not working for a reason? thank you On Sat, Nov 29, 2014 at 6:07 AM, Junchao Zhang wrote: > As indicated by the message, you passed an invalid communicator to MPI_Comm_rank() > in your program. > > --Junchao Zhang > > On Sat, Nov 29, 2014 at 4:32 AM, ?????? 
> wrote: > >> Hi, does anyone know about this error: >> Fatal error in PMPI_Comm_rank: Invalid communicator, error stack: >> PMPI_Comm_rank(108): MPI_Comm_rank(comm=0x0, rank=0x7fff24a405bc) failed >> PMPI_Comm_rank(66).: Invalid communicator >> Fatal error in PMPI_Comm_rank: Invalid communicator, error stack: >> PMPI_Comm_rank(108): MPI_Comm_rank(comm=0x0, rank=0x7fff63e1598c) failed >> PMPI_Comm_rank(66).: Invalid communicator >> >> It happens when I run MPICH like this: >> mpirun -np 2 ./coawstM swan_only.in >> >> I don't know whether the problem is in my executable file (coawstM), in MPICH, in the hardware, or something else. >> Thank you for your help. >> nazi >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Sat Nov 29 08:04:24 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Sat, 29 Nov 2014 08:04:24 -0600 Subject: [mpich-discuss] (no subject) In-Reply-To: References: Message-ID: I don't know what you meant by "when I add a piece 'B' of code that aims to read a .txt file, nothing appears". The code you added in B is a normal file read/write. You can easily debug it with gdb.
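For illustration, a defensive version of that file-reading block might look like the sketch below. The file name fichier1.txt, the size header, and the 20x20 limit are taken from the program quoted further down; everything else, including the helper name lire_matrice, is made up for the example. It reports every failure explicitly, which makes a silent "nothing appears" easy to pin down, with or without gdb:

#include <stdio.h>

#define MAX_LIGNE 20
#define MAX_COL   20

/* Read an m x n integer matrix from a text file whose first line holds
 * the two dimensions.  Every failure path prints a reason instead of
 * silently producing no output. */
static int lire_matrice(const char *nom, int A[MAX_LIGNE][MAX_COL],
                        int *m, int *n)
{
    FILE *f = fopen(nom, "r");
    if (f == NULL) {
        perror(nom);                     /* says why the open failed */
        return -1;
    }
    if (fscanf(f, "%d %d", m, n) != 2 || *m > MAX_LIGNE || *n > MAX_COL) {
        fprintf(stderr, "%s: bad or missing size line\n", nom);
        fclose(f);
        return -1;
    }
    for (int i = 0; i < *m; i++) {
        for (int j = 0; j < *n; j++) {
            if (fscanf(f, "%d", &A[i][j]) != 1) {
                fprintf(stderr, "%s: short read at row %d, column %d\n",
                        nom, i, j);
                fclose(f);
                return -1;
            }
        }
    }
    fclose(f);                           /* close only a file that was actually opened */
    return 0;
}

Called from the same place in program B, a non-zero return plus the message on stderr shows immediately whether fichier1.txt was actually found in the directory the process was started in.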
--Junchao Zhang On Fri, Nov 28, 2014 at 6:29 PM, Chafik sanaa wrote: > when I execute the "A" program the result is displayed , but when i trying > to add a piece "B" of code aims to read a .txt file nothing appear: > > program A: > --------------------------------------- > #include > #include > #include > #include > #include "math.h" > #include "mpi.h" > > #define MAX_LIGNE 20 /* au maximum, 10 lignes */ > #define MAX_COL 20 /* au maximum 8 colonnes */ > #define NOM_A_LIRE "fichier1.txt" /*matrice a lire*/ > typedef int Matrice[MAX_LIGNE][MAX_COL]; > int main(int argc, char** argv) > { > int taskid, ntasks; > int ierr, i, itask; > int sendcounts[2048], displs[2048], recvcount; > double **sendbuff, *recvbuff, buffsum, buffsums[2048]; > double inittime, totaltime; > const int nbr_etat = 10; > double tab[nbr_etat]; > > > > for (int i = 0; i < nbr_etat; i++) > tab[i] = i; > > for (int i = 0; i < nbr_etat; i++) > printf("%0.0f ", tab[i]); > printf("\n"); > int nbr_elm[2] = { 4, 6 }; > int dpl[2] = { 0, 4 }; > > MPI_Init(&argc, &argv); > MPI_Comm_rank(MPI_COMM_WORLD, &taskid); > MPI_Comm_size(MPI_COMM_WORLD, &ntasks); > > recvbuff = (double *)malloc(sizeof(double)*nbr_etat); > if (taskid == 0) > { > sendbuff = (double **)malloc(sizeof(double *)*ntasks);// on execute pour > deux proc > sendbuff[0] = (double *)malloc(sizeof(double)*ntasks*nbr_etat); > for (i = 1; i < ntasks; i++) > { > sendbuff[i] = sendbuff[i - 1] + nbr_etat; > } > } > else > { > sendbuff = (double **)malloc(sizeof(double *)* 1); > sendbuff[0] = (double *)malloc(sizeof(double)* 1); > } > > if (taskid == 0){ > > srand((unsigned)time(NULL) + taskid); > for (itask = 0; itask { > int k; > displs[itask] = itask*nbr_etat; > int s = dpl[itask]; > sendcounts[itask] = nbr_elm[itask]; > > > for (i = 0; i { > k = i + s; > sendbuff[itask][i] = tab[k]; > printf("+ %0.0f ", sendbuff[itask][i]); > } > printf("\n"); > } > } > > recvcount = nbr_elm[taskid]; > > inittime = MPI_Wtime(); > > ierr = MPI_Scatterv(sendbuff[0], sendcounts, displs, MPI_DOUBLE, > recvbuff, recvcount, MPI_DOUBLE, > 0, MPI_COMM_WORLD); > > totaltime = MPI_Wtime() - inittime; > > printf("\n >>>> \n"); > buffsum = 0.0; > printf("\n > %d < \n", taskid); > for (i = 0; i { > buffsum = buffsum + recvbuff[i]; > printf("* %0.0f ", recvbuff[i]); > } > printf("\n"); > printf("(%d) %0.3f ", taskid, buffsum); > > ierr = MPI_Gather(&buffsum, 1, MPI_DOUBLE, > buffsums, 1, MPI_DOUBLE, > 0, MPI_COMM_WORLD); > if (taskid == 0) > { > printf("\n"); > printf("##########################################################\n\n"); > printf(" --> APReS COMMUNICATION <-- \n\n"); > for (itask = 0; itask { > printf("Processus %d : Somme du vecteur re?u : %0.0f\n", > itask, buffsums[itask]); > } > printf("\n"); > printf("##########################################################\n\n"); > printf(" Temps total de communication : %f secondes\n\n", totaltime); > printf("##########################################################\n\n"); > } > > /* Lib?ration de la m?moire */ > if (taskid == 0) > { > free(sendbuff[0]); > free(sendbuff); > } > else > { > free(sendbuff[0]); > free(sendbuff); > free(recvbuff); > } > > /* Finalisation de MPI */ > MPI_Finalize(); > } > > program B: > ------------------------- > #include > #include > #include > #include > #include "math.h" > #include "mpi.h" > > #define MAX_LIGNE 20 /* au maximum, 10 lignes */ > #define MAX_COL 20 /* au maximum 8 colonnes */ > #define NOM_A_LIRE "fichier1.txt" /*matrice a lire*/ > typedef int Matrice[MAX_LIGNE][MAX_COL]; > int main(int argc, char** 
argv) > { > int taskid, ntasks; > int ierr, i, itask; > int sendcounts[2048], displs[2048], recvcount; > double **sendbuff, *recvbuff, buffsum, buffsums[2048]; > double inittime, totaltime; > const int nbr_etat = 10; > double tab[nbr_etat]; > > > > for (int i = 0; i < nbr_etat; i++) > tab[i] = i; > > for (int i = 0; i < nbr_etat; i++) > printf("%0.0f ", tab[i]); > printf("\n"); > int nbr_elm[2] = { 4, 6 }; > int dpl[2] = { 0, 4 }; > > MPI_Init(&argc, &argv); > MPI_Comm_rank(MPI_COMM_WORLD, &taskid); > MPI_Comm_size(MPI_COMM_WORLD, &ntasks); > > recvbuff = (double *)malloc(sizeof(double)*nbr_etat); > if (taskid == 0) > { > sendbuff = (double **)malloc(sizeof(double *)*ntasks);// on execute pour > deux proc > sendbuff[0] = (double *)malloc(sizeof(double)*ntasks*nbr_etat); > for (i = 1; i < ntasks; i++) > { > sendbuff[i] = sendbuff[i - 1] + nbr_etat; > } > } > else > { > sendbuff = (double **)malloc(sizeof(double *)* 1); > sendbuff[0] = (double *)malloc(sizeof(double)* 1); > } > > if (taskid == 0){ > > srand((unsigned)time(NULL) + taskid); > for (itask = 0; itask { > int k; > displs[itask] = itask*nbr_etat; > int s = dpl[itask]; > sendcounts[itask] = nbr_elm[itask]; > > > for (i = 0; i { > k = i + s; > sendbuff[itask][i] = tab[k]; > printf("+ %0.0f ", sendbuff[itask][i]); > } > printf("\n"); > } > } > > recvcount = nbr_elm[taskid]; > > inittime = MPI_Wtime(); > > ierr = MPI_Scatterv(sendbuff[0], sendcounts, displs, MPI_DOUBLE, > recvbuff, recvcount, MPI_DOUBLE, > 0, MPI_COMM_WORLD); > > totaltime = MPI_Wtime() - inittime; > > printf("\n >>>> \n"); > buffsum = 0.0; > printf("\n > %d < \n", taskid); > for (i = 0; i { > buffsum = buffsum + recvbuff[i]; > printf("* %0.0f ", recvbuff[i]); > } > printf("\n"); > printf("(%d) %0.3f ", taskid, buffsum); > > ierr = MPI_Gather(&buffsum, 1, MPI_DOUBLE, > buffsums, 1, MPI_DOUBLE, > 0, MPI_COMM_WORLD); > if (taskid == 0) > { > printf("\n"); > printf("##########################################################\n\n"); > printf(" --> APReS COMMUNICATION <-- \n\n"); > for (itask = 0; itask { > printf("Processus %d : Somme du vecteur re?u : %0.0f\n", > itask, buffsums[itask]); > } > printf("\n"); > printf("##########################################################\n\n"); > printf(" Temps total de communication : %f secondes\n\n", totaltime); > printf("##########################################################\n\n"); > } > > /*===============================================================*/ > /* Lib?ration de la m?moire */ > if (taskid == 0) > { > free(sendbuff[0]); > free(sendbuff); > } > else > { > free(sendbuff[0]); > free(sendbuff); > free(recvbuff); > } > > /*===============================================================*/ > /* Finalisation de MPI */ > MPI_Finalize(); > Matrice A; > int m, n; > /* ouvrir le fichier Matrices.dta sur le disque r?seau pour la lecture */ > FILE * aLire = fopen(NOM_A_LIRE, "r"); > if (aLire == NULL) > { > printf("Le fichier n'existe pas"); > } > else > { > fscanf(aLire, "%d%d\n", &m, &n); > ////void lire(FILE * aLire, Matrice mat, int nbLigne, int nbCol) > for (int i = 0; i < m; i++) { > for (int j = 0; j < n; j++) > fscanf(aLire, "%d", &A[i][j]); > fscanf(aLire, "\n"); > } > ////void afficher(Matrice mat, char nom[], int nbLigne, int nbCol) > > printf("\nContenu de la matrice de %d ligne(s) et %d colonne(s) :\n", m, > n); > for (int i = 0; i < m; i++) { > for (int j = 0; j < n; j++) > printf("%d ", A[i][j]); > printf("\n"); > } > printf("\n"); > > > } > fclose(aLire); > > } > > -------------- next part -------------- An HTML 
attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From nazanin.mirshokraei at gmail.com Sat Nov 29 04:32:32 2014 From: nazanin.mirshokraei at gmail.com (=?UTF-8?B?2YbYp9iy2YbbjNmG?=) Date: Sat, 29 Nov 2014 02:32:32 -0800 Subject: [mpich-discuss] mpich run error Message-ID: hi does any one know about this error : Fatal error in PMPI_Comm_rank: Invalid communicator, error stack: PMPI_Comm_rank(108): MPI_Comm_rank(comm=0x0, rank=0x7fff24a405bc) failed PMPI_Comm_rank(66).: Invalid communicator Fatal error in PMPI_Comm_rank: Invalid communicator, error stack: PMPI_Comm_rank(108): MPI_Comm_rank(comm=0x0, rank=0x7fff63e1598c) failed PMPI_Comm_rank(66).: Invalid communicator it is when i run mpich like : mpirun -np 2 ./coawstM swan_only.in i don't know if this is a problem in my executable file (coawstM) or this is from mpich or any other hardware issues or anything else thank you for your help nazi -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From san.chafik at gmail.com Fri Nov 28 18:29:51 2014 From: san.chafik at gmail.com (Chafik sanaa) Date: Sat, 29 Nov 2014 01:29:51 +0100 Subject: [mpich-discuss] (no subject) Message-ID: when I execute the "A" program the result is displayed , but when i trying to add a piece "B" of code aims to read a .txt file nothing appear: program A: --------------------------------------- #include #include #include #include #include "math.h" #include "mpi.h" #define MAX_LIGNE 20 /* au maximum, 10 lignes */ #define MAX_COL 20 /* au maximum 8 colonnes */ #define NOM_A_LIRE "fichier1.txt" /*matrice a lire*/ typedef int Matrice[MAX_LIGNE][MAX_COL]; int main(int argc, char** argv) { int taskid, ntasks; int ierr, i, itask; int sendcounts[2048], displs[2048], recvcount; double **sendbuff, *recvbuff, buffsum, buffsums[2048]; double inittime, totaltime; const int nbr_etat = 10; double tab[nbr_etat]; for (int i = 0; i < nbr_etat; i++) tab[i] = i; for (int i = 0; i < nbr_etat; i++) printf("%0.0f ", tab[i]); printf("\n"); int nbr_elm[2] = { 4, 6 }; int dpl[2] = { 0, 4 }; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &taskid); MPI_Comm_size(MPI_COMM_WORLD, &ntasks); recvbuff = (double *)malloc(sizeof(double)*nbr_etat); if (taskid == 0) { sendbuff = (double **)malloc(sizeof(double *)*ntasks);// on execute pour deux proc sendbuff[0] = (double *)malloc(sizeof(double)*ntasks*nbr_etat); for (i = 1; i < ntasks; i++) { sendbuff[i] = sendbuff[i - 1] + nbr_etat; } } else { sendbuff = (double **)malloc(sizeof(double *)* 1); sendbuff[0] = (double *)malloc(sizeof(double)* 1); } if (taskid == 0){ srand((unsigned)time(NULL) + taskid); for (itask = 0; itask>>> \n"); buffsum = 0.0; printf("\n > %d < \n", taskid); for (i = 0; i APReS COMMUNICATION <-- \n\n"); for (itask = 0; itask #include #include #include #include "math.h" #include "mpi.h" #define MAX_LIGNE 20 /* au maximum, 10 lignes */ #define MAX_COL 20 /* au maximum 8 colonnes */ #define NOM_A_LIRE "fichier1.txt" /*matrice a lire*/ typedef int Matrice[MAX_LIGNE][MAX_COL]; int main(int argc, char** argv) { int taskid, ntasks; int ierr, i, itask; int sendcounts[2048], 
displs[2048], recvcount; double **sendbuff, *recvbuff, buffsum, buffsums[2048]; double inittime, totaltime; const int nbr_etat = 10; double tab[nbr_etat]; for (int i = 0; i < nbr_etat; i++) tab[i] = i; for (int i = 0; i < nbr_etat; i++) printf("%0.0f ", tab[i]); printf("\n"); int nbr_elm[2] = { 4, 6 }; int dpl[2] = { 0, 4 }; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &taskid); MPI_Comm_size(MPI_COMM_WORLD, &ntasks); recvbuff = (double *)malloc(sizeof(double)*nbr_etat); if (taskid == 0) { sendbuff = (double **)malloc(sizeof(double *)*ntasks);// on execute pour deux proc sendbuff[0] = (double *)malloc(sizeof(double)*ntasks*nbr_etat); for (i = 1; i < ntasks; i++) { sendbuff[i] = sendbuff[i - 1] + nbr_etat; } } else { sendbuff = (double **)malloc(sizeof(double *)* 1); sendbuff[0] = (double *)malloc(sizeof(double)* 1); } if (taskid == 0){ srand((unsigned)time(NULL) + taskid); for (itask = 0; itask>>> \n"); buffsum = 0.0; printf("\n > %d < \n", taskid); for (i = 0; i APReS COMMUNICATION <-- \n\n"); for (itask = 0; itask -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From san.chafik at gmail.com Fri Nov 28 14:20:52 2014 From: san.chafik at gmail.com (Chafik sanaa) Date: Fri, 28 Nov 2014 21:20:52 +0100 Subject: [mpich-discuss] the SCATTERV function In-Reply-To: References: Message-ID: I was able to correct, thanks #include #include #include #include #include "math.h" #include "mpi.h" int main(int argc, char** argv) { int taskid, ntasks; int ierr, i, itask; int sendcounts[2048], displs[2048], recvcount; double **sendbuff, *recvbuff, buffsum, buffsums[2048]; double inittime, totaltime; const int nbr_etat = 10; double tab[nbr_etat]; for (int i = 0; i < nbr_etat; i++) tab[i] = i; for (int i = 0; i < nbr_etat; i++) printf("%0.0f ", tab[i]); printf("\n"); int nbr_elm[2] = { 4, 6 }; int dpl[2] = { 0, 4 }; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &taskid); MPI_Comm_size(MPI_COMM_WORLD, &ntasks); recvbuff = (double *)malloc(sizeof(double)*nbr_etat); if (taskid == 0) { sendbuff = (double **)malloc(sizeof(double *)*ntasks);// on execute pour deux proc sendbuff[0] = (double *)malloc(sizeof(double)*ntasks*nbr_etat); for (i = 1; i < ntasks; i++) { sendbuff[i] = sendbuff[i - 1] + nbr_etat; } } else { sendbuff = (double **)malloc(sizeof(double *)* 1); sendbuff[0] = (double *)malloc(sizeof(double)* 1); } if (taskid == 0){ srand((unsigned)time(NULL) + taskid); for (itask = 0; itask>>> \n"); buffsum = 0.0; printf("\n > %d < \n", taskid); for (i = 0; i: > Hi > when i execute this program (I use two processes) i have a error in the > display part : > * RESULT OF EXECUTION: > 0 1 2 3 4 5 6 7 8 9 > >> (1,6) << > > >>>> > > > 1 < > * -6277438562204192500000000000000000000000000000000000000000000000000 * > -6277438562204192500000000000000000000000000000000000000000000000000 * > -6277438562204192500000000000000000000000000000000000000000000000000 * > -6277438562204192500000000000000000000000000000000000000000000000000 * > -6277438562204192500000000000000000000000000000000000000000000000000 * > -6277438562204192500000000000000000000000000000000000000000000000000 > 0 1 2 3 4 5 6 7 8 9 > 0 1 2 3 > 4 5 6 7 8 9 > >> (0,4) << > > >>>> > > > 0 < > * 0 * 1 * 2 * 3 > * PROGRAM > #include > #include > #include > #include > #include "math.h" > #include "mpi.h" > > int main(int argc, char** argv) > { > int taskid, ntasks; > int ierr, 
i, itask; > int sendcounts[2048], displs[2048], recvcount; > double **sendbuff, *recvbuff, buffsum, buffsums[2048]; > double inittime, totaltime; > const int nbr_etat = 10; > double tab[nbr_etat]; > > for (int i = 0; i < nbr_etat; i++) > tab[i] = i; > > for (int i = 0; i < nbr_etat; i++) > printf("%0.0f ", tab[i]); > printf("\n"); > int nbr_elm[2] = { 4, 6 }; > int dpl[2] = { 0, 4 }; > > MPI_Init(&argc, &argv); > MPI_Comm_rank(MPI_COMM_WORLD, &taskid); > MPI_Comm_size(MPI_COMM_WORLD, &ntasks); > > > recvbuff = (double *)malloc(sizeof(double)*nbr_etat); > if (taskid == 0) > { > sendbuff = (double **)malloc(sizeof(double *)*ntasks);// on execute pour > deux proc > sendbuff[0] = (double *)malloc(sizeof(double)*ntasks*nbr_etat); > for (i = 1; i < ntasks; i++) > { > sendbuff[i] = sendbuff[i - 1] + nbr_etat; > } > } > else > { > sendbuff = (double **)malloc(sizeof(double *)* 1); > sendbuff[0] = (double *)malloc(sizeof(double)* 1); > } > > if (taskid == 0){ > > > > > srand((unsigned)time(NULL) + taskid); > for (itask = 0; itask { > int k; > displs[itask] = dpl[itask]; > int s = displs[itask]; > sendcounts[itask] = nbr_elm[itask]; > > for (i = 0; i { > k = i + s; > sendbuff[itask][i] = tab[k]; > printf("%0.0f ", sendbuff[itask][i]); > } > printf("\n"); > } > > > } > > > recvcount = nbr_elm[taskid]; > > inittime = MPI_Wtime(); > > ierr = MPI_Scatterv(sendbuff[0], sendcounts, displs, MPI_DOUBLE, > recvbuff, recvcount, MPI_DOUBLE, > 0, MPI_COMM_WORLD); > > totaltime = MPI_Wtime() - inittime; > > printf("\n >>>> \n"); > buffsum = 0.0; > printf("\n > %d < \n",taskid); > for (i = 0; i { > printf("* %0.0f ", recvbuff[i]); > } > > printf("\n"); > if (taskid == 0){ > free(sendbuff[0]); > free(sendbuff); > } > else{ > free(sendbuff[0]); > free(sendbuff); > free(recvbuff); > } > > /*===============================================================*/ > /* Finalisation de MPI */ > MPI_Finalize(); > > } > > > 2014-11-28 20:53 GMT+01:00 Chafik sanaa : > Hi > when i execute this program (I use two processes) i have a error in the > display part : > * RESULT OF EXECUTION: > 0 1 2 3 4 5 6 7 8 9 > >> (1,6) << > > >>>> > > > 1 < > * -6277438562204192500000000000000000000000000000000000000000000000000 * > -6277438562204192500000000000000000000000000000000000000000000000000 * > -6277438562204192500000000000000000000000000000000000000000000000000 * > -6277438562204192500000000000000000000000000000000000000000000000000 * > -6277438562204192500000000000000000000000000000000000000000000000000 * > -6277438562204192500000000000000000000000000000000000000000000000000 > 0 1 2 3 4 5 6 7 8 9 > 0 1 2 3 > 4 5 6 7 8 9 > >> (0,4) << > > >>>> > > > 0 < > * 0 * 1 * 2 * 3 > * PROGRAM > #include > #include > #include > #include > #include "math.h" > #include "mpi.h" > > int main(int argc, char** argv) > { > int taskid, ntasks; > int ierr, i, itask; > int sendcounts[2048], displs[2048], recvcount; > double **sendbuff, *recvbuff, buffsum, buffsums[2048]; > double inittime, totaltime; > const int nbr_etat = 10; > double tab[nbr_etat]; > > for (int i = 0; i < nbr_etat; i++) > tab[i] = i; > > for (int i = 0; i < nbr_etat; i++) > printf("%0.0f ", tab[i]); > printf("\n"); > int nbr_elm[2] = { 4, 6 }; > int dpl[2] = { 0, 4 }; > > MPI_Init(&argc, &argv); > MPI_Comm_rank(MPI_COMM_WORLD, &taskid); > MPI_Comm_size(MPI_COMM_WORLD, &ntasks); > > > recvbuff = (double *)malloc(sizeof(double)*nbr_etat); > if (taskid == 0) > { > sendbuff = (double **)malloc(sizeof(double *)*ntasks);// on execute pour > deux proc > sendbuff[0] = (double 
*)malloc(sizeof(double)*ntasks*nbr_etat); > for (i = 1; i < ntasks; i++) > { > sendbuff[i] = sendbuff[i - 1] + nbr_etat; > } > } > else > { > sendbuff = (double **)malloc(sizeof(double *)* 1); > sendbuff[0] = (double *)malloc(sizeof(double)* 1); > } > > if (taskid == 0){ > > > > > srand((unsigned)time(NULL) + taskid); > for (itask = 0; itask { > int k; > displs[itask] = dpl[itask]; > int s = displs[itask]; > sendcounts[itask] = nbr_elm[itask]; > > for (i = 0; i { > k = i + s; > sendbuff[itask][i] = tab[k]; > printf("%0.0f ", sendbuff[itask][i]); > } > printf("\n"); > } > > > } > > > recvcount = nbr_elm[taskid]; > > inittime = MPI_Wtime(); > > ierr = MPI_Scatterv(sendbuff[0], sendcounts, displs, MPI_DOUBLE, > recvbuff, recvcount, MPI_DOUBLE, > 0, MPI_COMM_WORLD); > > totaltime = MPI_Wtime() - inittime; > > printf("\n >>>> \n"); > buffsum = 0.0; > printf("\n > %d < \n",taskid); > for (i = 0; i { > printf("* %0.0f ", recvbuff[i]); > } > > printf("\n"); > if (taskid == 0){ > free(sendbuff[0]); > free(sendbuff); > } > else{ > free(sendbuff[0]); > free(sendbuff); > free(recvbuff); > } > > /*===============================================================*/ > /* Finalisation de MPI */ > MPI_Finalize(); > > } > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From san.chafik at gmail.com Fri Nov 28 13:53:02 2014 From: san.chafik at gmail.com (Chafik sanaa) Date: Fri, 28 Nov 2014 20:53:02 +0100 Subject: [mpich-discuss] the SCATTERV function Message-ID: Hi when i execute this program (I use two processes) i have a error in the display part : * RESULT OF EXECUTION: 0 1 2 3 4 5 6 7 8 9 >> (1,6) << >>>> > 1 < * -6277438562204192500000000000000000000000000000000000000000000000000 * -6277438562204192500000000000000000000000000000000000000000000000000 * -6277438562204192500000000000000000000000000000000000000000000000000 * -6277438562204192500000000000000000000000000000000000000000000000000 * -6277438562204192500000000000000000000000000000000000000000000000000 * -6277438562204192500000000000000000000000000000000000000000000000000 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 >> (0,4) << >>>> > 0 < * 0 * 1 * 2 * 3 * PROGRAM #include #include #include #include #include "math.h" #include "mpi.h" int main(int argc, char** argv) { int taskid, ntasks; int ierr, i, itask; int sendcounts[2048], displs[2048], recvcount; double **sendbuff, *recvbuff, buffsum, buffsums[2048]; double inittime, totaltime; const int nbr_etat = 10; double tab[nbr_etat]; for (int i = 0; i < nbr_etat; i++) tab[i] = i; for (int i = 0; i < nbr_etat; i++) printf("%0.0f ", tab[i]); printf("\n"); int nbr_elm[2] = { 4, 6 }; int dpl[2] = { 0, 4 }; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &taskid); MPI_Comm_size(MPI_COMM_WORLD, &ntasks); recvbuff = (double *)malloc(sizeof(double)*nbr_etat); if (taskid == 0) { sendbuff = (double **)malloc(sizeof(double *)*ntasks);// on execute pour deux proc sendbuff[0] = (double *)malloc(sizeof(double)*ntasks*nbr_etat); for (i = 1; i < ntasks; i++) { sendbuff[i] = sendbuff[i - 1] + nbr_etat; } } else { sendbuff = (double **)malloc(sizeof(double *)* 1); sendbuff[0] = (double *)malloc(sizeof(double)* 1); } if (taskid == 0){ srand((unsigned)time(NULL) + taskid); for (itask = 0; itask>>> \n"); buffsum = 0.0; printf("\n > %d < \n",taskid); for (i = 0; i 
-------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Thu Nov 27 09:12:02 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Thu, 27 Nov 2014 15:12:02 +0000 Subject: [mpich-ibm] [mpich-discuss] Problem with MPICH3/OpenPA on IBM P755 In-Reply-To: <8D58A4B5E6148C419C6AD6334962375DDD9AE388@UWMBX04.uw.lu.se> References: <8D58A4B5E6148C419C6AD6334962375DDD9AE388@UWMBX04.uw.lu.se> Message-ID: <9A806065-1C4A-4FA6-9739-1DD956B921EA@anl.gov> Hmm. That's not a good sign. OPA's configure seems to think that it can use some inbuilt atomic capability while the compiler/hardware is clearly not supporting it. Can you send us the OPA config.log (src/openpa/config.log) so we can look into it? Also cc'ed the IBM folks. IBM folks, any thoughts? -- Pavan > On Nov 27, 2014, at 6:11 AM, Victor Vysotskiy wrote: > > Hi, > > I am trying to compile the latest ('v3.2a2-24-g4ad367d0') nightly snapshot on the IBM P775 machine. The MPICH3 is configured with the following options: > > export OBJECT_MODE=64 > ./configure --enable-f77 --enable-fc --enable-cxx --enable-smpcoll --with-thread-package=pthreads --with-pami=/opt/ibmhpc/pecurrent/ppe.pami --with-pami-lib=/opt/ibmhpc/pecurrent/ppe.pami/lib --with-pami-include=/opt/ibmhpc/pecurrent/ppe.pami/include64 --enable-mpe --enable-error-messages=all CC="xlc_r -q64 -qmaxmem=-1" CPP=/usr/ccs/lib/cpp CXX="xlC_r -q64 -qmaxmem=-1" F77="xlf_r -q64 -qmaxmem=-1" FC="xlf90_r -q64 -qmaxmem=-1" CFLAGS="-q64 -qmaxmem=-1" CXXFLAGS="-q64 -qmaxmem=-1" FCFLAGS="-qmaxmem=-1 -q64" FFLAGS="-q64 -qmaxmem=-1" OBJECT_MODE=64 AR="ar -X 64" --with-file-system=bg+bglockless > > No problem with compiling it, but there is a problem with running OpenPA tests: > > Testing simple integer load-linked/store-conditional functionality -SKIP- > LL/SC not available > Testing simple pointer load-linked/store-conditional functionality -SKIP- > LL/SC not available > Testing integer LL/SC ABA -SKIP- > LL/SC not available > Testing pointer LL/SC ABA -SKIP- > LL/SC not available > Testing integer load/store with 1 thread *FAILED* > at test_primitives.c:371 in test_threaded_loadstore_int()... > Testing pointer load/store with 1 thread *FAILED* > at test_primitives.c:556 in test_threaded_loadstore_ptr()... > Testing add with 1 thread *FAILED* > at test_primitives.c:750 in test_threaded_add()... > Testing incr and decr with 1 thread PASSED > Testing decr and test with 1 thread *FAILED* > at test_primitives.c:1093 in test_threaded_decr_and_test()... > Testing fetch and add with 1 thread *FAILED* > at test_primitives.c:1344 in test_threaded_faa()... > Testing fetch and add return values with 1 thread *FAILED* > at test_primitives.c:1490 in test_threaded_faa_ret()... > Testing fetch and incr/decr with 1 thread PASSED > Testing fetch and incr return values with 1 thread *FAILED* > at test_primitives.c:1796 in test_threaded_fai_ret()... > Testing fetch and decr return values with 1 thread *FAILED* > at test_primitives.c:1953 in test_threaded_fad_ret()... > Testing integer compare-and-swap with 1 thread *FAILED* > at test_primitives.c:2191 in test_threaded_cas_int()... > Testing pointer compare-and-swap with 1 thread *FAILED* > at test_primitives.c:2346 in test_threaded_cas_ptr()... > Testing grouped integer compare-and-swap with 1 thread *FAILED* > at test_primitives.c:2506 in test_grouped_cas_int()... 
> Testing grouped pointer compare-and-swap with 1 thread *FAILED* > at test_primitives.c:2683 in test_grouped_cas_ptr()... > Testing integer compare-and-swap fairness with 1 thread *FAILED* > at test_primitives.c:2890 in test_threaded_cas_int_fairness()... > Testing pointer compare-and-swap fairness with 1 thread *FAILED* > at test_primitives.c:3087 in test_threaded_cas_ptr_fairness()... > Testing integer swap with 1 thread *FAILED* > at test_primitives.c:3341 in test_threaded_swap_int()... > Testing pointer swap with 1 thread *FAILED* > at test_primitives.c:3496 in test_threaded_swap_ptr()... > Testing integer LL/SC stack -SKIP- > LL/SC not available > Testing pointer LL/SC stack -SKIP- > LL/SC not available > Testing integer LL/SC stack -SKIP- > LL/SC not available > Testing pointer LL/SC stack -SKIP- > LL/SC not available > Testing integer LL/SC stack -SKIP- > LL/SC not available > Testing pointer LL/SC stack -SKIP- > LL/SC not available > Testing integer LL/SC stack -SKIP- > LL/SC not available > Testing pointer LL/SC stack -SKIP- > LL/SC not available > ***** 16 PRIMITIVES TESTS FAILED! ***** > > Apparently, a bunch of tests failed. Is there anything to worry about? Or, I can simply skip these failed test and can use the compiled MPICH3 for production? > > The software stack used: > AIX v7.1.0.0 > XLC/XLF compiler v14.01.0000.0008 > POE v1-1.2.0.3 > > Best, > Victor. > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ ibm mailing list ibm at lists.mpich.org https://lists.mpich.org/mailman/listinfo/ibm From jan.bierbaum at tudos.org Fri Nov 28 11:07:08 2014 From: jan.bierbaum at tudos.org (Jan Bierbaum) Date: Fri, 28 Nov 2014 18:07:08 +0100 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <6F4D5A685397B940825208C64CF853A7477FD77E@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov> , <7DD38ACE-B6AE-405B-9B6E-826A4D0461C9@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F88C7@HALAS.anl.gov>, <546C6EEE.70506@tudos.org> <6F4D5A685397B940825208C64CF853A7477FD77E@HALAS.anl.gov> Message-ID: <5478ABBC.4080705@tudos.org> On 19.11.2014 16:43, Isaila, Florin D. wrote: > Thanks Jan, this works indeed for me with mpich-3.1 (by using mpif77 > compiler, I think this is what you meant). However, it does not work > with mpich-3.1.3. You're right. When I try to use MPICH 3.1.3, I get the very same problem - the profiling library is not linked into the binary. Looking at the two fortran wrapper libraries, they do indeed define different symbols: | $ nm ${MPICH_3_1}/lib/libfmpich.a | grep -i 'MPI_Init_*$' | 0000000000000000 W MPI_INIT | U MPI_Init | 0000000000000000 W mpi_init | 0000000000000000 T mpi_init_ | 0000000000000000 W mpi_init__ | $ nm ${MPICH_3_1_3}/lib/libmpifort.a | grep -i 'MPI_Init_*$' | 0000000000000000 W MPI_INIT | 0000000000000000 W PMPI_INIT | U PMPI_Init | 0000000000000000 W mpi_init | 0000000000000000 W mpi_init_ | 0000000000000000 W mpi_init__ | 0000000000000000 W pmpi_init | 0000000000000000 T pmpi_init_ | 0000000000000000 W pmpi_init__ It doesn't really make sense, though. PMPI is supposed to work by redefining the 'MPI_*' functions. 
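For reference, that interposition model amounts to the profiling library redefining the MPI_* entry points and forwarding to the PMPI_* names that the MPI library always exports; a minimal sketch in C (illustrative only, not code from MPICH or from this thread):

#include <mpi.h>
#include <stdio.h>

/* The application (or the Fortran wrapper library) calls MPI_Init; this
 * definition shadows the one in the MPI library, and the real work is
 * still done by PMPI_Init, which the MPI library also exports. */
int MPI_Init(int *argc, char ***argv)
{
    int err = PMPI_Init(argc, argv);
    fprintf(stderr, "profiler: MPI_Init intercepted\n");
    return err;
}

An mpi_init_ wrapper that calls PMPI_Init directly never reaches such a definition.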
So a Fortran wrapper should refer to these functions as the old version did; 'MPI_Init' is the only undefined reference there. The new version, however, refers to 'PMPI_Init' and thus circumvents the profiler. Is this a bug or are we missing something here? Regards, Jan _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Thu Nov 27 06:11:06 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Thu, 27 Nov 2014 12:11:06 +0000 Subject: [mpich-discuss] Problem with MPICH3/OpenPA on IBM P755 Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9AE388@UWMBX04.uw.lu.se> Hi, I am trying to compile the latest ('v3.2a2-24-g4ad367d0') nightly snapshot on the IBM P775 machine. The MPICH3 is configured with the following options: export OBJECT_MODE=64 ./configure --enable-f77 --enable-fc --enable-cxx --enable-smpcoll --with-thread-package=pthreads --with-pami=/opt/ibmhpc/pecurrent/ppe.pami --with-pami-lib=/opt/ibmhpc/pecurrent/ppe.pami/lib --with-pami-include=/opt/ibmhpc/pecurrent/ppe.pami/include64 --enable-mpe --enable-error-messages=all CC="xlc_r -q64 -qmaxmem=-1" CPP=/usr/ccs/lib/cpp CXX="xlC_r -q64 -qmaxmem=-1" F77="xlf_r -q64 -qmaxmem=-1" FC="xlf90_r -q64 -qmaxmem=-1" CFLAGS="-q64 -qmaxmem=-1" CXXFLAGS="-q64 -qmaxmem=-1" FCFLAGS="-qmaxmem=-1 -q64" FFLAGS="-q64 -qmaxmem=-1" OBJECT_MODE=64 AR="ar -X 64" --with-file-system=bg+bglockless No problem with compiling it, but there is a problem with running OpenPA tests: Testing simple integer load-linked/store-conditional functionality -SKIP- LL/SC not available Testing simple pointer load-linked/store-conditional functionality -SKIP- LL/SC not available Testing integer LL/SC ABA -SKIP- LL/SC not available Testing pointer LL/SC ABA -SKIP- LL/SC not available Testing integer load/store with 1 thread *FAILED* at test_primitives.c:371 in test_threaded_loadstore_int()... Testing pointer load/store with 1 thread *FAILED* at test_primitives.c:556 in test_threaded_loadstore_ptr()... Testing add with 1 thread *FAILED* at test_primitives.c:750 in test_threaded_add()... Testing incr and decr with 1 thread PASSED Testing decr and test with 1 thread *FAILED* at test_primitives.c:1093 in test_threaded_decr_and_test()... Testing fetch and add with 1 thread *FAILED* at test_primitives.c:1344 in test_threaded_faa()... Testing fetch and add return values with 1 thread *FAILED* at test_primitives.c:1490 in test_threaded_faa_ret()... Testing fetch and incr/decr with 1 thread PASSED Testing fetch and incr return values with 1 thread *FAILED* at test_primitives.c:1796 in test_threaded_fai_ret()... Testing fetch and decr return values with 1 thread *FAILED* at test_primitives.c:1953 in test_threaded_fad_ret()... Testing integer compare-and-swap with 1 thread *FAILED* at test_primitives.c:2191 in test_threaded_cas_int()... Testing pointer compare-and-swap with 1 thread *FAILED* at test_primitives.c:2346 in test_threaded_cas_ptr()... Testing grouped integer compare-and-swap with 1 thread *FAILED* at test_primitives.c:2506 in test_grouped_cas_int()... Testing grouped pointer compare-and-swap with 1 thread *FAILED* at test_primitives.c:2683 in test_grouped_cas_ptr()... Testing integer compare-and-swap fairness with 1 thread *FAILED* at test_primitives.c:2890 in test_threaded_cas_int_fairness()... 
Testing pointer compare-and-swap fairness with 1 thread *FAILED* at test_primitives.c:3087 in test_threaded_cas_ptr_fairness()... Testing integer swap with 1 thread *FAILED* at test_primitives.c:3341 in test_threaded_swap_int()... Testing pointer swap with 1 thread *FAILED* at test_primitives.c:3496 in test_threaded_swap_ptr()... Testing integer LL/SC stack -SKIP- LL/SC not available Testing pointer LL/SC stack -SKIP- LL/SC not available Testing integer LL/SC stack -SKIP- LL/SC not available Testing pointer LL/SC stack -SKIP- LL/SC not available Testing integer LL/SC stack -SKIP- LL/SC not available Testing pointer LL/SC stack -SKIP- LL/SC not available Testing integer LL/SC stack -SKIP- LL/SC not available Testing pointer LL/SC stack -SKIP- LL/SC not available ***** 16 PRIMITIVES TESTS FAILED! ***** Apparently, a bunch of tests failed. Is there anything to worry about? Or, I can simply skip these failed test and can use the compiled MPICH3 for production? The software stack used: AIX v7.1.0.0 XLC/XLF compiler v14.01.0000.0008 POE v1-1.2.0.3 Best, Victor. _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From fereshtehkomijani at gmail.com Thu Nov 27 04:45:24 2014 From: fereshtehkomijani at gmail.com (fereshteh komijani) Date: Thu, 27 Nov 2014 14:15:24 +0330 Subject: [mpich-discuss] mpich error In-Reply-To: <7421FA5E-9CC2-4844-9522-86B42670BBC2@anl.gov> References: <7421FA5E-9CC2-4844-9522-86B42670BBC2@anl.gov> Message-ID: dar masiri ke swan_only.in dar an hast gharar ghirid va : mpirun -np 2 ./ coawstG swan_only.in mpirun -np 2 ./ coawstM swan_only.in On Wed, Nov 26, 2014 at 5:30 PM, Bland, Wesley B. wrote: > I think the error description itself it pretty accurate. MPICH can?t find > the executable in the location you specified. You need to make sure that > the executable is in the correct place. > > On Nov 26, 2014, at 1:51 AM, ?????? > wrote: > > hi > i am using mpich 3.0.4 and it is my first time running it > and i do my run like : > > mpirun -np 2 ./Projects/g/swan_only.in > > and i receive this error . will u please help me how to solve this error? > [proxy:0:0 at nazanin-VirtualBox] HYDU_create_process > (./utils/launch/launch.c:75): execvp error on file ./Projects/g/ > swan_only.in (No such file or directory) > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = EXIT CODE: 255 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -- ***Angel*** -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From yang.shuchi at gmail.com Wed Nov 26 18:43:01 2014 From: yang.shuchi at gmail.com (Shuchi Yang) Date: Wed, 26 Nov 2014 17:43:01 -0700 Subject: [mpich-discuss] code from different version of LINUX In-Reply-To: References: Message-ID: Yes, I can get right results, but it does not always happen. It happen sometimes, sometimes it is OK. It is really weird. On Wed, Nov 26, 2014 at 2:26 PM, Junchao Zhang wrote: > Do you run your program on one node with 20 processes, and the result is > correct except the time is slower? > If yes, then it seems only one CPU of the node is used, which is weird. > > > > > --Junchao Zhang > > On Wed, Nov 26, 2014 at 1:22 PM, Shuchi Yang > wrote: > >> I am doing CFD simulation. >> With MPI, I can split the computational domain to different parts so that >> each process works on different part. In this case, we can reduce the total >> computational time. >> >> When I tried to run at another different system, it looks all the cpu are >> working on the whole computational domain. It looks like every CPU is >> working on the whole computational domain so that the computational >> efficiency is very low. >> >> >> >> On Wed, Nov 26, 2014 at 11:45 AM, Junchao Zhang >> wrote: >> >>> I guess it is not an MPI problem. When you say "every CPU works on all >>> the data", you need a clear idea of what is the data decomposition in your >>> code. >>> >>> >>> --Junchao Zhang >>> >>> On Wed, Nov 26, 2014 at 11:28 AM, Shuchi Yang >>> wrote: >>> >>>> Thanks for your reply. I am trying it in the way you mentioned. >>>> But I met one problem is that on my original machine, I can run the >>>> code with 20 CPUs so that each CPU works on part of the job. But at the new >>>> machine, it starts the process with 20 CPUS, but every CPU works on all the >>>> data, so that it looks like it is running 20 times the job at same time. Is >>>> that because of MPI problem? >>>> Thanks, >>>> >>>> Shuchi >>>> >>>> On Wed, Nov 26, 2014 at 9:50 AM, Junchao Zhang >>>> wrote: >>>> >>>>> You can copy MPI libraries installed on the Ubuntu machine to the Suse >>>>> machine, then add that path to LD_LIBRARY_PATH on the Suse. >>>>> >>>>> --Junchao Zhang >>>>> >>>>> On Wed, Nov 26, 2014 at 9:55 AM, Shuchi Yang >>>>> wrote: >>>>> >>>>>> I met some problem. >>>>>> The question is that >>>>>> I compile a fortran code at ubuntu, but I need run the code at Suse >>>>>> Linux, I was always told >>>>>> * error while loading shared libraries: libmpifort.so.12: cannot >>>>>> open shared object file: No such file or directory* >>>>>> >>>>>> Furthermore, will this be a problem, I mean, if I compile the code >>>>>> with mpich-gcc and run the code at another type of Linux? 
>>>>>> >>>>>> Thanks, >>>>>> >>>>>> Shuchi >>>>>> >>>>>> _______________________________________________ >>>>>> discuss mailing list discuss at mpich.org >>>>>> To manage subscription options or unsubscribe: >>>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list discuss at mpich.org >>>>> To manage subscription options or unsubscribe: >>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>> >>>> >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Wed Nov 26 17:27:48 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Wed, 26 Nov 2014 17:27:48 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> <409CBA24-04C9-471E-B855-F30AABF103DB@anl.gov> <5D5D143D-A28F-404E-A8F9-017019338A1E@anl.gov> <5475F0F2.9040709@mcs.anl.gov> Message-ID: I have no idea. You may try to trace all events as said at http://wiki.mpich.org/mpich/index.php/Debug_Event_Logging From ahassani at cis.uab.edu Wed Nov 26 16:25:05 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Wed, 26 Nov 2014 16:25:05 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: <5475F0F2.9040709@mcs.anl.gov> References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> <409CBA24-04C9-471E-B855-F30AABF103DB@anl.gov> <5D5D143D-A28F-404E-A8F9-017019338A1E@anl.gov> <5475F0F2.9040709@mcs.anl.gov> Message-ID: I disabled the whole firewall in those machines but, still get the same problem. connection refuse. I run the program in another set of totally different machines that we have, but still same problem. Any other thought where can be the problem? Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Wed, Nov 26, 2014 at 9:25 AM, Kenneth Raffenetti wrote: > The connection refused makes me think a firewall is getting in the way. Is > TCP communication limited to specific ports on the cluster? If so, you can > use this envvar to enforce a range of ports in MPICH. > > MPIR_CVAR_CH3_PORT_RANGE > Description: The MPIR_CVAR_CH3_PORT_RANGE environment variable allows > you to specify the range of TCP ports to be used by the process manager and > the MPICH library. The format of this variable is :. To specify > any available port, use 0:0. 
> Default: {0,0} > > > On 11/25/2014 11:50 PM, Amin Hassani wrote: > >> Tried with the new configure too. same problem :( >> >> $ mpirun -hostfile hosts-hydra -np 2 test_dup >> Fatal error in MPI_Send: Unknown error class, error stack: >> MPI_Send(174)..............: MPI_Send(buf=0x7fffd90c76c8, count=1, >> MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed >> MPID_nem_tcp_connpoll(1832): Communication error with rank 1: Connection >> refused >> >> ============================================================ >> ======================= >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 5459 RUNNING AT oakmnt-0-a >> = EXIT CODE: 1 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> ============================================================ >> ======================= >> [proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb >> (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) >> failed >> [proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event >> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback >> returned error status >> [proxy:0:1 at oakmnt-0-b] main >> (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error >> waiting for event >> [mpiexec at oakmnt-0-a] HYDT_bscu_wait_for_completion >> (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of >> the processes terminated badly; aborting >> [mpiexec at oakmnt-0-a] HYDT_bsci_wait_for_completion >> (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher >> returned error waiting for completion >> [mpiexec at oakmnt-0-a] HYD_pmci_wait_for_completion >> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher >> returned error waiting for completion >> [mpiexec at oakmnt-0-a] main >> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error >> waiting for completion >> >> >> Amin Hassani, >> CIS department at UAB, >> Birmingham, AL, USA. >> >> On Tue, Nov 25, 2014 at 11:44 PM, Lu, Huiwei > > wrote: >> >> So the error only happens when there is communication. >> >> It may be caused by IB as your guessed before. Could you try to >> reconfigure MPICH using "./configure --with-device=ch3:nemesis:tcp? >> and try again? >> >> ? >> Huiwei >> >> > On Nov 25, 2014, at 11:23 PM, Amin Hassani > > wrote: >> > >> > Yes it works. >> > output: >> > >> > $ mpirun -hostfile hosts-hydra -np 2 test >> > rank 1 >> > rank 0 >> > >> > >> > Amin Hassani, >> > CIS department at UAB, >> > Birmingham, AL, USA. >> > >> > On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei >> > wrote: >> > Could you try to run the following simple code to see if it works? >> > >> > #include >> > #include >> > int main(int argc, char** argv) >> > { >> > int rank, size; >> > MPI_Init(&argc, &argv); >> > MPI_Comm_rank(MPI_COMM_WORLD, &rank); >> > printf("rank %d\n", rank); >> > MPI_Finalize(); >> > return 0; >> > } >> > >> > ? >> > Huiwei >> > >> > > On Nov 25, 2014, at 11:11 PM, Amin Hassani >> > wrote: >> > > >> > > No, I checked. Also I always install my MPI's in >> /nethome/students/ahassani/usr/mpi. I never install them in >> /nethome/students/ahassani/usr. So MPI files will never get there. >> Even if put the /usr/mpi/bin in front of /usr/bin, it won't affect >> anything. There has never been any mpi installed in /usr/bin. >> > > >> > > Thank you. 
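As an aside (a standard MPI facility, not something proposed in this thread): while hunting a failure like the "Connection refused" above, switching MPI_COMM_WORLD to MPI_ERRORS_RETURN lets the program keep control after the failing call and print the error string itself instead of being terminated. A minimal sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, err, len, buf = 42;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    /* Return errors to the caller instead of aborting the whole job
     * (the default handler is MPI_ERRORS_ARE_FATAL). */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        err = MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) {
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "MPI_Send failed: %s\n", msg);
        }
    } else if (rank == 1) {
        /* Note: if the send fails, this receive will simply wait;
         * the sketch only aims to surface the sender-side error text. */
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with the same hostfile, it exercises a single send between the two hosts.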
>> > > _______________________________________________ >> > > discuss mailing list discuss at mpich.org > > >> > > To manage subscription options or unsubscribe: >> > > https://lists.mpich.org/mailman/listinfo/discuss >> > >> > _______________________________________________ >> > discuss mailing list discuss at mpich.org >> > To manage subscription options or unsubscribe: >> > https://lists.mpich.org/mailman/listinfo/discuss >> > >> > _______________________________________________ >> > discuss mailing list discuss at mpich.org >> > To manage subscription options or unsubscribe: >> > https://lists.mpich.org/mailman/listinfo/discuss >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> >> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> >> _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Wed Nov 26 15:26:34 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Wed, 26 Nov 2014 15:26:34 -0600 Subject: [mpich-discuss] code from different version of LINUX In-Reply-To: References: Message-ID: Do you run your program on one node with 20 processes, and the result is correct except the time is slower? If yes, then it seems only one CPU of the node is used, which is weird. --Junchao Zhang On Wed, Nov 26, 2014 at 1:22 PM, Shuchi Yang wrote: > I am doing CFD simulation. > With MPI, I can split the computational domain to different parts so that > each process works on different part. In this case, we can reduce the total > computational time. > > When I tried to run at another different system, it looks all the cpu are > working on the whole computational domain. It looks like every CPU is > working on the whole computational domain so that the computational > efficiency is very low. > > > > On Wed, Nov 26, 2014 at 11:45 AM, Junchao Zhang > wrote: > >> I guess it is not an MPI problem. When you say "every CPU works on all >> the data", you need a clear idea of what is the data decomposition in your >> code. >> >> >> --Junchao Zhang >> >> On Wed, Nov 26, 2014 at 11:28 AM, Shuchi Yang >> wrote: >> >>> Thanks for your reply. I am trying it in the way you mentioned. >>> But I met one problem is that on my original machine, I can run the code >>> with 20 CPUs so that each CPU works on part of the job. But at the new >>> machine, it starts the process with 20 CPUS, but every CPU works on all the >>> data, so that it looks like it is running 20 times the job at same time. Is >>> that because of MPI problem? >>> Thanks, >>> >>> Shuchi >>> >>> On Wed, Nov 26, 2014 at 9:50 AM, Junchao Zhang >>> wrote: >>> >>>> You can copy MPI libraries installed on the Ubuntu machine to the Suse >>>> machine, then add that path to LD_LIBRARY_PATH on the Suse. 
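A sketch of Junchao's suggestion for the libmpifort.so.12 error, assuming the Ubuntu-built MPICH libraries have been copied to /opt/mpich-ubuntu/lib on the Suse machine (that path, and ./my_cfd_code as the executable name, are placeholders, not values from this thread):

    # on the Suse machine: see which shared libraries fail to resolve
    ldd ./my_cfd_code | grep "not found"
    # point the dynamic loader at the copied MPICH libraries
    export LD_LIBRARY_PATH=/opt/mpich-ubuntu/lib:$LD_LIBRARY_PATH
    ldd ./my_cfd_code | grep libmpifort    # should now resolve to the copied .so

ldd and LD_LIBRARY_PATH are standard glibc facilities, so this check works the same way on both distributions.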
>>>> >>>> --Junchao Zhang >>>> >>>> On Wed, Nov 26, 2014 at 9:55 AM, Shuchi Yang >>>> wrote: >>>> >>>>> I met some problem. >>>>> The question is that >>>>> I compile a fortran code at ubuntu, but I need run the code at Suse >>>>> Linux, I was always told >>>>> * error while loading shared libraries: libmpifort.so.12: cannot >>>>> open shared object file: No such file or directory* >>>>> >>>>> Furthermore, will this be a problem, I mean, if I compile the code >>>>> with mpich-gcc and run the code at another type of Linux? >>>>> >>>>> Thanks, >>>>> >>>>> Shuchi >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list discuss at mpich.org >>>>> To manage subscription options or unsubscribe: >>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>> >>>> >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From raffenet at mcs.anl.gov Wed Nov 26 13:21:27 2014 From: raffenet at mcs.anl.gov (Kenneth Raffenetti) Date: Wed, 26 Nov 2014 13:21:27 -0600 Subject: [mpich-discuss] code from different version of LINUX In-Reply-To: References: Message-ID: <54762837.8020005@mcs.anl.gov> It sounds like MPI is not being properly initialized and all of your processes are running as if they are the only rank. I would suggest making sure your are not mixing an mpiexec/mpirun from another MPI package with your MPICH library. A safer bet would be to copy over the mpiexec your built with MPICH and ensure it is the first one found in your PATH. Ken On 11/26/2014 11:28 AM, Shuchi Yang wrote: > Thanks for your reply. I am trying it in the way you mentioned. > But I met one problem is that on my original machine, I can run the code > with 20 CPUs so that each CPU works on part of the job. But at the new > machine, it starts the process with 20 CPUS, but every CPU works on all > the data, so that it looks like it is running 20 times the job at same > time. Is that because of MPI problem? > Thanks, > > Shuchi > > On Wed, Nov 26, 2014 at 9:50 AM, Junchao Zhang > wrote: > > You can copy MPI libraries installed on the Ubuntu machine to the > Suse machine, then add that path to LD_LIBRARY_PATH on the Suse. > > --Junchao Zhang > > On Wed, Nov 26, 2014 at 9:55 AM, Shuchi Yang > wrote: > > I met some problem. 
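Ken's point above about mixing an mpiexec/mpirun from another MPI package with the MPICH library can be checked directly; a sketch, with /opt/mpich standing in for the actual MPICH install prefix:

    which mpiexec        # should point into the MPICH install, not another MPI
    mpiexec --version    # should report an MPICH/HYDRA build
    mpicc -show          # shows the compile/link line, i.e. which libmpi is used
    # if a foreign mpiexec is found first, put the MPICH one ahead of it:
    export PATH=/opt/mpich/bin:$PATH

When the launcher and the library do not match, each process typically comes up as if it were rank 0 of a size-1 MPI_COMM_WORLD, which is consistent with the symptom described above of every process working on the whole computational domain.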
> The question is that > I compile a fortran code at ubuntu, but I need run the code at > Suse Linux, I was always told > / error while loading shared libraries: libmpifort.so.12: > cannot open shared object file: No such file or directory/ > > Furthermore, will this be a problem, I mean, if I compile the > code with mpich-gcc and run the code at another type of Linux? > > Thanks, > > Shuchi > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From yang.shuchi at gmail.com Wed Nov 26 13:22:57 2014 From: yang.shuchi at gmail.com (Shuchi Yang) Date: Wed, 26 Nov 2014 12:22:57 -0700 Subject: [mpich-discuss] code from different version of LINUX In-Reply-To: References: Message-ID: I am doing CFD simulation. With MPI, I can split the computational domain to different parts so that each process works on different part. In this case, we can reduce the total computational time. When I tried to run at another different system, it looks all the cpu are working on the whole computational domain. It looks like every CPU is working on the whole computational domain so that the computational efficiency is very low. On Wed, Nov 26, 2014 at 11:45 AM, Junchao Zhang wrote: > I guess it is not an MPI problem. When you say "every CPU works on all > the data", you need a clear idea of what is the data decomposition in your > code. > > > --Junchao Zhang > > On Wed, Nov 26, 2014 at 11:28 AM, Shuchi Yang > wrote: > >> Thanks for your reply. I am trying it in the way you mentioned. >> But I met one problem is that on my original machine, I can run the code >> with 20 CPUs so that each CPU works on part of the job. But at the new >> machine, it starts the process with 20 CPUS, but every CPU works on all the >> data, so that it looks like it is running 20 times the job at same time. Is >> that because of MPI problem? >> Thanks, >> >> Shuchi >> >> On Wed, Nov 26, 2014 at 9:50 AM, Junchao Zhang >> wrote: >> >>> You can copy MPI libraries installed on the Ubuntu machine to the Suse >>> machine, then add that path to LD_LIBRARY_PATH on the Suse. >>> >>> --Junchao Zhang >>> >>> On Wed, Nov 26, 2014 at 9:55 AM, Shuchi Yang >>> wrote: >>> >>>> I met some problem. >>>> The question is that >>>> I compile a fortran code at ubuntu, but I need run the code at Suse >>>> Linux, I was always told >>>> * error while loading shared libraries: libmpifort.so.12: cannot >>>> open shared object file: No such file or directory* >>>> >>>> Furthermore, will this be a problem, I mean, if I compile the code with >>>> mpich-gcc and run the code at another type of Linux? 
>>>> >>>> Thanks, >>>> >>>> Shuchi >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From yang.shuchi at gmail.com Wed Nov 26 11:28:43 2014 From: yang.shuchi at gmail.com (Shuchi Yang) Date: Wed, 26 Nov 2014 10:28:43 -0700 Subject: [mpich-discuss] code from different version of LINUX In-Reply-To: References: Message-ID: Thanks for your reply. I am trying it in the way you mentioned. But I met one problem is that on my original machine, I can run the code with 20 CPUs so that each CPU works on part of the job. But at the new machine, it starts the process with 20 CPUS, but every CPU works on all the data, so that it looks like it is running 20 times the job at same time. Is that because of MPI problem? Thanks, Shuchi On Wed, Nov 26, 2014 at 9:50 AM, Junchao Zhang wrote: > You can copy MPI libraries installed on the Ubuntu machine to the Suse > machine, then add that path to LD_LIBRARY_PATH on the Suse. > > --Junchao Zhang > > On Wed, Nov 26, 2014 at 9:55 AM, Shuchi Yang > wrote: > >> I met some problem. >> The question is that >> I compile a fortran code at ubuntu, but I need run the code at Suse >> Linux, I was always told >> * error while loading shared libraries: libmpifort.so.12: cannot open >> shared object file: No such file or directory* >> >> Furthermore, will this be a problem, I mean, if I compile the code with >> mpich-gcc and run the code at another type of Linux? >> >> Thanks, >> >> Shuchi >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Wed Nov 26 12:45:51 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Wed, 26 Nov 2014 12:45:51 -0600 Subject: [mpich-discuss] code from different version of LINUX In-Reply-To: References: Message-ID: I guess it is not an MPI problem. 
When you say "every CPU works on all the data", you need a clear idea of what is the data decomposition in your code. --Junchao Zhang On Wed, Nov 26, 2014 at 11:28 AM, Shuchi Yang wrote: > Thanks for your reply. I am trying it in the way you mentioned. > But I met one problem is that on my original machine, I can run the code > with 20 CPUs so that each CPU works on part of the job. But at the new > machine, it starts the process with 20 CPUS, but every CPU works on all the > data, so that it looks like it is running 20 times the job at same time. Is > that because of MPI problem? > Thanks, > > Shuchi > > On Wed, Nov 26, 2014 at 9:50 AM, Junchao Zhang > wrote: > >> You can copy MPI libraries installed on the Ubuntu machine to the Suse >> machine, then add that path to LD_LIBRARY_PATH on the Suse. >> >> --Junchao Zhang >> >> On Wed, Nov 26, 2014 at 9:55 AM, Shuchi Yang >> wrote: >> >>> I met some problem. >>> The question is that >>> I compile a fortran code at ubuntu, but I need run the code at Suse >>> Linux, I was always told >>> * error while loading shared libraries: libmpifort.so.12: cannot open >>> shared object file: No such file or directory* >>> >>> Furthermore, will this be a problem, I mean, if I compile the code with >>> mpich-gcc and run the code at another type of Linux? >>> >>> Thanks, >>> >>> Shuchi >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From yang.shuchi at gmail.com Wed Nov 26 09:55:52 2014 From: yang.shuchi at gmail.com (Shuchi Yang) Date: Wed, 26 Nov 2014 08:55:52 -0700 Subject: [mpich-discuss] code from different version of LINUX Message-ID: I met some problem. The question is that I compile a fortran code at ubuntu, but I need run the code at Suse Linux, I was always told * error while loading shared libraries: libmpifort.so.12: cannot open shared object file: No such file or directory* Furthermore, will this be a problem, I mean, if I compile the code with mpich-gcc and run the code at another type of Linux? Thanks, Shuchi -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Wed Nov 26 10:50:15 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Wed, 26 Nov 2014 10:50:15 -0600 Subject: [mpich-discuss] code from different version of LINUX In-Reply-To: References: Message-ID: You can copy MPI libraries installed on the Ubuntu machine to the Suse machine, then add that path to LD_LIBRARY_PATH on the Suse. 
--Junchao Zhang On Wed, Nov 26, 2014 at 9:55 AM, Shuchi Yang wrote: > I met some problem. > The question is that > I compile a fortran code at ubuntu, but I need run the code at Suse Linux, > I was always told > * error while loading shared libraries: libmpifort.so.12: cannot open > shared object file: No such file or directory* > > Furthermore, will this be a problem, I mean, if I compile the code with > mpich-gcc and run the code at another type of Linux? > > Thanks, > > Shuchi > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From raffenet at mcs.anl.gov Wed Nov 26 09:25:38 2014 From: raffenet at mcs.anl.gov (Kenneth Raffenetti) Date: Wed, 26 Nov 2014 09:25:38 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> <409CBA24-04C9-471E-B855-F30AABF103DB@anl.gov> <5D5D143D-A28F-404E-A8F9-017019338A1E@anl.gov> Message-ID: <5475F0F2.9040709@mcs.anl.gov> The connection refused makes me think a firewall is getting in the way. Is TCP communication limited to specific ports on the cluster? If so, you can use this envvar to enforce a range of ports in MPICH. MPIR_CVAR_CH3_PORT_RANGE Description: The MPIR_CVAR_CH3_PORT_RANGE environment variable allows you to specify the range of TCP ports to be used by the process manager and the MPICH library. The format of this variable is :. To specify any available port, use 0:0. Default: {0,0} On 11/25/2014 11:50 PM, Amin Hassani wrote: > Tried with the new configure too. 
same problem :( > > $ mpirun -hostfile hosts-hydra -np 2 test_dup > Fatal error in MPI_Send: Unknown error class, error stack: > MPI_Send(174)..............: MPI_Send(buf=0x7fffd90c76c8, count=1, > MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed > MPID_nem_tcp_connpoll(1832): Communication error with rank 1: Connection > refused > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 5459 RUNNING AT oakmnt-0-a > = EXIT CODE: 1 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > =================================================================================== > [proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb > (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed > [proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event > (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback > returned error status > [proxy:0:1 at oakmnt-0-b] main > (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error > waiting for event > [mpiexec at oakmnt-0-a] HYDT_bscu_wait_for_completion > (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of > the processes terminated badly; aborting > [mpiexec at oakmnt-0-a] HYDT_bsci_wait_for_completion > (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher > returned error waiting for completion > [mpiexec at oakmnt-0-a] HYD_pmci_wait_for_completion > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher > returned error waiting for completion > [mpiexec at oakmnt-0-a] main > (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error > waiting for completion > > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 11:44 PM, Lu, Huiwei > wrote: > > So the error only happens when there is communication. > > It may be caused by IB as your guessed before. Could you try to > reconfigure MPICH using "./configure --with-device=ch3:nemesis:tcp? > and try again? > > ? > Huiwei > > > On Nov 25, 2014, at 11:23 PM, Amin Hassani > wrote: > > > > Yes it works. > > output: > > > > $ mpirun -hostfile hosts-hydra -np 2 test > > rank 1 > > rank 0 > > > > > > Amin Hassani, > > CIS department at UAB, > > Birmingham, AL, USA. > > > > On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei > > wrote: > > Could you try to run the following simple code to see if it works? > > > > #include > > #include > > int main(int argc, char** argv) > > { > > int rank, size; > > MPI_Init(&argc, &argv); > > MPI_Comm_rank(MPI_COMM_WORLD, &rank); > > printf("rank %d\n", rank); > > MPI_Finalize(); > > return 0; > > } > > > > ? > > Huiwei > > > > > On Nov 25, 2014, at 11:11 PM, Amin Hassani > > wrote: > > > > > > No, I checked. Also I always install my MPI's in > /nethome/students/ahassani/usr/mpi. I never install them in > /nethome/students/ahassani/usr. So MPI files will never get there. > Even if put the /usr/mpi/bin in front of /usr/bin, it won't affect > anything. There has never been any mpi installed in /usr/bin. > > > > > > Thank you. 
> > > _______________________________________________ > > > discuss mailing list discuss at mpich.org > > > To manage subscription options or unsubscribe: > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From nazanin.mirshokraei at gmail.com Wed Nov 26 00:51:57 2014 From: nazanin.mirshokraei at gmail.com (=?UTF-8?B?2YbYp9iy2YbbjNmG?=) Date: Tue, 25 Nov 2014 22:51:57 -0800 Subject: [mpich-discuss] mpich error Message-ID: hi i am using mpich 3.0.4 and it is my first time running it and i do my run like : mpirun -np 2 ./Projects/g/swan_only.in and i receive this error . will u please help me how to solve this error? [proxy:0:0 at nazanin-VirtualBox] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file ./Projects/g/swan_only.in (No such file or directory) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = EXIT CODE: 255 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From wbland at anl.gov Wed Nov 26 08:00:32 2014 From: wbland at anl.gov (Bland, Wesley B.) Date: Wed, 26 Nov 2014 14:00:32 +0000 Subject: [mpich-discuss] mpich error In-Reply-To: References: Message-ID: <7421FA5E-9CC2-4844-9522-86B42670BBC2@anl.gov> I think the error description itself it pretty accurate. MPICH can?t find the executable in the location you specified. You need to make sure that the executable is in the correct place. On Nov 26, 2014, at 1:51 AM, ?????? > wrote: hi i am using mpich 3.0.4 and it is my first time running it and i do my run like : mpirun -np 2 ./Projects/g/swan_only.in and i receive this error . will u please help me how to solve this error? 
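A sketch of how that execvp error might be narrowed down from the shell; ./model below is a placeholder for whatever compiled executable the build actually produced, not a name taken from this thread:

    ls -l ./Projects/g/                # is the path spelled correctly on this node?
    file ./Projects/g/swan_only.in     # a text input file is not an executable
    # mpirun's first non-option argument must be a program; an input file
    # is normally passed as an argument to that program instead:
    mpirun -np 2 ./model ./Projects/g/swan_only.in

The message comes from hydra's launcher calling execvp on the named file, so "No such file or directory" means the path is wrong or the file is not an executable program on that node.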
[proxy:0:0 at nazanin-VirtualBox] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file ./Projects/g/swan_only.in (No such file or directory) =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = EXIT CODE: 255 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 23:50:22 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 23:50:22 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: <5D5D143D-A28F-404E-A8F9-017019338A1E@anl.gov> References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> <409CBA24-04C9-471E-B855-F30AABF103DB@anl.gov> <5D5D143D-A28F-404E-A8F9-017019338A1E@anl.gov> Message-ID: Tried with the new configure too. same problem :( $ mpirun -hostfile hosts-hydra -np 2 test_dup Fatal error in MPI_Send: Unknown error class, error stack: MPI_Send(174)..............: MPI_Send(buf=0x7fffd90c76c8, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed MPID_nem_tcp_connpoll(1832): Communication error with rank 1: Connection refused =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 5459 RUNNING AT oakmnt-0-a = EXIT CODE: 1 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== [proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed [proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [proxy:0:1 at oakmnt-0-b] main (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event [mpiexec at oakmnt-0-a] HYDT_bscu_wait_for_completion (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting [mpiexec at oakmnt-0-a] HYDT_bsci_wait_for_completion (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion [mpiexec at oakmnt-0-a] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion [mpiexec at oakmnt-0-a] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 11:44 PM, Lu, Huiwei wrote: > So the error only happens when there is communication. > > It may be caused by IB as your guessed before. Could you try to > reconfigure MPICH using "./configure --with-device=ch3:nemesis:tcp? and try > again? > > ? > Huiwei > > > On Nov 25, 2014, at 11:23 PM, Amin Hassani wrote: > > > > Yes it works. 
> > output: > > > > $ mpirun -hostfile hosts-hydra -np 2 test > > rank 1 > > rank 0 > > > > > > Amin Hassani, > > CIS department at UAB, > > Birmingham, AL, USA. > > > > On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei > wrote: > > Could you try to run the following simple code to see if it works? > > > > #include > > #include > > int main(int argc, char** argv) > > { > > int rank, size; > > MPI_Init(&argc, &argv); > > MPI_Comm_rank(MPI_COMM_WORLD, &rank); > > printf("rank %d\n", rank); > > MPI_Finalize(); > > return 0; > > } > > > > ? > > Huiwei > > > > > On Nov 25, 2014, at 11:11 PM, Amin Hassani > wrote: > > > > > > No, I checked. Also I always install my MPI's in > /nethome/students/ahassani/usr/mpi. I never install them in > /nethome/students/ahassani/usr. So MPI files will never get there. Even if > put the /usr/mpi/bin in front of /usr/bin, it won't affect anything. There > has never been any mpi installed in /usr/bin. > > > > > > Thank you. > > > _______________________________________________ > > > discuss mailing list discuss at mpich.org > > > To manage subscription options or unsubscribe: > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Tue Nov 25 23:44:10 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Wed, 26 Nov 2014 05:44:10 +0000 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> <409CBA24-04C9-471E-B855-F30AABF103DB@anl.gov> Message-ID: <5D5D143D-A28F-404E-A8F9-017019338A1E@anl.gov> So the error only happens when there is communication. It may be caused by IB as your guessed before. Could you try to reconfigure MPICH using "./configure --with-device=ch3:nemesis:tcp? and try again? ? Huiwei > On Nov 25, 2014, at 11:23 PM, Amin Hassani wrote: > > Yes it works. > output: > > $ mpirun -hostfile hosts-hydra -np 2 test > rank 1 > rank 0 > > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei wrote: > Could you try to run the following simple code to see if it works? > > #include > #include > int main(int argc, char** argv) > { > int rank, size; > MPI_Init(&argc, &argv); > MPI_Comm_rank(MPI_COMM_WORLD, &rank); > printf("rank %d\n", rank); > MPI_Finalize(); > return 0; > } > > ? > Huiwei > > > On Nov 25, 2014, at 11:11 PM, Amin Hassani wrote: > > > > No, I checked. Also I always install my MPI's in /nethome/students/ahassani/usr/mpi. I never install them in /nethome/students/ahassani/usr. So MPI files will never get there. 
Even if put the /usr/mpi/bin in front of /usr/bin, it won't affect anything. There has never been any mpi installed in /usr/bin. > > > > Thank you. > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Tue Nov 25 23:20:27 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Wed, 26 Nov 2014 05:20:27 +0000 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> Message-ID: <409CBA24-04C9-471E-B855-F30AABF103DB@anl.gov> Could you try to run the following simple code to see if it works? #include #include int main(int argc, char** argv) { int rank, size; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); printf("rank %d\n", rank); MPI_Finalize(); return 0; } ? Huiwei > On Nov 25, 2014, at 11:11 PM, Amin Hassani wrote: > > No, I checked. Also I always install my MPI's in /nethome/students/ahassani/usr/mpi. I never install them in /nethome/students/ahassani/usr. So MPI files will never get there. Even if put the /usr/mpi/bin in front of /usr/bin, it won't affect anything. There has never been any mpi installed in /usr/bin. > > Thank you. > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 23:23:13 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 23:23:13 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: <409CBA24-04C9-471E-B855-F30AABF103DB@anl.gov> References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> <409CBA24-04C9-471E-B855-F30AABF103DB@anl.gov> Message-ID: Yes it works. output: $ mpirun -hostfile hosts-hydra -np 2 test rank 1 rank 0 Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei wrote: > Could you try to run the following simple code to see if it works? > > #include > #include > int main(int argc, char** argv) > { > int rank, size; > MPI_Init(&argc, &argv); > MPI_Comm_rank(MPI_COMM_WORLD, &rank); > printf("rank %d\n", rank); > MPI_Finalize(); > return 0; > } > > ? > Huiwei > > > On Nov 25, 2014, at 11:11 PM, Amin Hassani wrote: > > > > No, I checked. Also I always install my MPI's in > /nethome/students/ahassani/usr/mpi. I never install them in > /nethome/students/ahassani/usr. So MPI files will never get there. Even if > put the /usr/mpi/bin in front of /usr/bin, it won't affect anything. 
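The #include lines in the quoted test program lost their header names when the message was archived; a complete, compilable version of the same check looks like this:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank within MPI_COMM_WORLD */
        printf("rank %d\n", rank);
        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched as in the thread (mpirun -hostfile hosts-hydra -np 2 ./test), it should print one "rank N" line per process. It only exercises startup, not point-to-point traffic, which is why it can succeed while the send/recv test still fails.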
There > has never been any mpi installed in /usr/bin. > > > > Thank you. > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 23:11:10 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 23:11:10 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> Message-ID: No, I checked. Also I always install my MPI's in /nethome/students/ahassani/usr/mpi. I never install them in /nethome/students/ahassani/usr. So MPI files will never get there. Even if put the /usr/mpi/bin in front of /usr/bin, it won't affect anything. There has never been any mpi installed in /usr/bin. Thank you. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Tue Nov 25 23:08:05 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Wed, 26 Nov 2014 05:08:05 +0000 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> Message-ID: <3787C969-C5DF-4881-AD0C-FF8C9F9D910C@anl.gov> You may try to put /nethome/students/ahassani/usr/mpi/lib and /nethome/students/ahassani/usr/mpi/bin to the very front of LD_LIBRARY_PATH and PATH. ? Huiwei > On Nov 25, 2014, at 11:06 PM, Lu, Huiwei wrote: > > Is there a chance that some old mpi libraries sits in /nethome/students/ahassani/usr/lib? > Or some old mpirun sits in /nethome/students/ahassani/usr/bin? > > ? > Huiwei > >> On Nov 25, 2014, at 10:58 PM, Amin Hassani wrote: >> >> >> Here you go! 
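A sketch of Huiwei's suggestion above to put the MPICH directories at the very front of the search paths, using the install prefix shown in this thread:

    export PATH=/nethome/students/ahassani/usr/mpi/bin:$PATH
    export LD_LIBRARY_PATH=/nethome/students/ahassani/usr/mpi/lib:$LD_LIBRARY_PATH

For a multi-node run these exports also have to take effect in non-interactive shells (for example via ~/.bashrc), since hydra starts its remote proxies over ssh.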
>> >> host machine: >> ~{ahassani at vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~ >> $ echo $LD_LIBRARY_PATH >> /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib: >> ~{ahassani at vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~ >> $ echo $PATH >> /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin:/opt/matlab-R2013a/bin >> >> oakmnt-0-a: >> $ echo $LD_LIBRARY_PATH >> /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib: >> ~{ahassani at oakmnt-0-a:~/usr/bin}~{Tue Nov 25 10:56 PM}~ >> $ echo $PATH >> /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin >> >> oakmnt-0-b: >> ~{ahassani at oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~ >> $ echo $LD_LIBRARY_PATH >> /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib: >> ~{ahassani at oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~ >> $ echo $PATH >> /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin >> >> >> Amin Hassani, >> CIS department at UAB, >> Birmingham, AL, USA. >> >> On Tue, Nov 25, 2014 at 10:55 PM, Lu, Huiwei wrote: >> So your ssh connection is correct. And we confirmed the code itself is correct before. The problem may be somewhere else. >> >> Could you check the PATH and LD_LIBRARY_PATH on these three machines (oakmnt-0-a, oakmnt-0-b, and the host machine) to make sure they are the same? So that mpirun is using the same library on these machines. >> >> ? >> Huiwei >> >>> On Nov 25, 2014, at 10:33 PM, Amin Hassani wrote: >>> >>> Here you go! >>> >>> $ mpirun -hostfile hosts-hydra -np 2 hostname >>> oakmnt-0-a >>> oakmnt-0-b >>> >>> Thanks. >>> >>> Amin Hassani, >>> CIS department at UAB, >>> Birmingham, AL, USA. >>> >>> On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei wrote: >>> I can run your simplest code on my machine without a problem. So I guess there is some problem in cluster connection. Could you give me the output of the following? >>> >>> $ mpirun -hostfile hosts-hydra -np 2 hostname >>> >>> ? >>> Huiwei >>> >>>> On Nov 25, 2014, at 10:24 PM, Amin Hassani wrote: >>>> >>>> Hi, >>>> >>>> the code that I gave you had more stuff in it that I didn't want to distract you. here is the simpler send/recv test that I just ran and it failed. 
>>>> >>>> which mpirun: specific directory that I install my MPIs >>>> /nethome/students/ahassani/usr/mpi/bin/mpirun >>>> >>>> mpirun with no argument: >>>> $ mpirun >>>> [mpiexec at oakmnt-0-a] set_default_values (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided >>>> [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values failed >>>> [mpiexec at oakmnt-0-a] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters >>>> >>>> >>>> >>>> #include >>>> #include >>>> #include >>>> #include >>>> #include >>>> >>>> int skip = 10; >>>> int iter = 30; >>>> >>>> int main(int argc, char** argv) >>>> { >>>> int rank, size; >>>> int i, j, k; >>>> double t1, t2; >>>> int rc; >>>> >>>> MPI_Init(&argc, &argv); >>>> MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >>>> MPI_Comm_rank(world, &rank); >>>> MPI_Comm_size(world, &size); >>>> int a = 0, b = 1; >>>> if(rank == 0){ >>>> MPI_Send(&a, 1, MPI_INT, 1, 0, world); >>>> }else{ >>>> MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE); >>>> } >>>> >>>> printf("b is %d\n", b); >>>> MPI_Finalize(); >>>> >>>> return 0; >>>> } >>>> >>>> Thank you. >>>> >>>> >>>> Amin Hassani, >>>> CIS department at UAB, >>>> Birmingham, AL, USA. >>>> >>>> On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei wrote: >>>> Hi, Amin, >>>> >>>> Could you quickly give us the output of the following command: "which mpirun" >>>> >>>> Also, your simplest code couldn?t compile correctly: "error: ?t_avg? undeclared (first use in this function)?. Can you fix it? >>>> >>>> ? >>>> Huiwei >>>> >>>>> On Nov 25, 2014, at 2:58 PM, Amin Hassani wrote: >>>>> >>>>> This is the simplest code I have that doesn't run. >>>>> >>>>> >>>>> #include >>>>> #include >>>>> #include >>>>> #include >>>>> #include >>>>> >>>>> int main(int argc, char** argv) >>>>> { >>>>> int rank, size; >>>>> int i, j, k; >>>>> double t1, t2; >>>>> int rc; >>>>> >>>>> MPI_Init(&argc, &argv); >>>>> MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >>>>> MPI_Comm_rank(world, &rank); >>>>> MPI_Comm_size(world, &size); >>>>> >>>>> t2 = 1; >>>>> MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); >>>>> t_avg = t_avg / size; >>>>> >>>>> MPI_Finalize(); >>>>> >>>>> return 0; >>>>> }? >>>>> >>>>> Amin Hassani, >>>>> CIS department at UAB, >>>>> Birmingham, AL, USA. >>>>> >>>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" wrote: >>>>> >>>>> Hi Amin, >>>>> >>>>> Can you share with us a minimal piece of code with which you can reproduce this issue? >>>>> >>>>> Thanks, >>>>> Antonio >>>>> >>>>> >>>>> >>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote: >>>>>> Hi, >>>>>> >>>>>> I am having problem running MPICH, on multiple nodes. When I run an multiple MPI processes on one node, it totally works, but when I try to run on multiple nodes, it fails with the error below. >>>>>> My machines have Debian OS, Both infiniband and TCP interconnects. I'm guessing it has something do to with the TCP network, but I can run openmpi on these machines with no problem. But for some reason I cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there are some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution. >>>>>> >>>>>> ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. >>>>>> >>>>>> ?my host file (hosts-hydra) is something like this: >>>>>> oakmnt-0-a:1 >>>>>> oakmnt-0-b:1 ? 
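The MPI_Allreduce fragment quoted above does not compile because t_avg is never declared, as Huiwei points out; a corrected version of the same fragment (keeping the surrounding world and size variables from that code) would be:

    double t2 = 1.0, t_avg = 0.0;
    /* sum t2 across all ranks, then average over the communicator size */
    MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
    t_avg = t_avg / size;

This only fixes the compile error; it does not change the communication pattern that triggers the helper_fns.c assertion reported earlier in the thread.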
>>>>>> >>>>>> ?I get this error:? >>>>>> >>>>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup >>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag >>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag >>>>>> internal ABORT - process 1 >>>>>> internal ABORT - process 0 >>>>>> >>>>>> =================================================================================== >>>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >>>>>> = PID 30744 RUNNING AT oakmnt-0-b >>>>>> = EXIT CODE: 1 >>>>>> = CLEANING UP REMAINING PROCESSES >>>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >>>>>> =================================================================================== >>>>>> [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor) >>>>>> [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy >>>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status >>>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event >>>>>> [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion >>>>>> >>>>>> Thanks. >>>>>> Amin Hassani, >>>>>> CIS department at UAB, >>>>>> Birmingham, AL, USA. >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> discuss mailing list >>>>>> discuss at mpich.org >>>>>> >>>>>> To manage subscription options or unsubscribe: >>>>>> >>>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>> >>>>> >>>>> -- >>>>> Antonio J. Pe?a >>>>> Postdoctoral Appointee >>>>> Mathematics and Computer Science Division >>>>> Argonne National Laboratory >>>>> 9700 South Cass Avenue, Bldg. 240, Of. 
3148 >>>>> Argonne, IL 60439-4847 >>>>> >>>>> apenya at mcs.anl.gov >>>>> www.mcs.anl.gov/~apenya >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list discuss at mpich.org >>>>> To manage subscription options or unsubscribe: >>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list discuss at mpich.org >>>>> To manage subscription options or unsubscribe: >>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Tue Nov 25 23:06:36 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Wed, 26 Nov 2014 05:06:36 +0000 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> Message-ID: Is there a chance that some old mpi libraries sits in /nethome/students/ahassani/usr/lib? Or some old mpirun sits in /nethome/students/ahassani/usr/bin? ? Huiwei > On Nov 25, 2014, at 10:58 PM, Amin Hassani wrote: > > > Here you go! 
> > host machine: > ~{ahassani at vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~ > $ echo $LD_LIBRARY_PATH > /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib: > ~{ahassani at vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~ > $ echo $PATH > /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin:/opt/matlab-R2013a/bin > > oakmnt-0-a: > $ echo $LD_LIBRARY_PATH > /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib: > ~{ahassani at oakmnt-0-a:~/usr/bin}~{Tue Nov 25 10:56 PM}~ > $ echo $PATH > /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin > > oakmnt-0-b: > ~{ahassani at oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~ > $ echo $LD_LIBRARY_PATH > /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib: > ~{ahassani at oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~ > $ echo $PATH > /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin > > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 10:55 PM, Lu, Huiwei wrote: > So your ssh connection is correct. And we confirmed the code itself is correct before. The problem may be somewhere else. > > Could you check the PATH and LD_LIBRARY_PATH on these three machines (oakmnt-0-a, oakmnt-0-b, and the host machine) to make sure they are the same? So that mpirun is using the same library on these machines. > > ? > Huiwei > > > On Nov 25, 2014, at 10:33 PM, Amin Hassani wrote: > > > > Here you go! > > > > $ mpirun -hostfile hosts-hydra -np 2 hostname > > oakmnt-0-a > > oakmnt-0-b > > > > Thanks. > > > > Amin Hassani, > > CIS department at UAB, > > Birmingham, AL, USA. > > > > On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei wrote: > > I can run your simplest code on my machine without a problem. So I guess there is some problem in cluster connection. Could you give me the output of the following? > > > > $ mpirun -hostfile hosts-hydra -np 2 hostname > > > > ? > > Huiwei > > > > > On Nov 25, 2014, at 10:24 PM, Amin Hassani wrote: > > > > > > Hi, > > > > > > the code that I gave you had more stuff in it that I didn't want to distract you. here is the simpler send/recv test that I just ran and it failed. 
> > > > > > which mpirun: specific directory that I install my MPIs > > > /nethome/students/ahassani/usr/mpi/bin/mpirun > > > > > > mpirun with no argument: > > > $ mpirun > > > [mpiexec at oakmnt-0-a] set_default_values (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided > > > [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values failed > > > [mpiexec at oakmnt-0-a] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters > > > > > > > > > > > > #include > > > #include > > > #include > > > #include > > > #include > > > > > > int skip = 10; > > > int iter = 30; > > > > > > int main(int argc, char** argv) > > > { > > > int rank, size; > > > int i, j, k; > > > double t1, t2; > > > int rc; > > > > > > MPI_Init(&argc, &argv); > > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > > MPI_Comm_rank(world, &rank); > > > MPI_Comm_size(world, &size); > > > int a = 0, b = 1; > > > if(rank == 0){ > > > MPI_Send(&a, 1, MPI_INT, 1, 0, world); > > > }else{ > > > MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE); > > > } > > > > > > printf("b is %d\n", b); > > > MPI_Finalize(); > > > > > > return 0; > > > } > > > > > > Thank you. > > > > > > > > > Amin Hassani, > > > CIS department at UAB, > > > Birmingham, AL, USA. > > > > > > On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei wrote: > > > Hi, Amin, > > > > > > Could you quickly give us the output of the following command: "which mpirun" > > > > > > Also, your simplest code couldn?t compile correctly: "error: ?t_avg? undeclared (first use in this function)?. Can you fix it? > > > > > > ? > > > Huiwei > > > > > > > On Nov 25, 2014, at 2:58 PM, Amin Hassani wrote: > > > > > > > > This is the simplest code I have that doesn't run. > > > > > > > > > > > > #include > > > > #include > > > > #include > > > > #include > > > > #include > > > > > > > > int main(int argc, char** argv) > > > > { > > > > int rank, size; > > > > int i, j, k; > > > > double t1, t2; > > > > int rc; > > > > > > > > MPI_Init(&argc, &argv); > > > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > > > MPI_Comm_rank(world, &rank); > > > > MPI_Comm_size(world, &size); > > > > > > > > t2 = 1; > > > > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); > > > > t_avg = t_avg / size; > > > > > > > > MPI_Finalize(); > > > > > > > > return 0; > > > > }? > > > > > > > > Amin Hassani, > > > > CIS department at UAB, > > > > Birmingham, AL, USA. > > > > > > > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" wrote: > > > > > > > > Hi Amin, > > > > > > > > Can you share with us a minimal piece of code with which you can reproduce this issue? > > > > > > > > Thanks, > > > > Antonio > > > > > > > > > > > > > > > > On 11/25/2014 12:52 PM, Amin Hassani wrote: > > > >> Hi, > > > >> > > > >> I am having problem running MPICH, on multiple nodes. When I run an multiple MPI processes on one node, it totally works, but when I try to run on multiple nodes, it fails with the error below. > > > >> My machines have Debian OS, Both infiniband and TCP interconnects. I'm guessing it has something do to with the TCP network, but I can run openmpi on these machines with no problem. But for some reason I cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there are some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution. 
> > > >> > > > >> ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. > > > >> > > > >> ?my host file (hosts-hydra) is something like this: > > > >> oakmnt-0-a:1 > > > >> oakmnt-0-b:1 ? > > > >> > > > >> ?I get this error:? > > > >> > > > >> $ mpirun -hostfile hosts-hydra -np 2 test_dup > > > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag > > > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag > > > >> internal ABORT - process 1 > > > >> internal ABORT - process 0 > > > >> > > > >> =================================================================================== > > > >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > > > >> = PID 30744 RUNNING AT oakmnt-0-b > > > >> = EXIT CODE: 1 > > > >> = CLEANING UP REMAINING PROCESSES > > > >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > > >> =================================================================================== > > > >> [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor) > > > >> [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy > > > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status > > > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event > > > >> [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion > > > >> > > > >> Thanks. > > > >> Amin Hassani, > > > >> CIS department at UAB, > > > >> Birmingham, AL, USA. > > > >> > > > >> > > > >> _______________________________________________ > > > >> discuss mailing list > > > >> discuss at mpich.org > > > >> > > > >> To manage subscription options or unsubscribe: > > > >> > > > >> https://lists.mpich.org/mailman/listinfo/discuss > > > > > > > > > > > > -- > > > > Antonio J. Pe?a > > > > Postdoctoral Appointee > > > > Mathematics and Computer Science Division > > > > Argonne National Laboratory > > > > 9700 South Cass Avenue, Bldg. 240, Of. 
3148 > > > > Argonne, IL 60439-4847 > > > > > > > > apenya at mcs.anl.gov > > > > www.mcs.anl.gov/~apenya > > > > > > > > _______________________________________________ > > > > discuss mailing list discuss at mpich.org > > > > To manage subscription options or unsubscribe: > > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > > > > > _______________________________________________ > > > > discuss mailing list discuss at mpich.org > > > > To manage subscription options or unsubscribe: > > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > > > _______________________________________________ > > > discuss mailing list discuss at mpich.org > > > To manage subscription options or unsubscribe: > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > > > _______________________________________________ > > > discuss mailing list discuss at mpich.org > > > To manage subscription options or unsubscribe: > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 22:58:51 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 22:58:51 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> Message-ID: Here you go! host machine: ~{ahassani at vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~ $ echo $LD_LIBRARY_PATH /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib: ~{ahassani at vulcan13:~/usr/bin}~{Tue Nov 25 10:56 PM}~ $ echo $PATH /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin:/opt/matlab-R2013a/bin oakmnt-0-a: $ echo $LD_LIBRARY_PATH /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib: ~{ahassani at oakmnt-0-a:~/usr/bin}~{Tue Nov 25 10:56 PM}~ $ echo $PATH /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin oakmnt-0-b: ~{ahassani at oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~ $ echo $LD_LIBRARY_PATH /nethome/students/ahassani/usr/lib:/nethome/students/ahassani/usr/mpi/lib: ~{ahassani at oakmnt-0-b:~}~{Tue Nov 25 10:56 PM}~ $ echo $PATH /nethome/students/ahassani/usr/bin:/nethome/students/ahassani/usr/mpi/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/sbin:/usr/sbin:/usr/local/sbin Amin Hassani, CIS department at UAB, Birmingham, AL, USA. 
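The PATH and LD_LIBRARY_PATH listings above look consistent, but the helper_fns.c assertion (status->MPI_TAG == recvtag) is the kind of failure that can appear when the two nodes end up resolving different MPI installations at run time, for example an MPICH mpirun launching a binary linked against a different MPI. One quick cross-check, sketched here on the assumption that test_dup sits at the same path on both nodes, is to let Hydra run the standard which and ldd tools on every host:

$ mpirun -hostfile hosts-hydra -np 2 which mpirun
$ mpirun -hostfile hosts-hydra -np 2 sh -c 'ldd ./test_dup | grep -i mpi'

If either command prints a different path (or a libmpi from another MPI stack) on oakmnt-0-a than on oakmnt-0-b, that mismatch would be the first thing to fix.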
On Tue, Nov 25, 2014 at 10:55 PM, Lu, Huiwei wrote: > So your ssh connection is correct. And we confirmed the code itself is > correct before. The problem may be somewhere else. > > Could you check the PATH and LD_LIBRARY_PATH on these three machines > (oakmnt-0-a, oakmnt-0-b, and the host machine) to make sure they are the > same? So that mpirun is using the same library on these machines. > > ? > Huiwei > > > On Nov 25, 2014, at 10:33 PM, Amin Hassani wrote: > > > > Here you go! > > > > $ mpirun -hostfile hosts-hydra -np 2 hostname > > oakmnt-0-a > > oakmnt-0-b > > > > Thanks. > > > > Amin Hassani, > > CIS department at UAB, > > Birmingham, AL, USA. > > > > On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei > wrote: > > I can run your simplest code on my machine without a problem. So I guess > there is some problem in cluster connection. Could you give me the output > of the following? > > > > $ mpirun -hostfile hosts-hydra -np 2 hostname > > > > ? > > Huiwei > > > > > On Nov 25, 2014, at 10:24 PM, Amin Hassani > wrote: > > > > > > Hi, > > > > > > the code that I gave you had more stuff in it that I didn't want to > distract you. here is the simpler send/recv test that I just ran and it > failed. > > > > > > which mpirun: specific directory that I install my MPIs > > > /nethome/students/ahassani/usr/mpi/bin/mpirun > > > > > > mpirun with no argument: > > > $ mpirun > > > [mpiexec at oakmnt-0-a] set_default_values > (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided > > > [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters > (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values > failed > > > [mpiexec at oakmnt-0-a] main > (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters > > > > > > > > > > > > #include > > > #include > > > #include > > > #include > > > #include > > > > > > int skip = 10; > > > int iter = 30; > > > > > > int main(int argc, char** argv) > > > { > > > int rank, size; > > > int i, j, k; > > > double t1, t2; > > > int rc; > > > > > > MPI_Init(&argc, &argv); > > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > > MPI_Comm_rank(world, &rank); > > > MPI_Comm_size(world, &size); > > > int a = 0, b = 1; > > > if(rank == 0){ > > > MPI_Send(&a, 1, MPI_INT, 1, 0, world); > > > }else{ > > > MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE); > > > } > > > > > > printf("b is %d\n", b); > > > MPI_Finalize(); > > > > > > return 0; > > > } > > > > > > Thank you. > > > > > > > > > Amin Hassani, > > > CIS department at UAB, > > > Birmingham, AL, USA. > > > > > > On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei > wrote: > > > Hi, Amin, > > > > > > Could you quickly give us the output of the following command: "which > mpirun" > > > > > > Also, your simplest code couldn?t compile correctly: "error: ?t_avg? > undeclared (first use in this function)?. Can you fix it? > > > > > > ? > > > Huiwei > > > > > > > On Nov 25, 2014, at 2:58 PM, Amin Hassani > wrote: > > > > > > > > This is the simplest code I have that doesn't run. 
> > > > > > > > > > > > #include > > > > #include > > > > #include > > > > #include > > > > #include > > > > > > > > int main(int argc, char** argv) > > > > { > > > > int rank, size; > > > > int i, j, k; > > > > double t1, t2; > > > > int rc; > > > > > > > > MPI_Init(&argc, &argv); > > > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > > > MPI_Comm_rank(world, &rank); > > > > MPI_Comm_size(world, &size); > > > > > > > > t2 = 1; > > > > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); > > > > t_avg = t_avg / size; > > > > > > > > MPI_Finalize(); > > > > > > > > return 0; > > > > }? > > > > > > > > Amin Hassani, > > > > CIS department at UAB, > > > > Birmingham, AL, USA. > > > > > > > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" < > apenya at mcs.anl.gov> wrote: > > > > > > > > Hi Amin, > > > > > > > > Can you share with us a minimal piece of code with which you can > reproduce this issue? > > > > > > > > Thanks, > > > > Antonio > > > > > > > > > > > > > > > > On 11/25/2014 12:52 PM, Amin Hassani wrote: > > > >> Hi, > > > >> > > > >> I am having problem running MPICH, on multiple nodes. When I run an > multiple MPI processes on one node, it totally works, but when I try to run > on multiple nodes, it fails with the error below. > > > >> My machines have Debian OS, Both infiniband and TCP interconnects. > I'm guessing it has something do to with the TCP network, but I can run > openmpi on these machines with no problem. But for some reason I cannot run > MPICH on multiple nodes. Please let me know if more info is needed from my > side. I'm guessing there are some configuration that I am missing. I used > MPICH 3.1.3 for this test. I googled this problem but couldn't find any > solution. > > > >> > > > >> ?In my MPI program, I am doing a simple allreduce over > MPI_COMM_WORLD?. > > > >> > > > >> ?my host file (hosts-hydra) is something like this: > > > >> oakmnt-0-a:1 > > > >> oakmnt-0-b:1 ? > > > >> > > > >> ?I get this error:? > > > >> > > > >> $ mpirun -hostfile hosts-hydra -np 2 test_dup > > > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: > status->MPI_TAG == recvtag > > > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: > status->MPI_TAG == recvtag > > > >> internal ABORT - process 1 > > > >> internal ABORT - process 0 > > > >> > > > >> > =================================================================================== > > > >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > > > >> = PID 30744 RUNNING AT oakmnt-0-b > > > >> = EXIT CODE: 1 > > > >> = CLEANING UP REMAINING PROCESSES > > > >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > > >> > =================================================================================== > > > >> [mpiexec at vulcan13] HYDU_sock_read > (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file > descriptor) > > > >> [mpiexec at vulcan13] control_cb > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read > command from proxy > > > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event > (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned > error status > > > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for > event > > > >> [mpiexec at vulcan13] main > (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error > waiting for completion > > > >> > > > >> Thanks. 
> > > >> Amin Hassani, > > > >> CIS department at UAB, > > > >> Birmingham, AL, USA. > > > >> > > > >> > > > >> _______________________________________________ > > > >> discuss mailing list > > > >> discuss at mpich.org > > > >> > > > >> To manage subscription options or unsubscribe: > > > >> > > > >> https://lists.mpich.org/mailman/listinfo/discuss > > > > > > > > > > > > -- > > > > Antonio J. Pe?a > > > > Postdoctoral Appointee > > > > Mathematics and Computer Science Division > > > > Argonne National Laboratory > > > > 9700 South Cass Avenue, Bldg. 240, Of. 3148 > > > > Argonne, IL 60439-4847 > > > > > > > > apenya at mcs.anl.gov > > > > www.mcs.anl.gov/~apenya > > > > > > > > _______________________________________________ > > > > discuss mailing list discuss at mpich.org > > > > To manage subscription options or unsubscribe: > > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > > > > > _______________________________________________ > > > > discuss mailing list discuss at mpich.org > > > > To manage subscription options or unsubscribe: > > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > > > _______________________________________________ > > > discuss mailing list discuss at mpich.org > > > To manage subscription options or unsubscribe: > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > > > _______________________________________________ > > > discuss mailing list discuss at mpich.org > > > To manage subscription options or unsubscribe: > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Tue Nov 25 22:55:18 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Wed, 26 Nov 2014 04:55:18 +0000 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> Message-ID: So your ssh connection is correct. And we confirmed the code itself is correct before. The problem may be somewhere else. Could you check the PATH and LD_LIBRARY_PATH on these three machines (oakmnt-0-a, oakmnt-0-b, and the host machine) to make sure they are the same? So that mpirun is using the same library on these machines. ? Huiwei > On Nov 25, 2014, at 10:33 PM, Amin Hassani wrote: > > Here you go! > > $ mpirun -hostfile hosts-hydra -np 2 hostname > oakmnt-0-a > oakmnt-0-b > > Thanks. > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei wrote: > I can run your simplest code on my machine without a problem. So I guess there is some problem in cluster connection. 
Could you give me the output of the following? > > $ mpirun -hostfile hosts-hydra -np 2 hostname > > ? > Huiwei > > > On Nov 25, 2014, at 10:24 PM, Amin Hassani wrote: > > > > Hi, > > > > the code that I gave you had more stuff in it that I didn't want to distract you. here is the simpler send/recv test that I just ran and it failed. > > > > which mpirun: specific directory that I install my MPIs > > /nethome/students/ahassani/usr/mpi/bin/mpirun > > > > mpirun with no argument: > > $ mpirun > > [mpiexec at oakmnt-0-a] set_default_values (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided > > [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values failed > > [mpiexec at oakmnt-0-a] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters > > > > > > > > #include > > #include > > #include > > #include > > #include > > > > int skip = 10; > > int iter = 30; > > > > int main(int argc, char** argv) > > { > > int rank, size; > > int i, j, k; > > double t1, t2; > > int rc; > > > > MPI_Init(&argc, &argv); > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > MPI_Comm_rank(world, &rank); > > MPI_Comm_size(world, &size); > > int a = 0, b = 1; > > if(rank == 0){ > > MPI_Send(&a, 1, MPI_INT, 1, 0, world); > > }else{ > > MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE); > > } > > > > printf("b is %d\n", b); > > MPI_Finalize(); > > > > return 0; > > } > > > > Thank you. > > > > > > Amin Hassani, > > CIS department at UAB, > > Birmingham, AL, USA. > > > > On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei wrote: > > Hi, Amin, > > > > Could you quickly give us the output of the following command: "which mpirun" > > > > Also, your simplest code couldn?t compile correctly: "error: ?t_avg? undeclared (first use in this function)?. Can you fix it? > > > > ? > > Huiwei > > > > > On Nov 25, 2014, at 2:58 PM, Amin Hassani wrote: > > > > > > This is the simplest code I have that doesn't run. > > > > > > > > > #include > > > #include > > > #include > > > #include > > > #include > > > > > > int main(int argc, char** argv) > > > { > > > int rank, size; > > > int i, j, k; > > > double t1, t2; > > > int rc; > > > > > > MPI_Init(&argc, &argv); > > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > > MPI_Comm_rank(world, &rank); > > > MPI_Comm_size(world, &size); > > > > > > t2 = 1; > > > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); > > > t_avg = t_avg / size; > > > > > > MPI_Finalize(); > > > > > > return 0; > > > }? > > > > > > Amin Hassani, > > > CIS department at UAB, > > > Birmingham, AL, USA. > > > > > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" wrote: > > > > > > Hi Amin, > > > > > > Can you share with us a minimal piece of code with which you can reproduce this issue? > > > > > > Thanks, > > > Antonio > > > > > > > > > > > > On 11/25/2014 12:52 PM, Amin Hassani wrote: > > >> Hi, > > >> > > >> I am having problem running MPICH, on multiple nodes. When I run an multiple MPI processes on one node, it totally works, but when I try to run on multiple nodes, it fails with the error below. > > >> My machines have Debian OS, Both infiniband and TCP interconnects. I'm guessing it has something do to with the TCP network, but I can run openmpi on these machines with no problem. But for some reason I cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there are some configuration that I am missing. 
I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution. > > >> > > >> ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. > > >> > > >> ?my host file (hosts-hydra) is something like this: > > >> oakmnt-0-a:1 > > >> oakmnt-0-b:1 ? > > >> > > >> ?I get this error:? > > >> > > >> $ mpirun -hostfile hosts-hydra -np 2 test_dup > > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag > > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag > > >> internal ABORT - process 1 > > >> internal ABORT - process 0 > > >> > > >> =================================================================================== > > >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > > >> = PID 30744 RUNNING AT oakmnt-0-b > > >> = EXIT CODE: 1 > > >> = CLEANING UP REMAINING PROCESSES > > >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > >> =================================================================================== > > >> [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor) > > >> [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy > > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status > > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event > > >> [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion > > >> > > >> Thanks. > > >> Amin Hassani, > > >> CIS department at UAB, > > >> Birmingham, AL, USA. > > >> > > >> > > >> _______________________________________________ > > >> discuss mailing list > > >> discuss at mpich.org > > >> > > >> To manage subscription options or unsubscribe: > > >> > > >> https://lists.mpich.org/mailman/listinfo/discuss > > > > > > > > > -- > > > Antonio J. Pe?a > > > Postdoctoral Appointee > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > 9700 South Cass Avenue, Bldg. 240, Of. 
3148 > > > Argonne, IL 60439-4847 > > > > > > apenya at mcs.anl.gov > > > www.mcs.anl.gov/~apenya > > > > > > _______________________________________________ > > > discuss mailing list discuss at mpich.org > > > To manage subscription options or unsubscribe: > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > > > _______________________________________________ > > > discuss mailing list discuss at mpich.org > > > To manage subscription options or unsubscribe: > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 22:40:00 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 22:40:00 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: <76E33806-4840-4F5D-AAAA-E5EB8C42F93A@anl.gov> References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> <76E33806-4840-4F5D-AAAA-E5EB8C42F93A@anl.gov> Message-ID: No, not at all. I have removed openmpi on these machines. they only have mpich. But I was wondering what transport layer MPICH chooses by default. These machines have both infiniband and tcp on them. My guess is that it is trying to run on infiniband, and for some reason fails. Do you know how can I force MPICH to only use TCP? Thanks Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 10:36 PM, Bland, Wesley B. wrote: > Any chance you're using MPICH on one side and Open MPI on the other? You > can get some weird situations when mixing the two. > > > > On Nov 25, 2014, at 11:33 PM, Amin Hassani wrote: > > Here you go! > > $ mpirun -hostfile hosts-hydra -np 2 hostname > oakmnt-0-a > oakmnt-0-b > > Thanks. > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei wrote: > >> I can run your simplest code on my machine without a problem. So I guess >> there is some problem in cluster connection. Could you give me the output >> of the following? >> >> $ mpirun -hostfile hosts-hydra -np 2 hostname >> >> ? >> Huiwei >> >> > On Nov 25, 2014, at 10:24 PM, Amin Hassani >> wrote: >> > >> > Hi, >> > >> > the code that I gave you had more stuff in it that I didn't want to >> distract you. here is the simpler send/recv test that I just ran and it >> failed. 
>> > >> > which mpirun: specific directory that I install my MPIs >> > /nethome/students/ahassani/usr/mpi/bin/mpirun >> > >> > mpirun with no argument: >> > $ mpirun >> > [mpiexec at oakmnt-0-a] set_default_values >> (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided >> > [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters >> (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values >> failed >> > [mpiexec at oakmnt-0-a] main >> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters >> > >> > >> > >> > #include >> > #include >> > #include >> > #include >> > #include >> > >> > int skip = 10; >> > int iter = 30; >> > >> > int main(int argc, char** argv) >> > { >> > int rank, size; >> > int i, j, k; >> > double t1, t2; >> > int rc; >> > >> > MPI_Init(&argc, &argv); >> > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >> > MPI_Comm_rank(world, &rank); >> > MPI_Comm_size(world, &size); >> > int a = 0, b = 1; >> > if(rank == 0){ >> > MPI_Send(&a, 1, MPI_INT, 1, 0, world); >> > }else{ >> > MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE); >> > } >> > >> > printf("b is %d\n", b); >> > MPI_Finalize(); >> > >> > return 0; >> > } >> > >> > Thank you. >> > >> > >> > Amin Hassani, >> > CIS department at UAB, >> > Birmingham, AL, USA. >> > >> > On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei >> wrote: >> > Hi, Amin, >> > >> > Could you quickly give us the output of the following command: "which >> mpirun" >> > >> > Also, your simplest code couldn?t compile correctly: "error: ?t_avg? >> undeclared (first use in this function)?. Can you fix it? >> > >> > ? >> > Huiwei >> > >> > > On Nov 25, 2014, at 2:58 PM, Amin Hassani >> wrote: >> > > >> > > This is the simplest code I have that doesn't run. >> > > >> > > >> > > #include >> > > #include >> > > #include >> > > #include >> > > #include >> > > >> > > int main(int argc, char** argv) >> > > { >> > > int rank, size; >> > > int i, j, k; >> > > double t1, t2; >> > > int rc; >> > > >> > > MPI_Init(&argc, &argv); >> > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >> > > MPI_Comm_rank(world, &rank); >> > > MPI_Comm_size(world, &size); >> > > >> > > t2 = 1; >> > > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); >> > > t_avg = t_avg / size; >> > > >> > > MPI_Finalize(); >> > > >> > > return 0; >> > > }? >> > > >> > > Amin Hassani, >> > > CIS department at UAB, >> > > Birmingham, AL, USA. >> > > >> > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" < >> apenya at mcs.anl.gov> wrote: >> > > >> > > Hi Amin, >> > > >> > > Can you share with us a minimal piece of code with which you can >> reproduce this issue? >> > > >> > > Thanks, >> > > Antonio >> > > >> > > >> > > >> > > On 11/25/2014 12:52 PM, Amin Hassani wrote: >> > >> Hi, >> > >> >> > >> I am having problem running MPICH, on multiple nodes. When I run an >> multiple MPI processes on one node, it totally works, but when I try to run >> on multiple nodes, it fails with the error below. >> > >> My machines have Debian OS, Both infiniband and TCP interconnects. >> I'm guessing it has something do to with the TCP network, but I can run >> openmpi on these machines with no problem. But for some reason I cannot run >> MPICH on multiple nodes. Please let me know if more info is needed from my >> side. I'm guessing there are some configuration that I am missing. I used >> MPICH 3.1.3 for this test. I googled this problem but couldn't find any >> solution. 
>> > >> >> > >> ?In my MPI program, I am doing a simple allreduce over >> MPI_COMM_WORLD?. >> > >> >> > >> ?my host file (hosts-hydra) is something like this: >> > >> oakmnt-0-a:1 >> > >> oakmnt-0-b:1 ? >> > >> >> > >> ?I get this error:? >> > >> >> > >> $ mpirun -hostfile hosts-hydra -np 2 test_dup >> > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >> status->MPI_TAG == recvtag >> > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >> status->MPI_TAG == recvtag >> > >> internal ABORT - process 1 >> > >> internal ABORT - process 0 >> > >> >> > >> >> =================================================================================== >> > >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> > >> = PID 30744 RUNNING AT oakmnt-0-b >> > >> = EXIT CODE: 1 >> > >> = CLEANING UP REMAINING PROCESSES >> > >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> > >> >> =================================================================================== >> > >> [mpiexec at vulcan13] HYDU_sock_read >> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file >> descriptor) >> > >> [mpiexec at vulcan13] control_cb >> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read >> command from proxy >> > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event >> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned >> error status >> > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion >> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for >> event >> > >> [mpiexec at vulcan13] main >> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error >> waiting for completion >> > >> >> > >> Thanks. >> > >> Amin Hassani, >> > >> CIS department at UAB, >> > >> Birmingham, AL, USA. >> > >> >> > >> >> > >> _______________________________________________ >> > >> discuss mailing list >> > >> discuss at mpich.org >> > >> >> > >> To manage subscription options or unsubscribe: >> > >> >> > >> https://lists.mpich.org/mailman/listinfo/discuss >> > > >> > > >> > > -- >> > > Antonio J. Pe?a >> > > Postdoctoral Appointee >> > > Mathematics and Computer Science Division >> > > Argonne National Laboratory >> > > 9700 South Cass Avenue, Bldg. 240, Of. 
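On the question above about forcing MPICH onto TCP: with MPICH 3.1.x the network module is fixed when the library is configured, and a stock build already defaults to the tcp netmod of ch3:nemesis, so InfiniBand should only be in play if it was asked for at configure time. A sketch of how to make this explicit, with the prefix path and interface name being assumptions rather than anything verified on these machines:

./configure --prefix=$HOME/usr/mpi --with-device=ch3:nemesis:tcp
make && make install

and then, to pin Hydra to the Ethernet side rather than IPoIB:

$ mpirun -iface eth0 -hostfile hosts-hydra -np 2 ./test_dup

Substitute for eth0 whatever interface name the TCP network actually uses on oakmnt-0-a and oakmnt-0-b.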
3148 >> > > Argonne, IL 60439-4847 >> > > >> > > apenya at mcs.anl.gov >> > > www.mcs.anl.gov/~apenya >> > > >> > > _______________________________________________ >> > > discuss mailing list discuss at mpich.org >> > > To manage subscription options or unsubscribe: >> > > https://lists.mpich.org/mailman/listinfo/discuss >> > > >> > > _______________________________________________ >> > > discuss mailing list discuss at mpich.org >> > > To manage subscription options or unsubscribe: >> > > https://lists.mpich.org/mailman/listinfo/discuss >> > >> > _______________________________________________ >> > discuss mailing list discuss at mpich.org >> > To manage subscription options or unsubscribe: >> > https://lists.mpich.org/mailman/listinfo/discuss >> > >> > _______________________________________________ >> > discuss mailing list discuss at mpich.org >> > To manage subscription options or unsubscribe: >> > https://lists.mpich.org/mailman/listinfo/discuss >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From wbland at anl.gov Tue Nov 25 22:36:23 2014 From: wbland at anl.gov (Bland, Wesley B.) Date: Wed, 26 Nov 2014 04:36:23 +0000 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov>, Message-ID: <76E33806-4840-4F5D-AAAA-E5EB8C42F93A@anl.gov> Any chance you're using MPICH on one side and Open MPI on the other? You can get some weird situations when mixing the two. On Nov 25, 2014, at 11:33 PM, Amin Hassani > wrote: Here you go! $ mpirun -hostfile hosts-hydra -np 2 hostname oakmnt-0-a oakmnt-0-b Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei > wrote: I can run your simplest code on my machine without a problem. So I guess there is some problem in cluster connection. Could you give me the output of the following? $ mpirun -hostfile hosts-hydra -np 2 hostname ? Huiwei > On Nov 25, 2014, at 10:24 PM, Amin Hassani > wrote: > > Hi, > > the code that I gave you had more stuff in it that I didn't want to distract you. here is the simpler send/recv test that I just ran and it failed. 
> > which mpirun: specific directory that I install my MPIs > /nethome/students/ahassani/usr/mpi/bin/mpirun > > mpirun with no argument: > $ mpirun > [mpiexec at oakmnt-0-a] set_default_values (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided > [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values failed > [mpiexec at oakmnt-0-a] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters > > > > #include > #include > #include > #include > #include > > int skip = 10; > int iter = 30; > > int main(int argc, char** argv) > { > int rank, size; > int i, j, k; > double t1, t2; > int rc; > > MPI_Init(&argc, &argv); > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > MPI_Comm_rank(world, &rank); > MPI_Comm_size(world, &size); > int a = 0, b = 1; > if(rank == 0){ > MPI_Send(&a, 1, MPI_INT, 1, 0, world); > }else{ > MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE); > } > > printf("b is %d\n", b); > MPI_Finalize(); > > return 0; > } > > Thank you. > > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei > wrote: > Hi, Amin, > > Could you quickly give us the output of the following command: "which mpirun" > > Also, your simplest code couldn?t compile correctly: "error: ?t_avg? undeclared (first use in this function)?. Can you fix it? > > ? > Huiwei > > > On Nov 25, 2014, at 2:58 PM, Amin Hassani > wrote: > > > > This is the simplest code I have that doesn't run. > > > > > > #include > > #include > > #include > > #include > > #include > > > > int main(int argc, char** argv) > > { > > int rank, size; > > int i, j, k; > > double t1, t2; > > int rc; > > > > MPI_Init(&argc, &argv); > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > MPI_Comm_rank(world, &rank); > > MPI_Comm_size(world, &size); > > > > t2 = 1; > > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); > > t_avg = t_avg / size; > > > > MPI_Finalize(); > > > > return 0; > > }? > > > > Amin Hassani, > > CIS department at UAB, > > Birmingham, AL, USA. > > > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" > wrote: > > > > Hi Amin, > > > > Can you share with us a minimal piece of code with which you can reproduce this issue? > > > > Thanks, > > Antonio > > > > > > > > On 11/25/2014 12:52 PM, Amin Hassani wrote: > >> Hi, > >> > >> I am having problem running MPICH, on multiple nodes. When I run an multiple MPI processes on one node, it totally works, but when I try to run on multiple nodes, it fails with the error below. > >> My machines have Debian OS, Both infiniband and TCP interconnects. I'm guessing it has something do to with the TCP network, but I can run openmpi on these machines with no problem. But for some reason I cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there are some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution. > >> > >> ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. > >> > >> ?my host file (hosts-hydra) is something like this: > >> oakmnt-0-a:1 > >> oakmnt-0-b:1 ? > >> > >> ?I get this error:? 
> >> > >> $ mpirun -hostfile hosts-hydra -np 2 test_dup > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag > >> internal ABORT - process 1 > >> internal ABORT - process 0 > >> > >> =================================================================================== > >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > >> = PID 30744 RUNNING AT oakmnt-0-b > >> = EXIT CODE: 1 > >> = CLEANING UP REMAINING PROCESSES > >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > >> =================================================================================== > >> [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor) > >> [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event > >> [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion > >> > >> Thanks. > >> Amin Hassani, > >> CIS department at UAB, > >> Birmingham, AL, USA. > >> > >> > >> _______________________________________________ > >> discuss mailing list > >> discuss at mpich.org > >> > >> To manage subscription options or unsubscribe: > >> > >> https://lists.mpich.org/mailman/listinfo/discuss > > > > > > -- > > Antonio J. Pe?a > > Postdoctoral Appointee > > Mathematics and Computer Science Division > > Argonne National Laboratory > > 9700 South Cass Avenue, Bldg. 240, Of. 3148 > > Argonne, IL 60439-4847 > > > > apenya at mcs.anl.gov > > www.mcs.anl.gov/~apenya > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 22:35:23 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 22:35:23 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> Message-ID: It might be an issue with the cluster, but if I could some how run the mpich in debug mode, It might be useful, but no idea how to do it in mpich. Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 10:33 PM, Amin Hassani wrote: > Here you go! > > $ mpirun -hostfile hosts-hydra -np 2 hostname > oakmnt-0-a > oakmnt-0-b > > Thanks. > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei wrote: > >> I can run your simplest code on my machine without a problem. So I guess >> there is some problem in cluster connection. Could you give me the output >> of the following? >> >> $ mpirun -hostfile hosts-hydra -np 2 hostname >> >> ? >> Huiwei >> >> > On Nov 25, 2014, at 10:24 PM, Amin Hassani >> wrote: >> > >> > Hi, >> > >> > the code that I gave you had more stuff in it that I didn't want to >> distract you. here is the simpler send/recv test that I just ran and it >> failed. >> > >> > which mpirun: specific directory that I install my MPIs >> > /nethome/students/ahassani/usr/mpi/bin/mpirun >> > >> > mpirun with no argument: >> > $ mpirun >> > [mpiexec at oakmnt-0-a] set_default_values >> (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided >> > [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters >> (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values >> failed >> > [mpiexec at oakmnt-0-a] main >> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters >> > >> > >> > >> > #include >> > #include >> > #include >> > #include >> > #include >> > >> > int skip = 10; >> > int iter = 30; >> > >> > int main(int argc, char** argv) >> > { >> > int rank, size; >> > int i, j, k; >> > double t1, t2; >> > int rc; >> > >> > MPI_Init(&argc, &argv); >> > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >> > MPI_Comm_rank(world, &rank); >> > MPI_Comm_size(world, &size); >> > int a = 0, b = 1; >> > if(rank == 0){ >> > MPI_Send(&a, 1, MPI_INT, 1, 0, world); >> > }else{ >> > MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE); >> > } >> > >> > printf("b is %d\n", b); >> > MPI_Finalize(); >> > >> > return 0; >> > } >> > >> > Thank you. >> > >> > >> > Amin Hassani, >> > CIS department at UAB, >> > Birmingham, AL, USA. >> > >> > On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei >> wrote: >> > Hi, Amin, >> > >> > Could you quickly give us the output of the following command: "which >> mpirun" >> > >> > Also, your simplest code couldn?t compile correctly: "error: ?t_avg? >> undeclared (first use in this function)?. Can you fix it? >> > >> > ? >> > Huiwei >> > >> > > On Nov 25, 2014, at 2:58 PM, Amin Hassani >> wrote: >> > > >> > > This is the simplest code I have that doesn't run. 
>> > > >> > > >> > > #include >> > > #include >> > > #include >> > > #include >> > > #include >> > > >> > > int main(int argc, char** argv) >> > > { >> > > int rank, size; >> > > int i, j, k; >> > > double t1, t2; >> > > int rc; >> > > >> > > MPI_Init(&argc, &argv); >> > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >> > > MPI_Comm_rank(world, &rank); >> > > MPI_Comm_size(world, &size); >> > > >> > > t2 = 1; >> > > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); >> > > t_avg = t_avg / size; >> > > >> > > MPI_Finalize(); >> > > >> > > return 0; >> > > }? >> > > >> > > Amin Hassani, >> > > CIS department at UAB, >> > > Birmingham, AL, USA. >> > > >> > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" < >> apenya at mcs.anl.gov> wrote: >> > > >> > > Hi Amin, >> > > >> > > Can you share with us a minimal piece of code with which you can >> reproduce this issue? >> > > >> > > Thanks, >> > > Antonio >> > > >> > > >> > > >> > > On 11/25/2014 12:52 PM, Amin Hassani wrote: >> > >> Hi, >> > >> >> > >> I am having problem running MPICH, on multiple nodes. When I run an >> multiple MPI processes on one node, it totally works, but when I try to run >> on multiple nodes, it fails with the error below. >> > >> My machines have Debian OS, Both infiniband and TCP interconnects. >> I'm guessing it has something do to with the TCP network, but I can run >> openmpi on these machines with no problem. But for some reason I cannot run >> MPICH on multiple nodes. Please let me know if more info is needed from my >> side. I'm guessing there are some configuration that I am missing. I used >> MPICH 3.1.3 for this test. I googled this problem but couldn't find any >> solution. >> > >> >> > >> ?In my MPI program, I am doing a simple allreduce over >> MPI_COMM_WORLD?. >> > >> >> > >> ?my host file (hosts-hydra) is something like this: >> > >> oakmnt-0-a:1 >> > >> oakmnt-0-b:1 ? >> > >> >> > >> ?I get this error:? >> > >> >> > >> $ mpirun -hostfile hosts-hydra -np 2 test_dup >> > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >> status->MPI_TAG == recvtag >> > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >> status->MPI_TAG == recvtag >> > >> internal ABORT - process 1 >> > >> internal ABORT - process 0 >> > >> >> > >> >> =================================================================================== >> > >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> > >> = PID 30744 RUNNING AT oakmnt-0-b >> > >> = EXIT CODE: 1 >> > >> = CLEANING UP REMAINING PROCESSES >> > >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> > >> >> =================================================================================== >> > >> [mpiexec at vulcan13] HYDU_sock_read >> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file >> descriptor) >> > >> [mpiexec at vulcan13] control_cb >> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read >> command from proxy >> > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event >> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned >> error status >> > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion >> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for >> event >> > >> [mpiexec at vulcan13] main >> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error >> waiting for completion >> > >> >> > >> Thanks. >> > >> Amin Hassani, >> > >> CIS department at UAB, >> > >> Birmingham, AL, USA. 
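On the debug-mode question above, two standard options from the MPICH/Hydra tooling, sketched rather than tested on this cluster. Hydra will narrate the launch when mpiexec is given -verbose, which shows what is executed on each node and which proxy gives up first:

$ mpirun -verbose -hostfile hosts-hydra -np 2 ./test_dup

For stepping into the library itself, MPICH can be rebuilt with debugging enabled (./configure --enable-g=dbg ...), after which each rank can be run under gdb, for example

$ mpirun -hostfile hosts-hydra -np 2 xterm -e gdb ./test_dup

assuming an X display is reachable from the compute nodes; the backtrace at the failed assertion usually narrows things down quickly.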
>> > >> >> > >> >> > >> _______________________________________________ >> > >> discuss mailing list >> > >> discuss at mpich.org >> > >> >> > >> To manage subscription options or unsubscribe: >> > >> >> > >> https://lists.mpich.org/mailman/listinfo/discuss >> > > >> > > >> > > -- >> > > Antonio J. Pe?a >> > > Postdoctoral Appointee >> > > Mathematics and Computer Science Division >> > > Argonne National Laboratory >> > > 9700 South Cass Avenue, Bldg. 240, Of. 3148 >> > > Argonne, IL 60439-4847 >> > > >> > > apenya at mcs.anl.gov >> > > www.mcs.anl.gov/~apenya >> > > >> > > _______________________________________________ >> > > discuss mailing list discuss at mpich.org >> > > To manage subscription options or unsubscribe: >> > > https://lists.mpich.org/mailman/listinfo/discuss >> > > >> > > _______________________________________________ >> > > discuss mailing list discuss at mpich.org >> > > To manage subscription options or unsubscribe: >> > > https://lists.mpich.org/mailman/listinfo/discuss >> > >> > _______________________________________________ >> > discuss mailing list discuss at mpich.org >> > To manage subscription options or unsubscribe: >> > https://lists.mpich.org/mailman/listinfo/discuss >> > >> > _______________________________________________ >> > discuss mailing list discuss at mpich.org >> > To manage subscription options or unsubscribe: >> > https://lists.mpich.org/mailman/listinfo/discuss >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 22:33:22 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 22:33:22 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> References: <5474EAA0.9090800@mcs.anl.gov> <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> Message-ID: Here you go! $ mpirun -hostfile hosts-hydra -np 2 hostname oakmnt-0-a oakmnt-0-b Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 10:31 PM, Lu, Huiwei wrote: > I can run your simplest code on my machine without a problem. So I guess > there is some problem in cluster connection. Could you give me the output > of the following? > > $ mpirun -hostfile hosts-hydra -np 2 hostname > > ? > Huiwei > > > On Nov 25, 2014, at 10:24 PM, Amin Hassani wrote: > > > > Hi, > > > > the code that I gave you had more stuff in it that I didn't want to > distract you. here is the simpler send/recv test that I just ran and it > failed. 
> > > > which mpirun: specific directory that I install my MPIs > > /nethome/students/ahassani/usr/mpi/bin/mpirun > > > > mpirun with no argument: > > $ mpirun > > [mpiexec at oakmnt-0-a] set_default_values > (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided > > [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters > (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values > failed > > [mpiexec at oakmnt-0-a] main > (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters > > > > > > > > #include > > #include > > #include > > #include > > #include > > > > int skip = 10; > > int iter = 30; > > > > int main(int argc, char** argv) > > { > > int rank, size; > > int i, j, k; > > double t1, t2; > > int rc; > > > > MPI_Init(&argc, &argv); > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > MPI_Comm_rank(world, &rank); > > MPI_Comm_size(world, &size); > > int a = 0, b = 1; > > if(rank == 0){ > > MPI_Send(&a, 1, MPI_INT, 1, 0, world); > > }else{ > > MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE); > > } > > > > printf("b is %d\n", b); > > MPI_Finalize(); > > > > return 0; > > } > > > > Thank you. > > > > > > Amin Hassani, > > CIS department at UAB, > > Birmingham, AL, USA. > > > > On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei > wrote: > > Hi, Amin, > > > > Could you quickly give us the output of the following command: "which > mpirun" > > > > Also, your simplest code couldn?t compile correctly: "error: ?t_avg? > undeclared (first use in this function)?. Can you fix it? > > > > ? > > Huiwei > > > > > On Nov 25, 2014, at 2:58 PM, Amin Hassani > wrote: > > > > > > This is the simplest code I have that doesn't run. > > > > > > > > > #include > > > #include > > > #include > > > #include > > > #include > > > > > > int main(int argc, char** argv) > > > { > > > int rank, size; > > > int i, j, k; > > > double t1, t2; > > > int rc; > > > > > > MPI_Init(&argc, &argv); > > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > > MPI_Comm_rank(world, &rank); > > > MPI_Comm_size(world, &size); > > > > > > t2 = 1; > > > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); > > > t_avg = t_avg / size; > > > > > > MPI_Finalize(); > > > > > > return 0; > > > }? > > > > > > Amin Hassani, > > > CIS department at UAB, > > > Birmingham, AL, USA. > > > > > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" > wrote: > > > > > > Hi Amin, > > > > > > Can you share with us a minimal piece of code with which you can > reproduce this issue? > > > > > > Thanks, > > > Antonio > > > > > > > > > > > > On 11/25/2014 12:52 PM, Amin Hassani wrote: > > >> Hi, > > >> > > >> I am having problem running MPICH, on multiple nodes. When I run an > multiple MPI processes on one node, it totally works, but when I try to run > on multiple nodes, it fails with the error below. > > >> My machines have Debian OS, Both infiniband and TCP interconnects. > I'm guessing it has something do to with the TCP network, but I can run > openmpi on these machines with no problem. But for some reason I cannot run > MPICH on multiple nodes. Please let me know if more info is needed from my > side. I'm guessing there are some configuration that I am missing. I used > MPICH 3.1.3 for this test. I googled this problem but couldn't find any > solution. > > >> > > >> ?In my MPI program, I am doing a simple allreduce over > MPI_COMM_WORLD?. > > >> > > >> ?my host file (hosts-hydra) is something like this: > > >> oakmnt-0-a:1 > > >> oakmnt-0-b:1 ? > > >> > > >> ?I get this error:? 
> > >> > > >> $ mpirun -hostfile hosts-hydra -np 2 test_dup > > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: > status->MPI_TAG == recvtag > > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: > status->MPI_TAG == recvtag > > >> internal ABORT - process 1 > > >> internal ABORT - process 0 > > >> > > >> > =================================================================================== > > >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > > >> = PID 30744 RUNNING AT oakmnt-0-b > > >> = EXIT CODE: 1 > > >> = CLEANING UP REMAINING PROCESSES > > >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > >> > =================================================================================== > > >> [mpiexec at vulcan13] HYDU_sock_read > (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file > descriptor) > > >> [mpiexec at vulcan13] control_cb > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read > command from proxy > > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event > (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned > error status > > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for > event > > >> [mpiexec at vulcan13] main > (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error > waiting for completion > > >> > > >> Thanks. > > >> Amin Hassani, > > >> CIS department at UAB, > > >> Birmingham, AL, USA. > > >> > > >> > > >> _______________________________________________ > > >> discuss mailing list > > >> discuss at mpich.org > > >> > > >> To manage subscription options or unsubscribe: > > >> > > >> https://lists.mpich.org/mailman/listinfo/discuss > > > > > > > > > -- > > > Antonio J. Pe?a > > > Postdoctoral Appointee > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > 9700 South Cass Avenue, Bldg. 240, Of. 3148 > > > Argonne, IL 60439-4847 > > > > > > apenya at mcs.anl.gov > > > www.mcs.anl.gov/~apenya > > > > > > _______________________________________________ > > > discuss mailing list discuss at mpich.org > > > To manage subscription options or unsubscribe: > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > > > _______________________________________________ > > > discuss mailing list discuss at mpich.org > > > To manage subscription options or unsubscribe: > > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Tue Nov 25 22:31:47 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Wed, 26 Nov 2014 04:31:47 +0000 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> Message-ID: <0A759D1E-AF0A-4841-BFA6-88408D304C29@anl.gov> I can run your simplest code on my machine without a problem. So I guess there is some problem in cluster connection. Could you give me the output of the following? $ mpirun -hostfile hosts-hydra -np 2 hostname ? Huiwei > On Nov 25, 2014, at 10:24 PM, Amin Hassani wrote: > > Hi, > > the code that I gave you had more stuff in it that I didn't want to distract you. here is the simpler send/recv test that I just ran and it failed. > > which mpirun: specific directory that I install my MPIs > /nethome/students/ahassani/usr/mpi/bin/mpirun > > mpirun with no argument: > $ mpirun > [mpiexec at oakmnt-0-a] set_default_values (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided > [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values failed > [mpiexec at oakmnt-0-a] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters > > > > #include > #include > #include > #include > #include > > int skip = 10; > int iter = 30; > > int main(int argc, char** argv) > { > int rank, size; > int i, j, k; > double t1, t2; > int rc; > > MPI_Init(&argc, &argv); > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > MPI_Comm_rank(world, &rank); > MPI_Comm_size(world, &size); > int a = 0, b = 1; > if(rank == 0){ > MPI_Send(&a, 1, MPI_INT, 1, 0, world); > }else{ > MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE); > } > > printf("b is %d\n", b); > MPI_Finalize(); > > return 0; > } > > Thank you. > > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei wrote: > Hi, Amin, > > Could you quickly give us the output of the following command: "which mpirun" > > Also, your simplest code couldn?t compile correctly: "error: ?t_avg? undeclared (first use in this function)?. Can you fix it? > > ? > Huiwei > > > On Nov 25, 2014, at 2:58 PM, Amin Hassani wrote: > > > > This is the simplest code I have that doesn't run. > > > > > > #include > > #include > > #include > > #include > > #include > > > > int main(int argc, char** argv) > > { > > int rank, size; > > int i, j, k; > > double t1, t2; > > int rc; > > > > MPI_Init(&argc, &argv); > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > MPI_Comm_rank(world, &rank); > > MPI_Comm_size(world, &size); > > > > t2 = 1; > > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); > > t_avg = t_avg / size; > > > > MPI_Finalize(); > > > > return 0; > > }? > > > > Amin Hassani, > > CIS department at UAB, > > Birmingham, AL, USA. > > > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" wrote: > > > > Hi Amin, > > > > Can you share with us a minimal piece of code with which you can reproduce this issue? > > > > Thanks, > > Antonio > > > > > > > > On 11/25/2014 12:52 PM, Amin Hassani wrote: > >> Hi, > >> > >> I am having problem running MPICH, on multiple nodes. 
When I run an multiple MPI processes on one node, it totally works, but when I try to run on multiple nodes, it fails with the error below. > >> My machines have Debian OS, Both infiniband and TCP interconnects. I'm guessing it has something do to with the TCP network, but I can run openmpi on these machines with no problem. But for some reason I cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there are some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution. > >> > >> ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. > >> > >> ?my host file (hosts-hydra) is something like this: > >> oakmnt-0-a:1 > >> oakmnt-0-b:1 ? > >> > >> ?I get this error:? > >> > >> $ mpirun -hostfile hosts-hydra -np 2 test_dup > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag > >> internal ABORT - process 1 > >> internal ABORT - process 0 > >> > >> =================================================================================== > >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > >> = PID 30744 RUNNING AT oakmnt-0-b > >> = EXIT CODE: 1 > >> = CLEANING UP REMAINING PROCESSES > >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > >> =================================================================================== > >> [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor) > >> [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event > >> [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion > >> > >> Thanks. > >> Amin Hassani, > >> CIS department at UAB, > >> Birmingham, AL, USA. > >> > >> > >> _______________________________________________ > >> discuss mailing list > >> discuss at mpich.org > >> > >> To manage subscription options or unsubscribe: > >> > >> https://lists.mpich.org/mailman/listinfo/discuss > > > > > > -- > > Antonio J. Pe?a > > Postdoctoral Appointee > > Mathematics and Computer Science Division > > Argonne National Laboratory > > 9700 South Cass Avenue, Bldg. 240, Of. 
3148 > > Argonne, IL 60439-4847 > > > > apenya at mcs.anl.gov > > www.mcs.anl.gov/~apenya > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 22:24:17 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 22:24:17 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> Message-ID: Hi, the code that I gave you had more stuff in it that I didn't want to distract you. here is the simpler send/recv test that I just ran and it failed. which mpirun: specific directory that I install my MPIs /nethome/students/ahassani/usr/mpi/bin/mpirun mpirun with no argument: $ mpirun [mpiexec at oakmnt-0-a] set_default_values (../../../../src/pm/hydra/ui/mpich/utils.c:1528): no executable provided [mpiexec at oakmnt-0-a] HYD_uii_mpx_get_parameters (../../../../src/pm/hydra/ui/mpich/utils.c:1739): setting default values failed [mpiexec at oakmnt-0-a] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:153): error parsing parameters #include #include #include #include #include int skip = 10; int iter = 30; int main(int argc, char** argv) { int rank, size; int i, j, k; double t1, t2; int rc; MPI_Init(&argc, &argv); MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; MPI_Comm_rank(world, &rank); MPI_Comm_size(world, &size); int a = 0, b = 1; if(rank == 0){ MPI_Send(&a, 1, MPI_INT, 1, 0, world); }else{ MPI_Recv(&b, 1, MPI_INT, 0, 0, world, MPI_STATUS_IGNORE); } printf("b is %d\n", b); MPI_Finalize(); return 0; } Thank you. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 10:20 PM, Lu, Huiwei wrote: > Hi, Amin, > > Could you quickly give us the output of the following command: "which > mpirun" > > Also, your simplest code couldn?t compile correctly: "error: ?t_avg? > undeclared (first use in this function)?. Can you fix it? > > ? > Huiwei > > > On Nov 25, 2014, at 2:58 PM, Amin Hassani wrote: > > > > This is the simplest code I have that doesn't run. > > > > > > #include > > #include > > #include > > #include > > #include > > > > int main(int argc, char** argv) > > { > > int rank, size; > > int i, j, k; > > double t1, t2; > > int rc; > > > > MPI_Init(&argc, &argv); > > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > > MPI_Comm_rank(world, &rank); > > MPI_Comm_size(world, &size); > > > > t2 = 1; > > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); > > t_avg = t_avg / size; > > > > MPI_Finalize(); > > > > return 0; > > }? > > > > Amin Hassani, > > CIS department at UAB, > > Birmingham, AL, USA. 
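As a side note for anyone reading this in the archive: the "simplest code" quoted just above does not compile as posted. The list software has stripped the header names out of the #include lines, and t_avg is used without being declared, which is the compile error Huiwei points out. A minimal version that should build with mpicc, assuming the intent was simply to average a value over MPI_COMM_WORLD, would look roughly like this (the printf at the end is only there to show the result):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double t2, t_avg;                 /* t_avg declared here; missing in the posted code */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    t2 = 1.0;
    MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t_avg = t_avg / size;
    printf("rank %d of %d: t_avg = %f\n", rank, size, t_avg);

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched as elsewhere in the thread (mpirun -hostfile hosts-hydra -np 2 ./test_dup), every rank should print t_avg = 1.000000; the fact that the same kind of binary runs fine on one node but aborts across two points at the launch or transport setup rather than at this code.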
> > > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" > wrote: > > > > Hi Amin, > > > > Can you share with us a minimal piece of code with which you can > reproduce this issue? > > > > Thanks, > > Antonio > > > > > > > > On 11/25/2014 12:52 PM, Amin Hassani wrote: > >> Hi, > >> > >> I am having problem running MPICH, on multiple nodes. When I run an > multiple MPI processes on one node, it totally works, but when I try to run > on multiple nodes, it fails with the error below. > >> My machines have Debian OS, Both infiniband and TCP interconnects. I'm > guessing it has something do to with the TCP network, but I can run openmpi > on these machines with no problem. But for some reason I cannot run MPICH > on multiple nodes. Please let me know if more info is needed from my side. > I'm guessing there are some configuration that I am missing. I used MPICH > 3.1.3 for this test. I googled this problem but couldn't find any solution. > >> > >> ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. > >> > >> ?my host file (hosts-hydra) is something like this: > >> oakmnt-0-a:1 > >> oakmnt-0-b:1 ? > >> > >> ?I get this error:? > >> > >> $ mpirun -hostfile hosts-hydra -np 2 test_dup > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: > status->MPI_TAG == recvtag > >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: > status->MPI_TAG == recvtag > >> internal ABORT - process 1 > >> internal ABORT - process 0 > >> > >> > =================================================================================== > >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > >> = PID 30744 RUNNING AT oakmnt-0-b > >> = EXIT CODE: 1 > >> = CLEANING UP REMAINING PROCESSES > >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > >> > =================================================================================== > >> [mpiexec at vulcan13] HYDU_sock_read > (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file > descriptor) > >> [mpiexec at vulcan13] control_cb > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read > command from proxy > >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event > (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned > error status > >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for > event > >> [mpiexec at vulcan13] main > (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error > waiting for completion > >> > >> Thanks. > >> Amin Hassani, > >> CIS department at UAB, > >> Birmingham, AL, USA. > >> > >> > >> _______________________________________________ > >> discuss mailing list > >> discuss at mpich.org > >> > >> To manage subscription options or unsubscribe: > >> > >> https://lists.mpich.org/mailman/listinfo/discuss > > > > > > -- > > Antonio J. Pe?a > > Postdoctoral Appointee > > Mathematics and Computer Science Division > > Argonne National Laboratory > > 9700 South Cass Avenue, Bldg. 240, Of. 
3148 > > Argonne, IL 60439-4847 > > > > apenya at mcs.anl.gov > > www.mcs.anl.gov/~apenya > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Tue Nov 25 22:20:32 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Wed, 26 Nov 2014 04:20:32 +0000 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> Message-ID: Hi, Amin, Could you quickly give us the output of the following command: "which mpirun" Also, your simplest code couldn?t compile correctly: "error: ?t_avg? undeclared (first use in this function)?. Can you fix it? ? Huiwei > On Nov 25, 2014, at 2:58 PM, Amin Hassani wrote: > > This is the simplest code I have that doesn't run. > > > #include > #include > #include > #include > #include > > int main(int argc, char** argv) > { > int rank, size; > int i, j, k; > double t1, t2; > int rc; > > MPI_Init(&argc, &argv); > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > MPI_Comm_rank(world, &rank); > MPI_Comm_size(world, &size); > > t2 = 1; > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); > t_avg = t_avg / size; > > MPI_Finalize(); > > return 0; > }? > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" wrote: > > Hi Amin, > > Can you share with us a minimal piece of code with which you can reproduce this issue? > > Thanks, > Antonio > > > > On 11/25/2014 12:52 PM, Amin Hassani wrote: >> Hi, >> >> I am having problem running MPICH, on multiple nodes. When I run an multiple MPI processes on one node, it totally works, but when I try to run on multiple nodes, it fails with the error below. >> My machines have Debian OS, Both infiniband and TCP interconnects. I'm guessing it has something do to with the TCP network, but I can run openmpi on these machines with no problem. But for some reason I cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there are some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution. >> >> ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. >> >> ?my host file (hosts-hydra) is something like this: >> oakmnt-0-a:1 >> oakmnt-0-b:1 ? >> >> ?I get this error:? 
>> >> $ mpirun -hostfile hosts-hydra -np 2 test_dup >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag >> internal ABORT - process 1 >> internal ABORT - process 0 >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 30744 RUNNING AT oakmnt-0-b >> = EXIT CODE: 1 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> =================================================================================== >> [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor) >> [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event >> [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion >> >> Thanks. >> Amin Hassani, >> CIS department at UAB, >> Birmingham, AL, USA. >> >> >> _______________________________________________ >> discuss mailing list >> discuss at mpich.org >> >> To manage subscription options or unsubscribe: >> >> https://lists.mpich.org/mailman/listinfo/discuss > > > -- > Antonio J. Pe?a > Postdoctoral Appointee > Mathematics and Computer Science Division > Argonne National Laboratory > 9700 South Cass Avenue, Bldg. 240, Of. 3148 > Argonne, IL 60439-4847 > > apenya at mcs.anl.gov > www.mcs.anl.gov/~apenya > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Tue Nov 25 21:53:50 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Tue, 25 Nov 2014 21:53:50 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <78B3A627-D829-40A5-98E6-632502B10F74@anl.gov> Message-ID: Is the failure specific to MPI_Allreduce? Did other tests (like simple send/recv) work? --Junchao Zhang On Tue, Nov 25, 2014 at 9:41 PM, Amin Hassani wrote: > Is there any debugging flag that I can turn on to figure out problems? > > Thanks. > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 9:31 PM, Amin Hassani > wrote: > >> Now I'm getting this error with MPICH-3.2a2 >> Any thought? 
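On the question about debugging flags: the Hydra process manager that ships with MPICH 3.x has a few launcher-side options that usually help with problems that only appear across nodes. The exact option names can vary between releases, so check mpiexec --help on the installed build, but something along these lines is a reasonable first pass:

$ mpirun -verbose -hostfile hosts-hydra -np 2 ./test_dup
      (shows what Hydra launches on each host and the PMI exchange)
$ mpirun -prepend-rank -hostfile hosts-hydra -np 2 ./test_dup
      (tags every output line with the rank that produced it)
$ mpirun -print-all-exitcodes -hostfile hosts-hydra -np 2 ./test_dup
      (reports how each process exited instead of only the first failure)

For MPI-internal tracing a debug build of the library is needed (configure with --enable-g=dbg or similar); a stock build only prints the error stacks already shown in this thread.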
>> >> ?$ mpirun -hostfile hosts-hydra -np 2 test_dup >> Fatal error in MPI_Allreduce: Unknown error class, error stack: >> MPI_Allreduce(912)....................: >> MPI_Allreduce(sbuf=0x7fffa5240e60, rbuf=0x7fffa5240e68, count=1, >> MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed >> MPIR_Allreduce_impl(769)..............: >> MPIR_Allreduce_intra(419).............: >> MPIDU_Complete_posted_with_error(1192): Process failed >> Fatal error in MPI_Allreduce: Unknown error class, error stack: >> MPI_Allreduce(912)....................: >> MPI_Allreduce(sbuf=0x7fffaf6ef070, rbuf=0x7fffaf6ef078, count=1, >> MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed >> MPIR_Allreduce_impl(769)..............: >> MPIR_Allreduce_intra(419).............: >> MPIDU_Complete_posted_with_error(1192): Process failed >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 451 RUNNING AT oakmnt-0-a >> = EXIT CODE: 1 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> ? >> >> Thanks. >> >> Amin Hassani, >> CIS department at UAB, >> Birmingham, AL, USA. >> >> On Tue, Nov 25, 2014 at 9:25 PM, Amin Hassani >> wrote: >> >>> Ok, I'll try to test the alpha version. I'll let you know the results. >>> >>> Thank you. >>> >>> Amin Hassani, >>> CIS department at UAB, >>> Birmingham, AL, USA. >>> >>> On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. >>> wrote: >>> >>>> It?s hard to tell then. Other than some problems compiling (not >>>> declaring all of your variables), everything seems ok. Can you try running >>>> with the most recent alpha. I have no idea what bug we could have fixed >>>> here to make things work, but it?d be good to eliminate the possibility. >>>> >>>> Thanks, >>>> Wesley >>>> >>>> On Nov 25, 2014, at 10:11 PM, Amin Hassani >>>> wrote: >>>> >>>> Here I attached config.log exits in the root folder where it is >>>> compiled. I'm not too familiar with MPICH but, there are other config.logs >>>> in other directories also but not sure if you needed them too. >>>> I don't have any specific environment variable that can relate to >>>> MPICH. Also tried with >>>> export HYDRA_HOST_FILE=
, >>>> but have the same problem. >>>> I don't do anything FT related in MPICH, I don't think this version of >>>> MPICH has anything related to FT in it. >>>> >>>> Thanks. >>>> >>>> Amin Hassani, >>>> CIS department at UAB, >>>> Birmingham, AL, USA. >>>> >>>> On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. >>>> wrote: >>>> >>>>> Can you also provide your config.log and any CVARs or other relevant >>>>> environment variables that you might be setting (for instance, in relation >>>>> to fault tolerance)? >>>>> >>>>> Thanks, >>>>> Wesley >>>>> >>>>> >>>>> On Nov 25, 2014, at 3:58 PM, Amin Hassani >>>>> wrote: >>>>> >>>>> This is the simplest code I have that doesn't run. >>>>> >>>>> >>>>> #include >>>>> #include >>>>> #include >>>>> #include >>>>> #include >>>>> >>>>> int main(int argc, char** argv) >>>>> { >>>>> int rank, size; >>>>> int i, j, k; >>>>> double t1, t2; >>>>> int rc; >>>>> >>>>> MPI_Init(&argc, &argv); >>>>> MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >>>>> MPI_Comm_rank(world, &rank); >>>>> MPI_Comm_size(world, &size); >>>>> >>>>> t2 = 1; >>>>> MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); >>>>> t_avg = t_avg / size; >>>>> >>>>> MPI_Finalize(); >>>>> >>>>> return 0; >>>>> }? >>>>> >>>>> Amin Hassani, >>>>> CIS department at UAB, >>>>> Birmingham, AL, USA. >>>>> >>>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" >>>> > wrote: >>>>> >>>>>> >>>>>> Hi Amin, >>>>>> >>>>>> Can you share with us a minimal piece of code with which you can >>>>>> reproduce this issue? >>>>>> >>>>>> Thanks, >>>>>> Antonio >>>>>> >>>>>> >>>>>> >>>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> I am having problem running MPICH, on multiple nodes. When I run an >>>>>> multiple MPI processes on one node, it totally works, but when I try to run >>>>>> on multiple nodes, it fails with the error below. >>>>>> My machines have Debian OS, Both infiniband and TCP interconnects. >>>>>> I'm guessing it has something do to with the TCP network, but I can run >>>>>> openmpi on these machines with no problem. But for some reason I cannot run >>>>>> MPICH on multiple nodes. Please let me know if more info is needed from my >>>>>> side. I'm guessing there are some configuration that I am missing. I used >>>>>> MPICH 3.1.3 for this test. I googled this problem but couldn't find any >>>>>> solution. >>>>>> >>>>>> ?In my MPI program, I am doing a simple allreduce over >>>>>> MPI_COMM_WORLD?. >>>>>> >>>>>> ?my host file (hosts-hydra) is something like this: >>>>>> oakmnt-0-a:1 >>>>>> oakmnt-0-b:1 ? >>>>>> >>>>>> ?I get this error:? 
>>>>>> >>>>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup >>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >>>>>> status->MPI_TAG == recvtag >>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >>>>>> status->MPI_TAG == recvtag >>>>>> internal ABORT - process 1 >>>>>> internal ABORT - process 0 >>>>>> >>>>>> >>>>>> =================================================================================== >>>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >>>>>> = PID 30744 RUNNING AT oakmnt-0-b >>>>>> = EXIT CODE: 1 >>>>>> = CLEANING UP REMAINING PROCESSES >>>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >>>>>> >>>>>> =================================================================================== >>>>>> [mpiexec at vulcan13] HYDU_sock_read >>>>>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file >>>>>> descriptor) >>>>>> [mpiexec at vulcan13] control_cb >>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read >>>>>> command from proxy >>>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event >>>>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned >>>>>> error status >>>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion >>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for >>>>>> event >>>>>> [mpiexec at vulcan13] main >>>>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error >>>>>> waiting for completion >>>>>> >>>>>> Thanks. >>>>>> Amin Hassani, >>>>>> CIS department at UAB, >>>>>> Birmingham, AL, USA. >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> discuss mailing list discuss at mpich.org >>>>>> To manage subscription options or unsubscribe:https://lists.mpich.org/mailman/listinfo/discuss >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Antonio J. Pe?a >>>>>> Postdoctoral Appointee >>>>>> Mathematics and Computer Science Division >>>>>> Argonne National Laboratory >>>>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148 >>>>>> Argonne, IL 60439-4847apenya at mcs.anl.govwww.mcs.anl.gov/~apenya >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> discuss mailing list discuss at mpich.org >>>>>> To manage subscription options or unsubscribe: >>>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>>> >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list discuss at mpich.org >>>>> To manage subscription options or unsubscribe: >>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list discuss at mpich.org >>>>> To manage subscription options or unsubscribe: >>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>> >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>>> >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>> >>> >> > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 21:41:06 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 21:41:06 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <78B3A627-D829-40A5-98E6-632502B10F74@anl.gov> Message-ID: Is there any debugging flag that I can turn on to figure out problems? Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 9:31 PM, Amin Hassani wrote: > Now I'm getting this error with MPICH-3.2a2 > Any thought? > > ?$ mpirun -hostfile hosts-hydra -np 2 test_dup > Fatal error in MPI_Allreduce: Unknown error class, error stack: > MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffa5240e60, > rbuf=0x7fffa5240e68, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed > MPIR_Allreduce_impl(769)..............: > MPIR_Allreduce_intra(419).............: > MPIDU_Complete_posted_with_error(1192): Process failed > Fatal error in MPI_Allreduce: Unknown error class, error stack: > MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffaf6ef070, > rbuf=0x7fffaf6ef078, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed > MPIR_Allreduce_impl(769)..............: > MPIR_Allreduce_intra(419).............: > MPIDU_Complete_posted_with_error(1192): Process failed > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 451 RUNNING AT oakmnt-0-a > = EXIT CODE: 1 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > ? > > Thanks. > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 9:25 PM, Amin Hassani > wrote: > >> Ok, I'll try to test the alpha version. I'll let you know the results. >> >> Thank you. >> >> Amin Hassani, >> CIS department at UAB, >> Birmingham, AL, USA. >> >> On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. wrote: >> >>> It?s hard to tell then. Other than some problems compiling (not >>> declaring all of your variables), everything seems ok. Can you try running >>> with the most recent alpha. I have no idea what bug we could have fixed >>> here to make things work, but it?d be good to eliminate the possibility. >>> >>> Thanks, >>> Wesley >>> >>> On Nov 25, 2014, at 10:11 PM, Amin Hassani >>> wrote: >>> >>> Here I attached config.log exits in the root folder where it is >>> compiled. I'm not too familiar with MPICH but, there are other config.logs >>> in other directories also but not sure if you needed them too. >>> I don't have any specific environment variable that can relate to >>> MPICH. Also tried with >>> export HYDRA_HOST_FILE=
, >>> but have the same problem. >>> I don't do anything FT related in MPICH, I don't think this version of >>> MPICH has anything related to FT in it. >>> >>> Thanks. >>> >>> Amin Hassani, >>> CIS department at UAB, >>> Birmingham, AL, USA. >>> >>> On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. >>> wrote: >>> >>>> Can you also provide your config.log and any CVARs or other relevant >>>> environment variables that you might be setting (for instance, in relation >>>> to fault tolerance)? >>>> >>>> Thanks, >>>> Wesley >>>> >>>> >>>> On Nov 25, 2014, at 3:58 PM, Amin Hassani >>>> wrote: >>>> >>>> This is the simplest code I have that doesn't run. >>>> >>>> >>>> #include >>>> #include >>>> #include >>>> #include >>>> #include >>>> >>>> int main(int argc, char** argv) >>>> { >>>> int rank, size; >>>> int i, j, k; >>>> double t1, t2; >>>> int rc; >>>> >>>> MPI_Init(&argc, &argv); >>>> MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >>>> MPI_Comm_rank(world, &rank); >>>> MPI_Comm_size(world, &size); >>>> >>>> t2 = 1; >>>> MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); >>>> t_avg = t_avg / size; >>>> >>>> MPI_Finalize(); >>>> >>>> return 0; >>>> }? >>>> >>>> Amin Hassani, >>>> CIS department at UAB, >>>> Birmingham, AL, USA. >>>> >>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" >>>> wrote: >>>> >>>>> >>>>> Hi Amin, >>>>> >>>>> Can you share with us a minimal piece of code with which you can >>>>> reproduce this issue? >>>>> >>>>> Thanks, >>>>> Antonio >>>>> >>>>> >>>>> >>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote: >>>>> >>>>> Hi, >>>>> >>>>> I am having problem running MPICH, on multiple nodes. When I run an >>>>> multiple MPI processes on one node, it totally works, but when I try to run >>>>> on multiple nodes, it fails with the error below. >>>>> My machines have Debian OS, Both infiniband and TCP interconnects. >>>>> I'm guessing it has something do to with the TCP network, but I can run >>>>> openmpi on these machines with no problem. But for some reason I cannot run >>>>> MPICH on multiple nodes. Please let me know if more info is needed from my >>>>> side. I'm guessing there are some configuration that I am missing. I used >>>>> MPICH 3.1.3 for this test. I googled this problem but couldn't find any >>>>> solution. >>>>> >>>>> ?In my MPI program, I am doing a simple allreduce over >>>>> MPI_COMM_WORLD?. >>>>> >>>>> ?my host file (hosts-hydra) is something like this: >>>>> oakmnt-0-a:1 >>>>> oakmnt-0-b:1 ? >>>>> >>>>> ?I get this error:? 
>>>>> >>>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup >>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >>>>> status->MPI_TAG == recvtag >>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >>>>> status->MPI_TAG == recvtag >>>>> internal ABORT - process 1 >>>>> internal ABORT - process 0 >>>>> >>>>> >>>>> =================================================================================== >>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >>>>> = PID 30744 RUNNING AT oakmnt-0-b >>>>> = EXIT CODE: 1 >>>>> = CLEANING UP REMAINING PROCESSES >>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >>>>> >>>>> =================================================================================== >>>>> [mpiexec at vulcan13] HYDU_sock_read >>>>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file >>>>> descriptor) >>>>> [mpiexec at vulcan13] control_cb >>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read >>>>> command from proxy >>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event >>>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned >>>>> error status >>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion >>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for >>>>> event >>>>> [mpiexec at vulcan13] main >>>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error >>>>> waiting for completion >>>>> >>>>> Thanks. >>>>> Amin Hassani, >>>>> CIS department at UAB, >>>>> Birmingham, AL, USA. >>>>> >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list discuss at mpich.org >>>>> To manage subscription options or unsubscribe:https://lists.mpich.org/mailman/listinfo/discuss >>>>> >>>>> >>>>> >>>>> -- >>>>> Antonio J. Pe?a >>>>> Postdoctoral Appointee >>>>> Mathematics and Computer Science Division >>>>> Argonne National Laboratory >>>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148 >>>>> Argonne, IL 60439-4847apenya at mcs.anl.govwww.mcs.anl.gov/~apenya >>>>> >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list discuss at mpich.org >>>>> To manage subscription options or unsubscribe: >>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>> >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>>> >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 21:31:19 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 21:31:19 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <78B3A627-D829-40A5-98E6-632502B10F74@anl.gov> Message-ID: Now I'm getting this error with MPICH-3.2a2 Any thought? ?$ mpirun -hostfile hosts-hydra -np 2 test_dup Fatal error in MPI_Allreduce: Unknown error class, error stack: MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffa5240e60, rbuf=0x7fffa5240e68, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed MPIR_Allreduce_impl(769)..............: MPIR_Allreduce_intra(419).............: MPIDU_Complete_posted_with_error(1192): Process failed Fatal error in MPI_Allreduce: Unknown error class, error stack: MPI_Allreduce(912)....................: MPI_Allreduce(sbuf=0x7fffaf6ef070, rbuf=0x7fffaf6ef078, count=1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed MPIR_Allreduce_impl(769)..............: MPIR_Allreduce_intra(419).............: MPIDU_Complete_posted_with_error(1192): Process failed =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 451 RUNNING AT oakmnt-0-a = EXIT CODE: 1 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== ? Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 9:25 PM, Amin Hassani wrote: > Ok, I'll try to test the alpha version. I'll let you know the results. > > Thank you. > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. wrote: > >> It?s hard to tell then. Other than some problems compiling (not >> declaring all of your variables), everything seems ok. Can you try running >> with the most recent alpha. I have no idea what bug we could have fixed >> here to make things work, but it?d be good to eliminate the possibility. >> >> Thanks, >> Wesley >> >> On Nov 25, 2014, at 10:11 PM, Amin Hassani wrote: >> >> Here I attached config.log exits in the root folder where it is >> compiled. I'm not too familiar with MPICH but, there are other config.logs >> in other directories also but not sure if you needed them too. >> I don't have any specific environment variable that can relate to MPICH. >> Also tried with >> export HYDRA_HOST_FILE=
, >> but have the same problem. >> I don't do anything FT related in MPICH, I don't think this version of >> MPICH has anything related to FT in it. >> >> Thanks. >> >> Amin Hassani, >> CIS department at UAB, >> Birmingham, AL, USA. >> >> On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. wrote: >> >>> Can you also provide your config.log and any CVARs or other relevant >>> environment variables that you might be setting (for instance, in relation >>> to fault tolerance)? >>> >>> Thanks, >>> Wesley >>> >>> >>> On Nov 25, 2014, at 3:58 PM, Amin Hassani wrote: >>> >>> This is the simplest code I have that doesn't run. >>> >>> >>> #include >>> #include >>> #include >>> #include >>> #include >>> >>> int main(int argc, char** argv) >>> { >>> int rank, size; >>> int i, j, k; >>> double t1, t2; >>> int rc; >>> >>> MPI_Init(&argc, &argv); >>> MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >>> MPI_Comm_rank(world, &rank); >>> MPI_Comm_size(world, &size); >>> >>> t2 = 1; >>> MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); >>> t_avg = t_avg / size; >>> >>> MPI_Finalize(); >>> >>> return 0; >>> }? >>> >>> Amin Hassani, >>> CIS department at UAB, >>> Birmingham, AL, USA. >>> >>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" >>> wrote: >>> >>>> >>>> Hi Amin, >>>> >>>> Can you share with us a minimal piece of code with which you can >>>> reproduce this issue? >>>> >>>> Thanks, >>>> Antonio >>>> >>>> >>>> >>>> On 11/25/2014 12:52 PM, Amin Hassani wrote: >>>> >>>> Hi, >>>> >>>> I am having problem running MPICH, on multiple nodes. When I run an >>>> multiple MPI processes on one node, it totally works, but when I try to run >>>> on multiple nodes, it fails with the error below. >>>> My machines have Debian OS, Both infiniband and TCP interconnects. I'm >>>> guessing it has something do to with the TCP network, but I can run openmpi >>>> on these machines with no problem. But for some reason I cannot run MPICH >>>> on multiple nodes. Please let me know if more info is needed from my side. >>>> I'm guessing there are some configuration that I am missing. I used MPICH >>>> 3.1.3 for this test. I googled this problem but couldn't find any solution. >>>> >>>> ?In my MPI program, I am doing a simple allreduce over >>>> MPI_COMM_WORLD?. >>>> >>>> ?my host file (hosts-hydra) is something like this: >>>> oakmnt-0-a:1 >>>> oakmnt-0-b:1 ? >>>> >>>> ?I get this error:? 
>>>> >>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup >>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >>>> status->MPI_TAG == recvtag >>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >>>> status->MPI_TAG == recvtag >>>> internal ABORT - process 1 >>>> internal ABORT - process 0 >>>> >>>> >>>> =================================================================================== >>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >>>> = PID 30744 RUNNING AT oakmnt-0-b >>>> = EXIT CODE: 1 >>>> = CLEANING UP REMAINING PROCESSES >>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >>>> >>>> =================================================================================== >>>> [mpiexec at vulcan13] HYDU_sock_read >>>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file >>>> descriptor) >>>> [mpiexec at vulcan13] control_cb >>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read >>>> command from proxy >>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event >>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned >>>> error status >>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion >>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for >>>> event >>>> [mpiexec at vulcan13] main >>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error >>>> waiting for completion >>>> >>>> Thanks. >>>> Amin Hassani, >>>> CIS department at UAB, >>>> Birmingham, AL, USA. >>>> >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe:https://lists.mpich.org/mailman/listinfo/discuss >>>> >>>> >>>> >>>> -- >>>> Antonio J. Pe?a >>>> Postdoctoral Appointee >>>> Mathematics and Computer Science Division >>>> Argonne National Laboratory >>>> 9700 South Cass Avenue, Bldg. 240, Of. 3148 >>>> Argonne, IL 60439-4847apenya at mcs.anl.govwww.mcs.anl.gov/~apenya >>>> >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 21:25:02 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 21:25:02 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <78B3A627-D829-40A5-98E6-632502B10F74@anl.gov> Message-ID: Ok, I'll try to test the alpha version. I'll let you know the results. Thank you. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. wrote: > It?s hard to tell then. Other than some problems compiling (not declaring > all of your variables), everything seems ok. Can you try running with the > most recent alpha. I have no idea what bug we could have fixed here to make > things work, but it?d be good to eliminate the possibility. > > Thanks, > Wesley > > On Nov 25, 2014, at 10:11 PM, Amin Hassani wrote: > > Here I attached config.log exits in the root folder where it is > compiled. I'm not too familiar with MPICH but, there are other config.logs > in other directories also but not sure if you needed them too. > I don't have any specific environment variable that can relate to MPICH. > Also tried with > export HYDRA_HOST_FILE=
, > but have the same problem. > I don't do anything FT related in MPICH, I don't think this version of > MPICH has anything related to FT in it. > > Thanks. > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. wrote: > >> Can you also provide your config.log and any CVARs or other relevant >> environment variables that you might be setting (for instance, in relation >> to fault tolerance)? >> >> Thanks, >> Wesley >> >> >> On Nov 25, 2014, at 3:58 PM, Amin Hassani wrote: >> >> This is the simplest code I have that doesn't run. >> >> >> #include >> #include >> #include >> #include >> #include >> >> int main(int argc, char** argv) >> { >> int rank, size; >> int i, j, k; >> double t1, t2; >> int rc; >> >> MPI_Init(&argc, &argv); >> MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >> MPI_Comm_rank(world, &rank); >> MPI_Comm_size(world, &size); >> >> t2 = 1; >> MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); >> t_avg = t_avg / size; >> >> MPI_Finalize(); >> >> return 0; >> }? >> >> Amin Hassani, >> CIS department at UAB, >> Birmingham, AL, USA. >> >> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" >> wrote: >> >>> >>> Hi Amin, >>> >>> Can you share with us a minimal piece of code with which you can >>> reproduce this issue? >>> >>> Thanks, >>> Antonio >>> >>> >>> >>> On 11/25/2014 12:52 PM, Amin Hassani wrote: >>> >>> Hi, >>> >>> I am having problem running MPICH, on multiple nodes. When I run an >>> multiple MPI processes on one node, it totally works, but when I try to run >>> on multiple nodes, it fails with the error below. >>> My machines have Debian OS, Both infiniband and TCP interconnects. I'm >>> guessing it has something do to with the TCP network, but I can run openmpi >>> on these machines with no problem. But for some reason I cannot run MPICH >>> on multiple nodes. Please let me know if more info is needed from my side. >>> I'm guessing there are some configuration that I am missing. I used MPICH >>> 3.1.3 for this test. I googled this problem but couldn't find any solution. >>> >>> ?In my MPI program, I am doing a simple allreduce over >>> MPI_COMM_WORLD?. >>> >>> ?my host file (hosts-hydra) is something like this: >>> oakmnt-0-a:1 >>> oakmnt-0-b:1 ? >>> >>> ?I get this error:? 
>>> >>> $ mpirun -hostfile hosts-hydra -np 2 test_dup >>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >>> status->MPI_TAG == recvtag >>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >>> status->MPI_TAG == recvtag >>> internal ABORT - process 1 >>> internal ABORT - process 0 >>> >>> >>> =================================================================================== >>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >>> = PID 30744 RUNNING AT oakmnt-0-b >>> = EXIT CODE: 1 >>> = CLEANING UP REMAINING PROCESSES >>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >>> >>> =================================================================================== >>> [mpiexec at vulcan13] HYDU_sock_read >>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file >>> descriptor) >>> [mpiexec at vulcan13] control_cb >>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read >>> command from proxy >>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event >>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned >>> error status >>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion >>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for >>> event >>> [mpiexec at vulcan13] main >>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error >>> waiting for completion >>> >>> Thanks. >>> Amin Hassani, >>> CIS department at UAB, >>> Birmingham, AL, USA. >>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe:https://lists.mpich.org/mailman/listinfo/discuss >>> >>> >>> >>> -- >>> Antonio J. Pe?a >>> Postdoctoral Appointee >>> Mathematics and Computer Science Division >>> Argonne National Laboratory >>> 9700 South Cass Avenue, Bldg. 240, Of. 3148 >>> Argonne, IL 60439-4847apenya at mcs.anl.govwww.mcs.anl.gov/~apenya >>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From wbland at anl.gov Tue Nov 25 21:21:23 2014 From: wbland at anl.gov (Bland, Wesley B.) 
Date: Wed, 26 Nov 2014 03:21:23 +0000 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <78B3A627-D829-40A5-98E6-632502B10F74@anl.gov> Message-ID: It?s hard to tell then. Other than some problems compiling (not declaring all of your variables), everything seems ok. Can you try running with the most recent alpha. I have no idea what bug we could have fixed here to make things work, but it?d be good to eliminate the possibility. Thanks, Wesley On Nov 25, 2014, at 10:11 PM, Amin Hassani > wrote: Here I attached config.log exits in the root folder where it is compiled. I'm not too familiar with MPICH but, there are other config.logs in other directories also but not sure if you needed them too. I don't have any specific environment variable that can relate to MPICH. Also tried with export HYDRA_HOST_FILE=
, but have the same problem. I don't do anything FT related in MPICH, I don't think this version of MPICH has anything related to FT in it. Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. > wrote: Can you also provide your config.log and any CVARs or other relevant environment variables that you might be setting (for instance, in relation to fault tolerance)? Thanks, Wesley On Nov 25, 2014, at 3:58 PM, Amin Hassani > wrote: This is the simplest code I have that doesn't run. #include #include #include #include #include int main(int argc, char** argv) { int rank, size; int i, j, k; double t1, t2; int rc; MPI_Init(&argc, &argv); MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; MPI_Comm_rank(world, &rank); MPI_Comm_size(world, &size); t2 = 1; MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); t_avg = t_avg / size; MPI_Finalize(); return 0; }? Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" > wrote: Hi Amin, Can you share with us a minimal piece of code with which you can reproduce this issue? Thanks, Antonio On 11/25/2014 12:52 PM, Amin Hassani wrote: Hi, I am having problem running MPICH, on multiple nodes. When I run an multiple MPI processes on one node, it totally works, but when I try to run on multiple nodes, it fails with the error below. My machines have Debian OS, Both infiniband and TCP interconnects. I'm guessing it has something do to with the TCP network, but I can run openmpi on these machines with no problem. But for some reason I cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there are some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution. ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. ?my host file (hosts-hydra) is something like this: oakmnt-0-a:1 oakmnt-0-b:1 ? ?I get this error:? $ mpirun -hostfile hosts-hydra -np 2 test_dup Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag internal ABORT - process 1 internal ABORT - process 0 =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 30744 RUNNING AT oakmnt-0-b = EXIT CODE: 1 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor) [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. 
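Before digging into the library itself, it is worth ruling out the usual multi-node suspects: Hydra needs passwordless ssh to the other host, and the MPICH installation has to be the same version (ideally the same path) on every node. Assuming the same hosts-hydra file as above, a quick sanity pass might look like this:

$ mpirun -hostfile hosts-hydra -np 2 hostname
      (both node names should come back; if not, the problem is ssh or the hostfile, not MPI)
$ mpirun -hostfile hosts-hydra -np 2 mpichversion
      (the reported version and configure options should be identical on both nodes)

If the library was configured with more than one nemesis network module, forcing plain TCP for one run can also separate an InfiniBand issue from a TCP one; in MPICH 3.x that is done through a control variable exported in the environment (the name to look for should be MPIR_CVAR_NEMESIS_NETMOD, but check the CVAR list of the exact release, since these names have changed over time and the variable only has an effect when several netmods were built in).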
_______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -- Antonio J. Pe?a Postdoctoral Appointee Mathematics and Computer Science Division Argonne National Laboratory 9700 South Cass Avenue, Bldg. 240, Of. 3148 Argonne, IL 60439-4847 apenya at mcs.anl.gov www.mcs.anl.gov/~apenya _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From wbland at anl.gov Tue Nov 25 21:02:38 2014 From: wbland at anl.gov (Bland, Wesley B.) Date: Wed, 26 Nov 2014 03:02:38 +0000 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> Message-ID: <78B3A627-D829-40A5-98E6-632502B10F74@anl.gov> Can you also provide your config.log and any CVARs or other relevant environment variables that you might be setting (for instance, in relation to fault tolerance)? Thanks, Wesley On Nov 25, 2014, at 3:58 PM, Amin Hassani > wrote: This is the simplest code I have that doesn't run. #include #include #include #include #include int main(int argc, char** argv) { int rank, size; int i, j, k; double t1, t2; int rc; MPI_Init(&argc, &argv); MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; MPI_Comm_rank(world, &rank); MPI_Comm_size(world, &size); t2 = 1; MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); t_avg = t_avg / size; MPI_Finalize(); return 0; }? Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" > wrote: Hi Amin, Can you share with us a minimal piece of code with which you can reproduce this issue? Thanks, Antonio On 11/25/2014 12:52 PM, Amin Hassani wrote: Hi, I am having problem running MPICH, on multiple nodes. When I run an multiple MPI processes on one node, it totally works, but when I try to run on multiple nodes, it fails with the error below. My machines have Debian OS, Both infiniband and TCP interconnects. I'm guessing it has something do to with the TCP network, but I can run openmpi on these machines with no problem. But for some reason I cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there are some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution. ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. ?my host file (hosts-hydra) is something like this: oakmnt-0-a:1 oakmnt-0-b:1 ? ?I get this error:? 
$ mpirun -hostfile hosts-hydra -np 2 test_dup Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag internal ABORT - process 1 internal ABORT - process 0 =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 30744 RUNNING AT oakmnt-0-b = EXIT CODE: 1 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor) [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -- Antonio J. Pe?a Postdoctoral Appointee Mathematics and Computer Science Division Argonne National Laboratory 9700 South Cass Avenue, Bldg. 240, Of. 3148 Argonne, IL 60439-4847 apenya at mcs.anl.gov www.mcs.anl.gov/~apenya _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 14:58:02 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 14:58:02 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: <5474EAA0.9090800@mcs.anl.gov> References: <5474EAA0.9090800@mcs.anl.gov> Message-ID: This is the simplest code I have that doesn't run. #include #include #include #include #include int main(int argc, char** argv) { int rank, size; int i, j, k; double t1, t2; int rc; MPI_Init(&argc, &argv); MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; MPI_Comm_rank(world, &rank); MPI_Comm_size(world, &size); t2 = 1; MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); t_avg = t_avg / size; MPI_Finalize(); return 0; }? Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" wrote: > > Hi Amin, > > Can you share with us a minimal piece of code with which you can reproduce > this issue? > > Thanks, > Antonio > > > > On 11/25/2014 12:52 PM, Amin Hassani wrote: > > Hi, > > I am having problem running MPICH, on multiple nodes. 
When I run an > multiple MPI processes on one node, it totally works, but when I try to run > on multiple nodes, it fails with the error below. > My machines have Debian OS, Both infiniband and TCP interconnects. I'm > guessing it has something do to with the TCP network, but I can run openmpi > on these machines with no problem. But for some reason I cannot run MPICH > on multiple nodes. Please let me know if more info is needed from my side. > I'm guessing there are some configuration that I am missing. I used MPICH > 3.1.3 for this test. I googled this problem but couldn't find any solution. > > ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. > > ?my host file (hosts-hydra) is something like this: > oakmnt-0-a:1 > oakmnt-0-b:1 ? > > ?I get this error:? > > $ mpirun -hostfile hosts-hydra -np 2 test_dup > Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: > status->MPI_TAG == recvtag > Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: > status->MPI_TAG == recvtag > internal ABORT - process 1 > internal ABORT - process 0 > > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 30744 RUNNING AT oakmnt-0-b > = EXIT CODE: 1 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > =================================================================================== > [mpiexec at vulcan13] HYDU_sock_read > (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file > descriptor) > [mpiexec at vulcan13] control_cb > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read > command from proxy > [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event > (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned > error status > [mpiexec at vulcan13] HYD_pmci_wait_for_completion > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for > event > [mpiexec at vulcan13] main > (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error > waiting for completion > > Thanks. > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe:https://lists.mpich.org/mailman/listinfo/discuss > > > > -- > Antonio J. Pe?a > Postdoctoral Appointee > Mathematics and Computer Science Division > Argonne National Laboratory > 9700 South Cass Avenue, Bldg. 240, Of. 3148 > Argonne, IL 60439-4847apenya at mcs.anl.govwww.mcs.anl.gov/~apenya > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From apenya at mcs.anl.gov Tue Nov 25 14:46:24 2014 From: apenya at mcs.anl.gov (=?UTF-8?B?IkFudG9uaW8gSi4gUGXDsWEi?=) Date: Tue, 25 Nov 2014 14:46:24 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: Message-ID: <5474EAA0.9090800@mcs.anl.gov> Hi Amin, Can you share with us a minimal piece of code with which you can reproduce this issue? 
Thanks, Antonio On 11/25/2014 12:52 PM, Amin Hassani wrote: > Hi, > > I am having problem running MPICH, on multiple nodes. When I run an > multiple MPI processes on one node, it totally works, but when I try > to run on multiple nodes, it fails with the error below. > My machines have Debian OS, Both infiniband and TCP interconnects. I'm > guessing it has something do to with the TCP network, but I can run > openmpi on these machines with no problem. But for some reason I > cannot run MPICH on multiple nodes. Please let me know if more info is > needed from my side. I'm guessing there are some configuration that I > am missing. I used MPICH 3.1.3 for this test. I googled this problem > but couldn't find any solution. > > ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. > > ?my host file (hosts-hydra) is something like this: > oakmnt-0-a:1 > oakmnt-0-b:1 ? > > ?I get this error:? > > $ mpirun -hostfile hosts-hydra -np 2 test_dup > Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: > status->MPI_TAG == recvtag > Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: > status->MPI_TAG == recvtag > internal ABORT - process 1 > internal ABORT - process 0 > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 30744 RUNNING AT oakmnt-0-b > = EXIT CODE: 1 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > =================================================================================== > [mpiexec at vulcan13] HYDU_sock_read > (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file > descriptor) > [mpiexec at vulcan13] control_cb > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read > command from proxy > [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event > (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback > returned error status > [mpiexec at vulcan13] HYD_pmci_wait_for_completion > (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error > waiting for event > [mpiexec at vulcan13] main > (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager > error waiting for completion > > Thanks. > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Antonio J. Pe?a Postdoctoral Appointee Mathematics and Computer Science Division Argonne National Laboratory 9700 South Cass Avenue, Bldg. 240, Of. 3148 Argonne, IL 60439-4847 apenya at mcs.anl.gov www.mcs.anl.gov/~apenya -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Tue Nov 25 04:14:13 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Tue, 25 Nov 2014 10:14:13 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9AE1D3@UWMBX04.uw.lu.se> Dear Xin, I just checked out the latest nightly snapshot 'v3.2a2-16-g8a0887b9'. So, everything works out on my laptop. 
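(When checking a nightly snapshot as above, one way to confirm at run time which MPICH build an executable actually linked against is the standard MPI-3 call MPI_Get_library_version. The short sketch below is illustrative and is not taken from this thread; MPICH typically reports its version, release date and configure options in the returned string.)

/* version_check.c - print the MPI library version string (sketch) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len;

    MPI_Init(&argc, &argv);
    MPI_Get_library_version(version, &len);  /* MPI-3 call; string identifies the build */
    printf("%s\n", version);
    MPI_Finalize();
    return 0;
}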
Thank you very much for your help and support! With best regards, Victor. _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 12:52:03 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 12:52:03 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes Message-ID: Hi, I am having problem running MPICH, on multiple nodes. When I run an multiple MPI processes on one node, it totally works, but when I try to run on multiple nodes, it fails with the error below. My machines have Debian OS, Both infiniband and TCP interconnects. I'm guessing it has something do to with the TCP network, but I can run openmpi on these machines with no problem. But for some reason I cannot run MPICH on multiple nodes. Please let me know if more info is needed from my side. I'm guessing there are some configuration that I am missing. I used MPICH 3.1.3 for this test. I googled this problem but couldn't find any solution. ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. ?my host file (hosts-hydra) is something like this: oakmnt-0-a:1 oakmnt-0-b:1? ?I get this error:? $ mpirun -hostfile hosts-hydra -np 2 test_dup Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: status->MPI_TAG == recvtag internal ABORT - process 1 internal ABORT - process 0 =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 30744 RUNNING AT oakmnt-0-b = EXIT CODE: 1 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== [mpiexec at vulcan13] HYDU_sock_read (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file descriptor) [mpiexec at vulcan13] control_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read command from proxy [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. -------------- next part -------------- An HTML attachment was scrubbed... 
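(The minimal reproducer posted in this thread lost its #include lines and the declaration of t_avg when the HTML mail was flattened to text. A restored sketch follows; the particular headers, the double t_avg declaration, and the file name taken from the mpirun command above are editorial assumptions, not part of the original posting.)

/* test_dup.c - restored sketch of the allreduce reproducer quoted above */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double t2 = 1.0, t_avg = 0.0;   /* t_avg is used but was not declared in the quoted code */

    MPI_Init(&argc, &argv);
    MPI_Comm world = MPI_COMM_WORLD;
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &size);

    /* the collective that was reported to fail when run across two nodes */
    MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world);
    t_avg = t_avg / size;

    MPI_Finalize();
    return 0;
}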
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From rndfax at yandex.ru Mon Nov 24 15:45:31 2014 From: rndfax at yandex.ru (Kuleshov Aleksey) Date: Tue, 25 Nov 2014 00:45:31 +0300 Subject: [mpich-discuss] Assertion in MX netmod In-Reply-To: <5473987F.5030206@mcs.anl.gov> References: <4319971416670754@web3j.yandex.ru> <107521416685978@web5o.yandex.ru> <54734A1B.8080807@mcs.anl.gov> <3416861416853409@web3g.yandex.ru> <5473987F.5030206@mcs.anl.gov> Message-ID: <749771416865531@web23h.yandex.ru> > Our policy is to keep unsupported code for a while in our releases in a > best effort practice, at least while it seems to be working and does not > bother us in our developments. The reality is that we do not further > have any hardware nor specific funding to keep supporting this netmod. Ok, got it. >> ?2) netmod newmad has the same routing as mx (which calls assertion) but newmad is still in 3.2a2 => MPICH still has broken code unless something was fixed in subroutings? > > Are you saying that you can reproduce the issue with the newmad netmod? No. > Otherwise, similar code paths do not necessarily mean that we will be > hitting the same bug. I really hope so! The last assertion was called in sending, when the path had it way from user program's MPI_Send directly to the netmod's function 'send' and newmad and mx codes are almost identical on this way. I said, "unless something was fixed". May be it was fixed after 3.1.2 release, may be it is in MX netmod -- I don't know. Now I see that 3.2a2 version has new mxm netmod which also uses that routing which causes assertion. So I'll try to use the latest MPICH code and port MX to it and see if I hit the same assertion. > We do extensive automated testing in multiple > netmods, architectures, compilers, and compiling configurations. Without > being able to reproduce the problem in other than MX, we cannot conclude > other than that the bug was specifically located in that netmod. In case > you confirm you are reproducing the same bug in the newmad netmod, we > will contact the external person who contributed and used to maintain it. Ok. I tried newmad several months ago with MPICH (during very short time -- never tried to use it on these tests), but, unfortunately, I don't have time to resurrect it and try it on these tests. Anyway, thank you for conversation. It helped me to move on -- to the newest MPICH version! _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From apenya at mcs.anl.gov Mon Nov 24 14:43:43 2014 From: apenya at mcs.anl.gov (=?UTF-8?B?IkFudG9uaW8gSi4gUGXDsWEi?=) Date: Mon, 24 Nov 2014 14:43:43 -0600 Subject: [mpich-discuss] Assertion in MX netmod In-Reply-To: <3416861416853409@web3g.yandex.ru> References: <4319971416670754@web3j.yandex.ru> <107521416685978@web5o.yandex.ru> <54734A1B.8080807@mcs.anl.gov> <3416861416853409@web3g.yandex.ru> Message-ID: <5473987F.5030206@mcs.anl.gov> On 11/24/2014 12:23 PM, Kuleshov Aleksey wrote: > Thank you Antonio for this information. This is very sad, because: > 1) in the releases 3.1.2 and 3.1.3 (which are stable releases!) MPICH has broken netmod? 
Our policy is to keep unsupported code for a while in our releases in a best effort practice, at least while it seems to be working and does not bother us in our developments. The reality is that we do not further have any hardware nor specific funding to keep supporting this netmod. > 2) netmod newmad has the same routing as mx (which calls assertion) but newmad is still in 3.2a2 => MPICH still has broken code unless something was fixed in subroutings? Are you saying that you can reproduce the issue with the newmad netmod? Otherwise, similar code paths do not necessarily mean that we will be hitting the same bug. We do extensive automated testing in multiple netmods, architectures, compilers, and compiling configurations. Without being able to reproduce the problem in other than MX, we cannot conclude other than that the bug was specifically located in that netmod. In case you confirm you are reproducing the same bug in the newmad netmod, we will contact the external person who contributed and used to maintain it. Antonio > > 24.11.2014, 18:09, "Antonio J. Pe?a" : >> Dear Kuleshov, >> >> In order to accomodate resources for more recent networking APIs we >> dropped support for the mx netmod, which in fact has been completely >> removed in our most recent 3.2 releases. So, unfortunately, we are not >> able to assist you with this issue. >> >> Best regards, >> Antonio >> >> On 11/22/2014 01:52 PM, Kuleshov Aleksey wrote: >>> And the same problem with different approach: >>> I downloaded from http://www.mcs.anl.gov/research/projects/mpi/mpi-test/tsuite.html mpi2test.tar.gz, built it and try >>> to run pingping test: >>>> MPITEST_VERBOSE=1 MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 2 /tests/pingping >>> [stdout] >>> Get new datatypes: send = MPI_INT, recv = MPI_INT >>> Get new datatypes: send = MPI_INT, recv = MPI_INT >>> Sending count = 1 of sendtype MPI_INT of total size 4 bytes >>> Sending count = 1 of sendtype MPI_INT of total size 4 bytes >>> Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE >>> Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE >>> Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes >>> Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes >>> Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT >>> Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes >>> Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT >>> Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes >>> Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT >>> Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT >>> Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes >>> Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes >>> Get new datatypes: send = int-vector, recv = MPI_INT >>> Sending count = 1 of sendtype int-vector of total size 4 bytes >>> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_send.c at line 435: n_iov > 0 >>> internal ABORT - process 0 >>> [/stdout] >>> >>> 22.11.2014, 18:39, "Kuleshov Aleksey" : >>>> Hello! Can you please help me with problem? >>>> >>>> I'm working on custom myriexpress library and I'm using MX netmod in MPICH v.3.1.2. >>>> For testing purposes I built OSU Micro Benchmarks v3.8. >>>> >>>> To run it on 7 nodes I execute test osu_alltoall as follows: >>>>> MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 7 /osu_alltoall >>>> It passed successfully (I also tried it on 2, 3, 4, 5 and 6 nodes - everything is alright). 
>>>> >>>> But now I want to run it on 8 nodes: >>>>> MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 8 /osu_alltoall >>>> [stdout] >>>> # OSU MPI All-to-All Personalized Exchange Latency Test v3.8 >>>> # Size Avg Latency(us) >>>> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>>> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>>> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>>> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>>> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>>> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>>> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>>> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>>> internal ABORT - process 4 >>>> internal ABORT - process 7 >>>> internal ABORT - process 2 >>>> internal ABORT - process 6 >>>> internal ABORT - process 3 >>>> internal ABORT - process 0 >>>> internal ABORT - process 5 >>>> internal ABORT - process 1 >>>> [/stdout] >>>> >>>> So, what does these assertions mean? >>>> Is it something wrong with MX netmod? >>>> Or in myriexpress library? >>>> Or in test osu_alltoall itself? >>>> >>>> BTW, osu_alltoall on 8 nodes passed successfully for TCP netmod. >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >> -- >> Antonio J. Pe?a >> Postdoctoral Appointee >> Mathematics and Computer Science Division >> Argonne National Laboratory >> 9700 South Cass Avenue, Bldg. 240, Of. 3148 >> Argonne, IL 60439-4847 >> apenya at mcs.anl.gov >> www.mcs.anl.gov/~apenya >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Antonio J. Pe?a Postdoctoral Appointee Mathematics and Computer Science Division Argonne National Laboratory 9700 South Cass Avenue, Bldg. 240, Of. 
3148 Argonne, IL 60439-4847 apenya at mcs.anl.gov www.mcs.anl.gov/~apenya _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From xinzhao3 at illinois.edu Mon Nov 24 14:12:12 2014 From: xinzhao3 at illinois.edu (Zhao, Xin) Date: Mon, 24 Nov 2014 20:12:12 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) In-Reply-To: <8D58A4B5E6148C419C6AD6334962375DDD9A5C0C@UWMBX04.uw.lu.se> References: <8D58A4B5E6148C419C6AD6334962375DDD9A4C54@UWMBX04.uw.lu.se>, <8D58A4B5E6148C419C6AD6334962375DDD9A5B7E@UWMBX04.uw.lu.se>, <8D58A4B5E6148C419C6AD6334962375DDD9A5B95@UWMBX04.uw.lu.se>, <0A407957589BAB4F924824150C4293EF5881150F@CITESMBX3.ad.uillinois.edu>, <8D58A4B5E6148C419C6AD6334962375DDD9A5C0C@UWMBX04.uw.lu.se> Message-ID: <0A407957589BAB4F924824150C4293EF5881FFFD@CITESMBX3.ad.uillinois.edu> Hi Victor, The bug is recently fixed in mpich/master (see https://trac.mpich.org/projects/mpich/ticket/2204). Could you try tonight's nightly snapshot? Thanks, Xin ________________________________________ From: Victor Vysotskiy [victor.vysotskiy at teokem.lu.se] Sent: Wednesday, November 19, 2014 3:00 AM To: Zhao, Xin Subject: RE: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) Hi Xin, >I think it is due to a bug in our MPICH RMA thanks for your email! It is a great news, that you have found a problematic place inside MPICH. >I created a ticket for this: https://trac.mpich.org/projects/mpich/ticket/2204, you can track the progress of this bug on it. I will keep an eye on it. With best regards, Victor. ________________________________________ From: Zhao, Xin [xinzhao3 at illinois.edu] Sent: Wednesday, November 19, 2014 4:40 AM To: discuss at mpich.org Cc: Victor Vysotskiy Subject: RE: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) Hi Victor, I looked your test code and I think it is due to a bug in our MPICH RMA that the request handler is not re-entrant safe, which makes win_ptr->at_completion_counter being decremented twice for one GET operation. We will fix it as soon as possible. I created a ticket for this: https://trac.mpich.org/projects/mpich/ticket/2204, you can track the progress of this bug on it. Thanks, Xin ________________________________________ From: Victor Vysotskiy [victor.vysotskiy at teokem.lu.se] Sent: Tuesday, November 18, 2014 6:00 AM To: discuss at mpich.org Subject: Re: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) Hi Pavan, >FYI, this is what we heard back from the Mellanox folks: >I think, issue could be because of older MXM (part of MOFED) being used. We can ask him to try latest MXM from HPCX (http://bgate.mellanox.com/products/hpcx) Indeed, I just checked the latest software stack, including: - hpcx-v1.2.0-258-icc-OFED-3.12-redhat6.5; - MPICH v3.2a2 ('--with-device=ch3:nemesis:mxm'); And, there is no problem with 'assertion failed in ch3u_handle_send_req.c' anymore! Many thanks for your help and support! With best regards, Victor. 
_______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From rndfax at yandex.ru Mon Nov 24 12:23:29 2014 From: rndfax at yandex.ru (Kuleshov Aleksey) Date: Mon, 24 Nov 2014 21:23:29 +0300 Subject: [mpich-discuss] Assertion in MX netmod In-Reply-To: <54734A1B.8080807@mcs.anl.gov> References: <4319971416670754@web3j.yandex.ru> <107521416685978@web5o.yandex.ru> <54734A1B.8080807@mcs.anl.gov> Message-ID: <3416861416853409@web3g.yandex.ru> Thank you Antonio for this information. This is very sad, because: 1) in the releases 3.1.2 and 3.1.3 (which are stable releases!) MPICH has broken netmod? 2) netmod newmad has the same routing as mx (which calls assertion) but newmad is still in 3.2a2 => MPICH still has broken code unless something was fixed in subroutings? 24.11.2014, 18:09, "Antonio J. Pe?a" : > Dear Kuleshov, > > In order to accomodate resources for more recent networking APIs we > dropped support for the mx netmod, which in fact has been completely > removed in our most recent 3.2 releases. So, unfortunately, we are not > able to assist you with this issue. > > Best regards, > ???Antonio > > On 11/22/2014 01:52 PM, Kuleshov Aleksey wrote: >> ?And the same problem with different approach: >> ?I downloaded from http://www.mcs.anl.gov/research/projects/mpi/mpi-test/tsuite.html mpi2test.tar.gz, built it and try >> ?to run pingping test: >>> ?MPITEST_VERBOSE=1 MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 2 /tests/pingping >> ?[stdout] >> ?Get new datatypes: send = MPI_INT, recv = MPI_INT >> ?Get new datatypes: send = MPI_INT, recv = MPI_INT >> ?Sending count = 1 of sendtype MPI_INT of total size 4 bytes >> ?Sending count = 1 of sendtype MPI_INT of total size 4 bytes >> ?Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE >> ?Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE >> ?Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes >> ?Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes >> ?Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT >> ?Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes >> ?Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT >> ?Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes >> ?Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT >> ?Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT >> ?Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes >> ?Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes >> ?Get new datatypes: send = int-vector, recv = MPI_INT >> ?Sending count = 1 of sendtype int-vector of total size 4 bytes >> ?Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_send.c at line 435: n_iov > 0 >> ?internal ABORT - process 0 >> ?[/stdout] >> >> ?22.11.2014, 18:39, "Kuleshov Aleksey" : >>> ?Hello! Can you please help me with problem? >>> >>> ?I'm working on custom myriexpress library and I'm using MX netmod in MPICH v.3.1.2. >>> ?For testing purposes I built OSU Micro Benchmarks v3.8. 
>>> >>> ?To run it on 7 nodes I execute test osu_alltoall as follows: >>>> ???MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 7 /osu_alltoall >>> ?It passed successfully (I also tried it on 2, 3, 4, 5 and 6 nodes - everything is alright). >>> >>> ?But now I want to run it on 8 nodes: >>>> ???MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 8 /osu_alltoall >>> ?[stdout] >>> ?# OSU MPI All-to-All Personalized Exchange Latency Test v3.8 >>> ?# Size ??????Avg Latency(us) >>> ?Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>> ?Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>> ?Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>> ?Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>> ?Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>> ?Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>> ?Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>> ?Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >>> ?internal ABORT - process 4 >>> ?internal ABORT - process 7 >>> ?internal ABORT - process 2 >>> ?internal ABORT - process 6 >>> ?internal ABORT - process 3 >>> ?internal ABORT - process 0 >>> ?internal ABORT - process 5 >>> ?internal ABORT - process 1 >>> ?[/stdout] >>> >>> ?So, what does these assertions mean? >>> ?Is it something wrong with MX netmod? >>> ?Or in myriexpress library? >>> ?Or in test osu_alltoall itself? >>> >>> ?BTW, osu_alltoall on 8 nodes passed successfully for TCP netmod. >>> ?_______________________________________________ >>> ?discuss mailing list ????discuss at mpich.org >>> ?To manage subscription options or unsubscribe: >>> ?https://lists.mpich.org/mailman/listinfo/discuss >> ?_______________________________________________ >> ?discuss mailing list ????discuss at mpich.org >> ?To manage subscription options or unsubscribe: >> ?https://lists.mpich.org/mailman/listinfo/discuss > > -- > Antonio J. Pe?a > Postdoctoral Appointee > Mathematics and Computer Science Division > Argonne National Laboratory > 9700 South Cass Avenue, Bldg. 240, Of. 3148 > Argonne, IL 60439-4847 > apenya at mcs.anl.gov > www.mcs.anl.gov/~apenya > > _______________________________________________ > discuss mailing list ????discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From apenya at mcs.anl.gov Mon Nov 24 09:09:15 2014 From: apenya at mcs.anl.gov (=?UTF-8?B?IkFudG9uaW8gSi4gUGXDsWEi?=) Date: Mon, 24 Nov 2014 09:09:15 -0600 Subject: [mpich-discuss] Assertion in MX netmod In-Reply-To: <107521416685978@web5o.yandex.ru> References: <4319971416670754@web3j.yandex.ru> <107521416685978@web5o.yandex.ru> Message-ID: <54734A1B.8080807@mcs.anl.gov> Dear Kuleshov, In order to accomodate resources for more recent networking APIs we dropped support for the mx netmod, which in fact has been completely removed in our most recent 3.2 releases. So, unfortunately, we are not able to assist you with this issue. 
Best regards, Antonio On 11/22/2014 01:52 PM, Kuleshov Aleksey wrote: > And the same problem with different approach: > I downloaded from http://www.mcs.anl.gov/research/projects/mpi/mpi-test/tsuite.html mpi2test.tar.gz, built it and try > to run pingping test: > >> MPITEST_VERBOSE=1 MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 2 /tests/pingping > [stdout] > Get new datatypes: send = MPI_INT, recv = MPI_INT > Get new datatypes: send = MPI_INT, recv = MPI_INT > Sending count = 1 of sendtype MPI_INT of total size 4 bytes > Sending count = 1 of sendtype MPI_INT of total size 4 bytes > Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE > Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE > Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes > Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes > Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT > Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes > Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT > Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes > Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT > Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT > Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes > Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes > Get new datatypes: send = int-vector, recv = MPI_INT > Sending count = 1 of sendtype int-vector of total size 4 bytes > Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_send.c at line 435: n_iov > 0 > internal ABORT - process 0 > [/stdout] > > 22.11.2014, 18:39, "Kuleshov Aleksey" : >> Hello! Can you please help me with problem? >> >> I'm working on custom myriexpress library and I'm using MX netmod in MPICH v.3.1.2. >> For testing purposes I built OSU Micro Benchmarks v3.8. >> >> To run it on 7 nodes I execute test osu_alltoall as follows: >>> MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 7 /osu_alltoall >> It passed successfully (I also tried it on 2, 3, 4, 5 and 6 nodes - everything is alright). >> >> But now I want to run it on 8 nodes: >>> MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 8 /osu_alltoall >> [stdout] >> # OSU MPI All-to-All Personalized Exchange Latency Test v3.8 >> # Size Avg Latency(us) >> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >> Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 >> internal ABORT - process 4 >> internal ABORT - process 7 >> internal ABORT - process 2 >> internal ABORT - process 6 >> internal ABORT - process 3 >> internal ABORT - process 0 >> internal ABORT - process 5 >> internal ABORT - process 1 >> [/stdout] >> >> So, what does these assertions mean? >> Is it something wrong with MX netmod? >> Or in myriexpress library? >> Or in test osu_alltoall itself? 
>> >> BTW, osu_alltoall on 8 nodes passed successfully for TCP netmod. >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Antonio J. Pe?a Postdoctoral Appointee Mathematics and Computer Science Division Argonne National Laboratory 9700 South Cass Avenue, Bldg. 240, Of. 3148 Argonne, IL 60439-4847 apenya at mcs.anl.gov www.mcs.anl.gov/~apenya _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From rndfax at yandex.ru Sat Nov 22 09:39:14 2014 From: rndfax at yandex.ru (Kuleshov Aleksey) Date: Sat, 22 Nov 2014 18:39:14 +0300 Subject: [mpich-discuss] Assertion in MX netmod Message-ID: <4319971416670754@web3j.yandex.ru> Hello! Can you please help me with problem? I'm working on custom myriexpress library and I'm using MX netmod in MPICH v.3.1.2. For testing purposes I built OSU Micro Benchmarks v3.8. To run it on 7 nodes I execute test osu_alltoall as follows: > MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 7 /osu_alltoall It passed successfully (I also tried it on 2, 3, 4, 5 and 6 nodes - everything is alright). But now I want to run it on 8 nodes: > MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 8 /osu_alltoall [stdout] # OSU MPI All-to-All Personalized Exchange Latency Test v3.8 # Size Avg Latency(us) Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 internal ABORT - process 4 internal ABORT - process 7 internal ABORT - process 2 internal ABORT - process 6 internal ABORT - process 3 internal ABORT - process 0 internal ABORT - process 5 internal ABORT - process 1 [/stdout] So, what does these assertions mean? Is it something wrong with MX netmod? Or in myriexpress library? Or in test osu_alltoall itself? BTW, osu_alltoall on 8 nodes passed successfully for TCP netmod. 
_______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From rndfax at yandex.ru Sat Nov 22 13:52:58 2014 From: rndfax at yandex.ru (Kuleshov Aleksey) Date: Sat, 22 Nov 2014 22:52:58 +0300 Subject: [mpich-discuss] Assertion in MX netmod In-Reply-To: <4319971416670754@web3j.yandex.ru> References: <4319971416670754@web3j.yandex.ru> Message-ID: <107521416685978@web5o.yandex.ru> And the same problem with different approach: I downloaded from http://www.mcs.anl.gov/research/projects/mpi/mpi-test/tsuite.html mpi2test.tar.gz, built it and try to run pingping test: > MPITEST_VERBOSE=1 MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 2 /tests/pingping [stdout] Get new datatypes: send = MPI_INT, recv = MPI_INT Get new datatypes: send = MPI_INT, recv = MPI_INT Sending count = 1 of sendtype MPI_INT of total size 4 bytes Sending count = 1 of sendtype MPI_INT of total size 4 bytes Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE Get new datatypes: send = MPI_DOUBLE, recv = MPI_DOUBLE Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes Sending count = 1 of sendtype MPI_DOUBLE of total size 8 bytes Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes Get new datatypes: send = MPI_FLOAT_INT, recv = MPI_FLOAT_INT Sending count = 1 of sendtype MPI_FLOAT_INT of total size 8 bytes Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT Get new datatypes: send = dup of MPI_INT, recv = dup of MPI_INT Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes Sending count = 1 of sendtype dup of MPI_INT of total size 4 bytes Get new datatypes: send = int-vector, recv = MPI_INT Sending count = 1 of sendtype int-vector of total size 4 bytes Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_send.c at line 435: n_iov > 0 internal ABORT - process 0 [/stdout] 22.11.2014, 18:39, "Kuleshov Aleksey" : > Hello! Can you please help me with problem? > > I'm working on custom myriexpress library and I'm using MX netmod in MPICH v.3.1.2. > For testing purposes I built OSU Micro Benchmarks v3.8. > > To run it on 7 nodes I execute test osu_alltoall as follows: >> ?MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 7 /osu_alltoall > > It passed successfully (I also tried it on 2, 3, 4, 5 and 6 nodes - everything is alright). 
> > But now I want to run it on 8 nodes: >> ?MPICH_NEMESIS_NETMOD=mx mpiexec -f /tmp/m -n 8 /osu_alltoall > > [stdout] > # OSU MPI All-to-All Personalized Exchange Latency Test v3.8 > # Size ??????Avg Latency(us) > Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 > Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 > Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 > Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 > Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 > Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 > Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 > Assertion failed in file ../src/mpid/ch3/channels/nemesis/netmod/mx/mx_poll.c at line 784: n_iov > 0 > internal ABORT - process 4 > internal ABORT - process 7 > internal ABORT - process 2 > internal ABORT - process 6 > internal ABORT - process 3 > internal ABORT - process 0 > internal ABORT - process 5 > internal ABORT - process 1 > [/stdout] > > So, what does these assertions mean? > Is it something wrong with MX netmod? > Or in myriexpress library? > Or in test osu_alltoall itself? > > BTW, osu_alltoall on 8 nodes passed successfully for TCP netmod. > _______________________________________________ > discuss mailing list ????discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From apenya at anl.gov Thu Nov 20 16:50:06 2014 From: apenya at anl.gov (=?iso-8859-1?Q?Antonio_J._Pe=F1a?=) Date: Thu, 20 Nov 2014 16:50:06 -0600 Subject: [mpich-discuss] MPI_Iprobe bug in MPICH for BGQ? In-Reply-To: <6F4D5A685397B940825208C64CF853A747800895@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A747800895@HALAS.anl.gov> Message-ID: <006b01d00514$5424ff60$fc6efe20$@anl.gov> Hi Florin, I don?t think you?re hitting any bug, and I?d say both behaviors are correct. Note that when you do a sleep the MPI implementation is not making any progress, so if the Isend call didn?t push the message immediately to the network, which is likely to be more the case on an HPC network than on a socket-based one, the sleep is unproductive, and so the Iprobe returns nothing arrived. Best, Toni -------------------------------------------------------------- Antonio J. Pe?a Postdoctoral Appointee Mathematics and Computer Science Division (MCS) Argonne National Laboratory 9700 S. Cass Avenue Argonne, IL 60439 Building 340 - Office 3148 apenya at mcs.anl.gov (+1) 630-252-7928 From: Isaila, Florin D. [mailto:fisaila at mcs.anl.gov] Sent: Thursday, November 20, 2014 4:20 PM To: mpich-discuss at mcs.anl.gov Subject: [mpich-discuss] MPI_Iprobe bug in MPICH for BGQ? Hi, when I run the program from below on 1 node on BGQ (Vesta), the message is not received (flag is 0). However on a Ubuntu, the message is received (flag is non-zero). If I add another Iprobe (uncomment the Iprobe in the code below) the message is received on both BGQ and Ubuntu. Note that the program sleeps for 1 second after the Isend. Is it a bug? 
This happens for both MPICH-3.1.3 and MPICH-3.1. #include "mpi.h" #include #include int main(int argc, char **argv) { int send_int, recv_int, tag, flag; MPI_Status status; MPI_Request req; MPI_Init(&argc, &argv); tag = 0; send_int = 100; MPI_Isend(&send_int, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &req ); sleep(1); MPI_Iprobe(MPI_ANY_SOURCE , MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status ); //MPI_Iprobe(MPI_ANY_SOURCE , tag, MPI_COMM_WORLD, &flag, &status ); if (flag) { MPI_Recv( &recv_int, 1, MPI_INT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status); printf("Received = %d\n", recv_int); } else printf("Message not received yet"); MPI_Waitall(1, &req, MPI_STATUSES_IGNORE); MPI_Finalize(); return 0; } Thanks Florin -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From Hirak_Roy at mentor.com Thu Nov 20 11:56:03 2014 From: Hirak_Roy at mentor.com (Roy, Hirak) Date: Thu, 20 Nov 2014 17:56:03 +0000 Subject: [mpich-discuss] Re : Client hangs if server dies in dynamic process management Message-ID: <8B38871795FD7042B826C1D1670246FBE24289FE@EU-MBX-01.mgc.mentorg.com> Hi Huiwei, Thanks for acknowledgement. http://trac.mpich.org/projects/mpich/ticket/2205 Could you please let me know what would be the target release for this? Thanks, Hirak -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From fisaila at mcs.anl.gov Thu Nov 20 12:09:43 2014 From: fisaila at mcs.anl.gov (Isaila, Florin D.) Date: Thu, 20 Nov 2014 18:09:43 +0000 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <546E0ABC.4090100@mcs.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov>, <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov>, <546CCBDB.5030405@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A747800839@HALAS.anl.gov>, <546E0ABC.4090100@mcs.anl.gov> Message-ID: <6F4D5A685397B940825208C64CF853A747800866@HALAS.anl.gov> ldd output: fisaila at howard:f77_program$ gfortran init_finalize.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -L.. 
-ltarget -lmpifort -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi fisaila at howard:f77_program$ ldd a.out linux-vdso.so.1 => (0x00007fff5c1ff000) libmpifort.so.12 => /homes/fisaila/software/mpich/lib/libmpifort.so.12 (0x00007fd2b4957000) libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007fd2b460f000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fd2b424f000) libmpi.so.12 => /homes/fisaila/software/mpich/lib/libmpi.so.12 (0x00007fd2b394c000) libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007fd2b3716000) libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fd2b3419000) /lib64/ld-linux-x86-64.so.2 (0x00007fd2b4b91000) librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fd2b3211000) libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fd2b2ff4000) libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fd2b2ddd000) ________________________________________ From: Kenneth Raffenetti [raffenet at mcs.anl.gov] Sent: Thursday, November 20, 2014 9:37 AM To: discuss at mpich.org Subject: Re: [mpich-discuss] f77 bindings and profiling Can you paste the output of ldd on your binary? Ken On 11/20/2014 09:22 AM, Isaila, Florin D. wrote: > Hi Ken, > > it is not working this way. > > Florin > ________________________________________ > From: Kenneth Raffenetti [raffenet at mcs.anl.gov] > Sent: Wednesday, November 19, 2014 10:56 AM > To: discuss at mpich.org > Cc: mpich-discuss at mcs.anl.gov > Subject: Re: [mpich-discuss] f77 bindings and profiling > > On 11/18/2014 09:47 AM, Isaila, Florin D. wrote: >> fisaila at howard:f77$ mpif77 -show fpi.f >> gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lmpifort -lfoo -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi > > This looks to be the problem. PROFILE_PRELIB needs to be added *before* > libmpifort. Some time after 3.1, we re-organized which symbols go into > which libraries and now libmpifort.so contains PMPI_Init and all its > weak aliases (MPI_Init, mpi_init_, etc.) This means the MPI_Init symbol > is resolved before your library is searched. > > Can you manually try a compile line with -lfoo before -lmpifort and > confirm that it works as expected? I.e. > > gfortran fpi.f -I/homes/fisaila/software/mpich/include > -I/homes/fisaila/software/mpich/include > -L/homes/fisaila/software/mpich/lib -lfoo -lmpifort -Wl,-rpath > -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi > > Ken > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From fisaila at mcs.anl.gov Thu Nov 20 09:22:45 2014 From: fisaila at mcs.anl.gov (Isaila, Florin D.) 
Date: Thu, 20 Nov 2014 15:22:45 +0000 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <546CCBDB.5030405@mcs.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov>, <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov>, <546CCBDB.5030405@mcs.anl.gov> Message-ID: <6F4D5A685397B940825208C64CF853A747800839@HALAS.anl.gov> Hi Ken, it is not working this way. Florin ________________________________________ From: Kenneth Raffenetti [raffenet at mcs.anl.gov] Sent: Wednesday, November 19, 2014 10:56 AM To: discuss at mpich.org Cc: mpich-discuss at mcs.anl.gov Subject: Re: [mpich-discuss] f77 bindings and profiling On 11/18/2014 09:47 AM, Isaila, Florin D. wrote: > fisaila at howard:f77$ mpif77 -show fpi.f > gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lmpifort -lfoo -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi This looks to be the problem. PROFILE_PRELIB needs to be added *before* libmpifort. Some time after 3.1, we re-organized which symbols go into which libraries and now libmpifort.so contains PMPI_Init and all its weak aliases (MPI_Init, mpi_init_, etc.) This means the MPI_Init symbol is resolved before your library is searched. Can you manually try a compile line with -lfoo before -lmpifort and confirm that it works as expected? I.e. gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lfoo -lmpifort -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi Ken _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From raffenet at mcs.anl.gov Thu Nov 20 09:37:32 2014 From: raffenet at mcs.anl.gov (Kenneth Raffenetti) Date: Thu, 20 Nov 2014 09:37:32 -0600 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <6F4D5A685397B940825208C64CF853A747800839@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov>, <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov>, <546CCBDB.5030405@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A747800839@HALAS.anl.gov> Message-ID: <546E0ABC.4090100@mcs.anl.gov> Can you paste the output of ldd on your binary? Ken On 11/20/2014 09:22 AM, Isaila, Florin D. wrote: > Hi Ken, > > it is not working this way. > > Florin > ________________________________________ > From: Kenneth Raffenetti [raffenet at mcs.anl.gov] > Sent: Wednesday, November 19, 2014 10:56 AM > To: discuss at mpich.org > Cc: mpich-discuss at mcs.anl.gov > Subject: Re: [mpich-discuss] f77 bindings and profiling > > On 11/18/2014 09:47 AM, Isaila, Florin D. wrote: >> fisaila at howard:f77$ mpif77 -show fpi.f >> gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lmpifort -lfoo -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi > > This looks to be the problem. PROFILE_PRELIB needs to be added *before* > libmpifort. 
Some time after 3.1, we re-organized which symbols go into > which libraries and now libmpifort.so contains PMPI_Init and all its > weak aliases (MPI_Init, mpi_init_, etc.) This means the MPI_Init symbol > is resolved before your library is searched. > > Can you manually try a compile line with -lfoo before -lmpifort and > confirm that it works as expected? I.e. > > gfortran fpi.f -I/homes/fisaila/software/mpich/include > -I/homes/fisaila/software/mpich/include > -L/homes/fisaila/software/mpich/lib -lfoo -lmpifort -Wl,-rpath > -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi > > Ken > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From raffenet at mcs.anl.gov Wed Nov 19 10:56:59 2014 From: raffenet at mcs.anl.gov (Kenneth Raffenetti) Date: Wed, 19 Nov 2014 10:56:59 -0600 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov>, <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov> Message-ID: <546CCBDB.5030405@mcs.anl.gov> On 11/18/2014 09:47 AM, Isaila, Florin D. wrote: > fisaila at howard:f77$ mpif77 -show fpi.f > gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lmpifort -lfoo -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi This looks to be the problem. PROFILE_PRELIB needs to be added *before* libmpifort. Some time after 3.1, we re-organized which symbols go into which libraries and now libmpifort.so contains PMPI_Init and all its weak aliases (MPI_Init, mpi_init_, etc.) This means the MPI_Init symbol is resolved before your library is searched. Can you manually try a compile line with -lfoo before -lmpifort and confirm that it works as expected? I.e. gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lfoo -lmpifort -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi Ken _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Wed Nov 19 15:05:07 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Wed, 19 Nov 2014 21:05:07 +0000 Subject: [mpich-discuss] Client hangs if server dies in dynamic process management In-Reply-To: <8B38871795FD7042B826C1D1670246FBE2415CE0@EU-MBX-01.mgc.mentorg.com> References: <8B38871795FD7042B826C1D1670246FBE2415CE0@EU-MBX-01.mgc.mentorg.com> Message-ID: <3169E329-B0D1-4B9D-9E6D-E5288AC43287@anl.gov> Hi, Hirak, Yes I can repeat the bug on both MacOS and Ubuntu with socket channel. I have created a ticket for it. You can track the progress here: http://trac.mpich.org/projects/mpich/ticket/2205 Thanks for reporting the bug. ? 
Huiwei > On Nov 18, 2014, at 12:49 PM, Roy, Hirak wrote: > > > Hi Huiwei, > > 1> Did you start your nameserver ? > 2> Did the server program crash? > I see the same hang (incomplete MPI_Finalize in client). > > Here is my command line: > ? hydra_namserver & > ? mpiexec ?n 1 ?nameserver ./server > ? mpiexec ?n 1 ?nameserver ./client > > > > MPICH Version: 3.2a2 > MPICH Release date: Sun Nov 16 11:09:31 CST 2014 > MPICH Device: ch3:sock > MPICH configure: --prefix /home/hroy/local//mpich-3.2a2/linux_x86_64 --disable-f77 --disable-fc --disable-f90modules --disable-cxx --enable-fast=nochkmsg --enable-fast=notiming --enable-fast=ndebug --enable-fast=O3 --with-device=ch3:sock --enable-g=dbg --disable-fortran --without-valgrind CFLAGS=-O3 -fPIC CXXFLAGS=-O3 -fPIC > MPICH CC: /u/prod/gnu/gcc/20121129/gcc-4.5.0-linux_x86_64/bin/gcc -O3 -fPIC -g -O3 > MPICH CXX: no -O3 -fPIC -g > MPICH F77: no -g > MPICH FC: no -g > > > Thanks, > Hirak > > Could you try with the latest mpich-3.2a2? > The client exit successfully on my Macbook with sock channel. > > ? > Huiwei > > > On Nov 16, 2014, at 10:44 PM, Hirak Roy wrote: > > > > Hi All, > > > > Here is my sample program. I am using channel sock of mpich-3.0.4. > > > > I am running it as > > > mpiexec -n 1 ./server.out > > > mpiexec -n 1 ./client.out > > > > Here my client program (client.c) hangs in MPI_Finalize. > > There is an assert in the server.c where server exits. > > > > There is no way to detect that in client. > > Even if we detect that using some timeout strategy, the client hangs in the finalize step. > > Could you please suggest what is going wrong here or is this a bug in sock channel? > > > > Thanks, > > Hirak > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jan.bierbaum at tudos.org Wed Nov 19 04:20:30 2014 From: jan.bierbaum at tudos.org (Jan Bierbaum) Date: Wed, 19 Nov 2014 11:20:30 +0100 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <6F4D5A685397B940825208C64CF853A7477F88C7@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov> , <7DD38ACE-B6AE-405B-9B6E-826A4D0461C9@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F88C7@HALAS.anl.gov> Message-ID: <546C6EEE.70506@tudos.org> Hi! On 18.11.2014 20:02, Isaila, Florin D. wrote: > I would like to write the MPI_Init and call PMPI_Init, as I want all > MPI codes to work with that. I expect that my MPI_Init replaces the week > MPI_Init symbol in the mpi library for C programs (this works) , but > also replaces the MPI_Init used in the implementation of mpi_init for > Fortran programs (this does not). I did the same thing recently and it does work fine with Fortran code if you explicitly link against 'libfmpich.a' (in the 'lib' directory of your MPICH installation). The order is important though: mpicc libfmpich.a You may need to give the full path to 'libfmpich.a'. 
The 'mpicc' wrapper will make sure that all necessary MPICH libraries are passed to the linker as well. Regards, Jan _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From fisaila at mcs.anl.gov Wed Nov 19 09:43:15 2014 From: fisaila at mcs.anl.gov (Isaila, Florin D.) Date: Wed, 19 Nov 2014 15:43:15 +0000 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <546C6EEE.70506@tudos.org> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov> , <7DD38ACE-B6AE-405B-9B6E-826A4D0461C9@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F88C7@HALAS.anl.gov>, <546C6EEE.70506@tudos.org> Message-ID: <6F4D5A685397B940825208C64CF853A7477FD77E@HALAS.anl.gov> Thanks Jan, this works indeed for me with mpich-3.1 (by using mpif77 compiler, I think this is what you meant). However, it does not work with mpich-3.1.3. Regards Florin ________________________________________ From: Jan Bierbaum [jan.bierbaum at tudos.org] Sent: Wednesday, November 19, 2014 4:20 AM To: discuss at mpich.org Subject: Re: [mpich-discuss] f77 bindings and profiling Hi! On 18.11.2014 20:02, Isaila, Florin D. wrote: > I would like to write the MPI_Init and call PMPI_Init, as I want all > MPI codes to work with that. I expect that my MPI_Init replaces the week > MPI_Init symbol in the mpi library for C programs (this works) , but > also replaces the MPI_Init used in the implementation of mpi_init for > Fortran programs (this does not). I did the same thing recently and it does work fine with Fortran code if you explicitly link against 'libfmpich.a' (in the 'lib' directory of your MPICH installation). The order is important though: mpicc libfmpich.a You may need to give the full path to 'libfmpich.a'. The 'mpicc' wrapper will make sure that all necessary MPICH libraries are passed to the linker as well. Regards, Jan _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From xinzhao3 at illinois.edu Tue Nov 18 21:40:59 2014 From: xinzhao3 at illinois.edu (Zhao, Xin) Date: Wed, 19 Nov 2014 03:40:59 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) In-Reply-To: <8D58A4B5E6148C419C6AD6334962375DDD9A5B95@UWMBX04.uw.lu.se> References: <8D58A4B5E6148C419C6AD6334962375DDD9A4C54@UWMBX04.uw.lu.se>, <8D58A4B5E6148C419C6AD6334962375DDD9A5B7E@UWMBX04.uw.lu.se>, <8D58A4B5E6148C419C6AD6334962375DDD9A5B95@UWMBX04.uw.lu.se> Message-ID: <0A407957589BAB4F924824150C4293EF5881150F@CITESMBX3.ad.uillinois.edu> Hi Victor, I looked your test code and I think it is due to a bug in our MPICH RMA that the request handler is not re-entrant safe, which makes win_ptr->at_completion_counter being decremented twice for one GET operation. We will fix it as soon as possible. I created a ticket for this: https://trac.mpich.org/projects/mpich/ticket/2204, you can track the progress of this bug on it. 
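(Victor's failing test program is not reproduced in this thread. The listing below is only an illustration of the kind of operation under discussion, an MPI_Get that uses a non-contiguous derived datatype inside a fence epoch, which is the code path the at_completion_counter accounting above refers to. The vector layout and buffer sizes are arbitrary assumptions, not the failing test case.)

#include <mpi.h>
int main(int argc, char **argv)
{
    int i, win_buf[64], got[64];
    MPI_Win win;
    MPI_Datatype vec;
    MPI_Init(&argc, &argv);
    for (i = 0; i < 64; i++) { win_buf[i] = i; got[i] = -1; }
    MPI_Win_create(win_buf, 64 * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* every other int: a non-contiguous (derived) datatype */
    MPI_Type_vector(16, 1, 2, MPI_INT, &vec);
    MPI_Type_commit(&vec);
    MPI_Win_fence(0, win);
    /* GET with a derived datatype exercises the request handler discussed above */
    MPI_Get(got, 1, vec, 0, 0, 1, vec, win);
    MPI_Win_fence(0, win);
    MPI_Type_free(&vec);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}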
Thanks, Xin ________________________________________ From: Victor Vysotskiy [victor.vysotskiy at teokem.lu.se] Sent: Tuesday, November 18, 2014 6:00 AM To: discuss at mpich.org Subject: Re: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) Hi Pavan, >FYI, this is what we heard back from the Mellanox folks: >I think, issue could be because of older MXM (part of MOFED) being used. We can ask him to try latest MXM from HPCX (http://bgate.mellanox.com/products/hpcx) Indeed, I just checked the latest software stack, including: - hpcx-v1.2.0-258-icc-OFED-3.12-redhat6.5; - MPICH v3.2a2 ('--with-device=ch3:nemesis:mxm'); And, there is no problem with 'assertion failed in ch3u_handle_send_req.c' anymore! Many thanks for your help and support! With best regards, Victor. _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From fisaila at mcs.anl.gov Tue Nov 18 13:02:09 2014 From: fisaila at mcs.anl.gov (Isaila, Florin D.) Date: Tue, 18 Nov 2014 19:02:09 +0000 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <7DD38ACE-B6AE-405B-9B6E-826A4D0461C9@mcs.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov> , <7DD38ACE-B6AE-405B-9B6E-826A4D0461C9@mcs.anl.gov> Message-ID: <6F4D5A685397B940825208C64CF853A7477F88C7@HALAS.anl.gov> Junchao, as Rajeev mentions, I would like to write the MPI_Init and call PMPI_Init, as I want all MPI codes to work with that. I expect that my MPI_Init replaces the week MPI_Init symbol in the mpi library for C programs (this works) , but also replaces the MPI_Init used in the implementation of mpi_init for Fortran programs (this does not). Florin ________________________________________ From: Rajeev Thakur [thakur at mcs.anl.gov] Sent: Tuesday, November 18, 2014 12:49 PM To: discuss at mpich.org Subject: Re: [mpich-discuss] f77 bindings and profiling Isn't it the other way around: implement MPI_ and call PMPI_ ? And if you are mixing Fortran and C, won't you need the underscore? Rajeev On Nov 18, 2014, at 12:40 PM, Junchao Zhang wrote: > Florin, > As we discussed, it looks your aim is to provide MPI profiling (implemented in C) to Fortran+MPI code. > Try this: Implement your profiling layer in PMPI_Xxxx(), and call MPI_Xxxx() in it. You do not need to change cases or add trailing underscores. Then, insert your library after libmpifort.a, before libmpi.a in linking > > > --Junchao Zhang > > On Tue, Nov 18, 2014 at 9:47 AM, Isaila, Florin D. wrote: > Hi, > > it works what Rajeev suggests, defining the C function as mpi_init_ (one underscore for g77). In this case I would need a wrapper for and one for Fortran. > > However I understand from Section 14.2.1 of the MPI3 document (thanks Junchao for pointing to this) that I should be able to define just an MPI_Init wrapper in C and the call of MPI_Init(0, 0) from the Fortran implementation of mpi_init should be also redirected to my function. > > Pavan, I use MPICH-3.1.3. 
There is slight difference when I run the example you gave me, the libpmpi library does not appear: > fisaila at howard:f77$ mpif77 -show fpi.f > gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lmpifort -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi > > fisaila at howard:f77$ export PROFILE_PRELIB="-lfoo" > > fisaila at howard:f77$ mpif77 -show fpi.f > gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lmpifort -lfoo -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi > > The mpi and mpifort have the following symbols: > > fisaila at howard:f77_program$ nm /homes/fisaila/software/mpich/lib/libmpi.a | grep MPI_Init > 000000000000146d W MPI_Init > 000000000000146d T PMPI_Init > > > fisaila at howard:f77_program$ nm /homes/fisaila/software/mpich/lib/libmpifort.a | grep -i MPI_Init > 0000000000000000 W MPI_INIT > 0000000000000000 W PMPI_INIT > U PMPI_Init > 0000000000000000 W mpi_init > 0000000000000000 W mpi_init_ > 0000000000000000 W mpi_init__ > 0000000000000000 W pmpi_init > 0000000000000000 T pmpi_init_ > 0000000000000000 W pmpi_init__ > > Thanks > Florin > > > ________________________________________ > From: Rajeev Thakur [thakur at mcs.anl.gov] > Sent: Monday, November 17, 2014 3:39 PM > To: discuss at mpich.org > Cc: mpich-discuss at mcs.anl.gov > Subject: Re: [mpich-discuss] f77 bindings and profiling > > You need to add the right number of underscores at the end of the C function depending on the Fortran compiler you are using. For gfortran I think it is two underscores. So define the C function as mpi_init__. If that doesn't work, use one underscore. MPICH detects all this automatically at configure time. > > Rajeev > > On Nov 17, 2014, at 3:28 PM, "Isaila, Florin D." wrote: > > > Hi , > > > > I am trying to use MPI profiling to make mpi_init from a F77 program call my MPI_Init (written in C), but I do not manage to achieve that. In this simple F77 program: > > program main > > include 'mpif.h' > > integer error > > call mpi_init(error) > > call mpi_finalize(error) > > end > > > > I try to make the mpi_init call: > > int MPI_Init (int *argc, char ***argv){ int ret; > > printf("My function!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"); > > ret = PMPI_Init(argc, argv); > > return ret; > > } > > > > My MPI_Init belongs to a library libtarget.a I created. I use -profile for compiling and I created the target.conf containing: > > PROFILE_PRELIB="-L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget" > > in the right place. > > > > The library appears in the command line before the mpich library: > > mpif77 -show -g -profile=target init_finalize.f -o init_finalize > > gfortran -g init_finalize.f -o init_finalize -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -lmpich -lopa -lmpl -lrt -lpthread > > > > However, the program never gets into my MPI_Init. > > > > Any suggestion about what I am missing? 
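(For reference, the underscore-decorated wrapper Rajeev describes would look roughly like the sketch below. It is only a sketch: the number of trailing underscores has to match the Fortran compiler's name mangling (mpi_init_ vs. mpi_init__), and the hand-written prototype for pmpi_init_ is an assumption about the Fortran binding's C signature rather than something published in mpi.h.)

#include <stdio.h>
#include <mpi.h>

void pmpi_init_(MPI_Fint *ierr);   /* PMPI entry of the Fortran binding (assumed prototype) */

/* intercepts CALL MPI_INIT(ierror) from Fortran code */
void mpi_init_(MPI_Fint *ierr)
{
    printf("intercepted Fortran mpi_init\n");
    pmpi_init_(ierr);              /* forward to the real implementation */
}

As discussed elsewhere in this thread, such a wrapper library has to appear before -lmpifort on the link line (for example via PROFILE_PRELIB) so that this definition is found before the weak mpi_init_ symbol in libmpifort.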
> > > > Thanks > > Florin > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From Hirak_Roy at mentor.com Tue Nov 18 12:49:36 2014 From: Hirak_Roy at mentor.com (Roy, Hirak) Date: Tue, 18 Nov 2014 18:49:36 +0000 Subject: [mpich-discuss] Client hangs if server dies in dynamic process management Message-ID: <8B38871795FD7042B826C1D1670246FBE2415CE0@EU-MBX-01.mgc.mentorg.com> Hi Huiwei, 1> Did you start your nameserver ? 2> Did the server program crash? I see the same hang (incomplete MPI_Finalize in client). Here is my command line: ? hydra_namserver & ? mpiexec -n 1 -nameserver ./server ? mpiexec -n 1 -nameserver ./client MPICH Version: 3.2a2 MPICH Release date: Sun Nov 16 11:09:31 CST 2014 MPICH Device: ch3:sock MPICH configure: --prefix /home/hroy/local//mpich-3.2a2/linux_x86_64 --disable-f77 --disable-fc --disable-f90modules --disable-cxx --enable-fast=nochkmsg --enable-fast=notiming --enable-fast=ndebug --enable-fast=O3 --with-device=ch3:sock --enable-g=dbg --disable-fortran --without-valgrind CFLAGS=-O3 -fPIC CXXFLAGS=-O3 -fPIC MPICH CC: /u/prod/gnu/gcc/20121129/gcc-4.5.0-linux_x86_64/bin/gcc -O3 -fPIC -g -O3 MPICH CXX: no -O3 -fPIC -g MPICH F77: no -g MPICH FC: no -g Thanks, Hirak ________________________________ Could you try with the latest mpich-3.2a2? The client exit successfully on my Macbook with sock channel. - Huiwei > On Nov 16, 2014, at 10:44 PM, Hirak Roy > wrote: > > Hi All, > > Here is my sample program. I am using channel sock of mpich-3.0.4. > > I am running it as > > mpiexec -n 1 ./server.out > > mpiexec -n 1 ./client.out > > Here my client program (client.c) hangs in MPI_Finalize. > There is an assert in the server.c where server exits. > > There is no way to detect that in client. > Even if we detect that using some timeout strategy, the client hangs in the finalize step. > Could you please suggest what is going wrong here or is this a bug in sock channel? > > Thanks, > Hirak > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From thakur at mcs.anl.gov Tue Nov 18 12:49:06 2014 From: thakur at mcs.anl.gov (Rajeev Thakur) Date: Tue, 18 Nov 2014 12:49:06 -0600 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov> Message-ID: <7DD38ACE-B6AE-405B-9B6E-826A4D0461C9@mcs.anl.gov> Isn't it the other way around: implement MPI_ and call PMPI_ ? And if you are mixing Fortran and C, won't you need the underscore? Rajeev On Nov 18, 2014, at 12:40 PM, Junchao Zhang wrote: > Florin, > As we discussed, it looks your aim is to provide MPI profiling (implemented in C) to Fortran+MPI code. > Try this: Implement your profiling layer in PMPI_Xxxx(), and call MPI_Xxxx() in it. You do not need to change cases or add trailing underscores. Then, insert your library after libmpifort.a, before libmpi.a in linking > > > --Junchao Zhang > > On Tue, Nov 18, 2014 at 9:47 AM, Isaila, Florin D. wrote: > Hi, > > it works what Rajeev suggests, defining the C function as mpi_init_ (one underscore for g77). In this case I would need a wrapper for and one for Fortran. > > However I understand from Section 14.2.1 of the MPI3 document (thanks Junchao for pointing to this) that I should be able to define just an MPI_Init wrapper in C and the call of MPI_Init(0, 0) from the Fortran implementation of mpi_init should be also redirected to my function. > > Pavan, I use MPICH-3.1.3. There is slight difference when I run the example you gave me, the libpmpi library does not appear: > fisaila at howard:f77$ mpif77 -show fpi.f > gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lmpifort -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi > > fisaila at howard:f77$ export PROFILE_PRELIB="-lfoo" > > fisaila at howard:f77$ mpif77 -show fpi.f > gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lmpifort -lfoo -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi > > The mpi and mpifort have the following symbols: > > fisaila at howard:f77_program$ nm /homes/fisaila/software/mpich/lib/libmpi.a | grep MPI_Init > 000000000000146d W MPI_Init > 000000000000146d T PMPI_Init > > > fisaila at howard:f77_program$ nm /homes/fisaila/software/mpich/lib/libmpifort.a | grep -i MPI_Init > 0000000000000000 W MPI_INIT > 0000000000000000 W PMPI_INIT > U PMPI_Init > 0000000000000000 W mpi_init > 0000000000000000 W mpi_init_ > 0000000000000000 W mpi_init__ > 0000000000000000 W pmpi_init > 0000000000000000 T pmpi_init_ > 0000000000000000 W pmpi_init__ > > Thanks > Florin > > > ________________________________________ > From: Rajeev Thakur [thakur at mcs.anl.gov] > Sent: Monday, November 17, 2014 3:39 PM > To: discuss at mpich.org > Cc: mpich-discuss at mcs.anl.gov > Subject: Re: [mpich-discuss] f77 bindings and profiling > > You need to add the right number of underscores at the end of the C function depending on the Fortran compiler you are using. For gfortran I think it is two underscores. So define the C function as mpi_init__. 
If that doesn't work, use one underscore. MPICH detects all this automatically at configure time. > > Rajeev > > On Nov 17, 2014, at 3:28 PM, "Isaila, Florin D." wrote: > > > Hi , > > > > I am trying to use MPI profiling to make mpi_init from a F77 program call my MPI_Init (written in C), but I do not manage to achieve that. In this simple F77 program: > > program main > > include 'mpif.h' > > integer error > > call mpi_init(error) > > call mpi_finalize(error) > > end > > > > I try to make the mpi_init call: > > int MPI_Init (int *argc, char ***argv){ int ret; > > printf("My function!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"); > > ret = PMPI_Init(argc, argv); > > return ret; > > } > > > > My MPI_Init belongs to a library libtarget.a I created. I use -profile for compiling and I created the target.conf containing: > > PROFILE_PRELIB="-L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget" > > in the right place. > > > > The library appears in the command line before the mpich library: > > mpif77 -show -g -profile=target init_finalize.f -o init_finalize > > gfortran -g init_finalize.f -o init_finalize -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -lmpich -lopa -lmpl -lrt -lpthread > > > > However, the program never gets into my MPI_Init. > > > > Any suggestion about what I am missing? > > > > Thanks > > Florin > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Tue Nov 18 12:40:01 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Tue, 18 Nov 2014 12:40:01 -0600 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov> Message-ID: Florin, As we discussed, it looks your aim is to provide MPI profiling (implemented in C) to Fortran+MPI code. Try this: Implement your profiling layer in PMPI_Xxxx(), and call MPI_Xxxx() in it. You do not need to change cases or add trailing underscores. Then, insert your library after libmpifort.a, before libmpi.a in linking --Junchao Zhang On Tue, Nov 18, 2014 at 9:47 AM, Isaila, Florin D. wrote: > Hi, > > it works what Rajeev suggests, defining the C function as mpi_init_ (one > underscore for g77). In this case I would need a wrapper for and one for > Fortran. 
> > However I understand from Section 14.2.1 of the MPI3 document (thanks > Junchao for pointing to this) that I should be able to define just an > MPI_Init wrapper in C and the call of MPI_Init(0, 0) from the Fortran > implementation of mpi_init should be also redirected to my function. > > Pavan, I use MPICH-3.1.3. There is slight difference when I run the > example you gave me, the libpmpi library does not appear: > fisaila at howard:f77$ mpif77 -show fpi.f > gfortran fpi.f -I/homes/fisaila/software/mpich/include > -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib > -lmpifort -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib > -Wl,--enable-new-dtags -lmpi > > fisaila at howard:f77$ export PROFILE_PRELIB="-lfoo" > > fisaila at howard:f77$ mpif77 -show fpi.f > gfortran fpi.f -I/homes/fisaila/software/mpich/include > -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib > -lmpifort -lfoo -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib > -Wl,--enable-new-dtags -lmpi > > The mpi and mpifort have the following symbols: > > fisaila at howard:f77_program$ nm /homes/fisaila/software/mpich/lib/libmpi.a > | grep MPI_Init > 000000000000146d W MPI_Init > 000000000000146d T PMPI_Init > > > fisaila at howard:f77_program$ nm > /homes/fisaila/software/mpich/lib/libmpifort.a | grep -i MPI_Init > 0000000000000000 W MPI_INIT > 0000000000000000 W PMPI_INIT > U PMPI_Init > 0000000000000000 W mpi_init > 0000000000000000 W mpi_init_ > 0000000000000000 W mpi_init__ > 0000000000000000 W pmpi_init > 0000000000000000 T pmpi_init_ > 0000000000000000 W pmpi_init__ > > Thanks > Florin > > > ________________________________________ > From: Rajeev Thakur [thakur at mcs.anl.gov] > Sent: Monday, November 17, 2014 3:39 PM > To: discuss at mpich.org > Cc: mpich-discuss at mcs.anl.gov > Subject: Re: [mpich-discuss] f77 bindings and profiling > > You need to add the right number of underscores at the end of the C > function depending on the Fortran compiler you are using. For gfortran I > think it is two underscores. So define the C function as mpi_init__. If > that doesn't work, use one underscore. MPICH detects all this automatically > at configure time. > > Rajeev > > On Nov 17, 2014, at 3:28 PM, "Isaila, Florin D." > wrote: > > > Hi , > > > > I am trying to use MPI profiling to make mpi_init from a F77 program > call my MPI_Init (written in C), but I do not manage to achieve that. In > this simple F77 program: > > program main > > include 'mpif.h' > > integer error > > call mpi_init(error) > > call mpi_finalize(error) > > end > > > > I try to make the mpi_init call: > > int MPI_Init (int *argc, char ***argv){ int ret; > > printf("My function!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"); > > ret = PMPI_Init(argc, argv); > > return ret; > > } > > > > My MPI_Init belongs to a library libtarget.a I created. I use -profile > for compiling and I created the target.conf containing: > > PROFILE_PRELIB="-L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget" > > in the right place. > > > > The library appears in the command line before the mpich library: > > mpif77 -show -g -profile=target init_finalize.f -o init_finalize > > gfortran -g init_finalize.f -o init_finalize > -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib > -L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget -Wl,-rpath > -Wl,/homes/fisaila/software/mpich/lib -lmpich -lopa -lmpl -lrt -lpthread > > > > However, the program never gets into my MPI_Init. 
> > > > Any suggestion about what I am missing? > > > > Thanks > > Florin > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From fisaila at mcs.anl.gov Tue Nov 18 09:47:43 2014 From: fisaila at mcs.anl.gov (Isaila, Florin D.) Date: Tue, 18 Nov 2014 15:47:43 +0000 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov>, <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> Message-ID: <6F4D5A685397B940825208C64CF853A7477F8875@HALAS.anl.gov> Hi, it works what Rajeev suggests, defining the C function as mpi_init_ (one underscore for g77). In this case I would need a wrapper for and one for Fortran. However I understand from Section 14.2.1 of the MPI3 document (thanks Junchao for pointing to this) that I should be able to define just an MPI_Init wrapper in C and the call of MPI_Init(0, 0) from the Fortran implementation of mpi_init should be also redirected to my function. Pavan, I use MPICH-3.1.3. There is slight difference when I run the example you gave me, the libpmpi library does not appear: fisaila at howard:f77$ mpif77 -show fpi.f gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lmpifort -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi fisaila at howard:f77$ export PROFILE_PRELIB="-lfoo" fisaila at howard:f77$ mpif77 -show fpi.f gfortran fpi.f -I/homes/fisaila/software/mpich/include -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -lmpifort -lfoo -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -Wl,--enable-new-dtags -lmpi The mpi and mpifort have the following symbols: fisaila at howard:f77_program$ nm /homes/fisaila/software/mpich/lib/libmpi.a | grep MPI_Init 000000000000146d W MPI_Init 000000000000146d T PMPI_Init fisaila at howard:f77_program$ nm /homes/fisaila/software/mpich/lib/libmpifort.a | grep -i MPI_Init 0000000000000000 W MPI_INIT 0000000000000000 W PMPI_INIT U PMPI_Init 0000000000000000 W mpi_init 0000000000000000 W mpi_init_ 0000000000000000 W mpi_init__ 0000000000000000 W pmpi_init 0000000000000000 T pmpi_init_ 0000000000000000 W pmpi_init__ Thanks Florin ________________________________________ From: Rajeev Thakur [thakur at mcs.anl.gov] Sent: Monday, November 17, 2014 3:39 PM To: discuss at mpich.org Cc: mpich-discuss at mcs.anl.gov Subject: Re: [mpich-discuss] f77 bindings and profiling You need to add the right number of underscores at the end of the C function depending on the Fortran compiler you are using. 
For gfortran I think it is two underscores. So define the C function as mpi_init__. If that doesn't work, use one underscore. MPICH detects all this automatically at configure time. Rajeev On Nov 17, 2014, at 3:28 PM, "Isaila, Florin D." wrote: > Hi , > > I am trying to use MPI profiling to make mpi_init from a F77 program call my MPI_Init (written in C), but I do not manage to achieve that. In this simple F77 program: > program main > include 'mpif.h' > integer error > call mpi_init(error) > call mpi_finalize(error) > end > > I try to make the mpi_init call: > int MPI_Init (int *argc, char ***argv){ int ret; > printf("My function!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"); > ret = PMPI_Init(argc, argv); > return ret; > } > > My MPI_Init belongs to a library libtarget.a I created. I use -profile for compiling and I created the target.conf containing: > PROFILE_PRELIB="-L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget" > in the right place. > > The library appears in the command line before the mpich library: > mpif77 -show -g -profile=target init_finalize.f -o init_finalize > gfortran -g init_finalize.f -o init_finalize -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -lmpich -lopa -lmpl -lrt -lpthread > > However, the program never gets into my MPI_Init. > > Any suggestion about what I am missing? > > Thanks > Florin > > > > > > > > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Tue Nov 18 06:00:11 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Tue, 18 Nov 2014 12:00:11 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) In-Reply-To: <8D58A4B5E6148C419C6AD6334962375DDD9A5B7E@UWMBX04.uw.lu.se> References: <8D58A4B5E6148C419C6AD6334962375DDD9A4C54@UWMBX04.uw.lu.se>, <8D58A4B5E6148C419C6AD6334962375DDD9A5B7E@UWMBX04.uw.lu.se> Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9A5B95@UWMBX04.uw.lu.se> Hi Pavan, >FYI, this is what we heard back from the Mellanox folks: >I think, issue could be because of older MXM (part of MOFED) being used. We can ask him to try latest MXM from HPCX (http://bgate.mellanox.com/products/hpcx) Indeed, I just checked the latest software stack, including: - hpcx-v1.2.0-258-icc-OFED-3.12-redhat6.5; - MPICH v3.2a2 ('--with-device=ch3:nemesis:mxm'); And, there is no problem with 'assertion failed in ch3u_handle_send_req.c' anymore! Many thanks for your help and support! With best regards, Victor. 
_______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From Siegmar.Gross at informatik.hs-fulda.de Tue Nov 18 06:24:40 2014 From: Siegmar.Gross at informatik.hs-fulda.de (Siegmar Gross) Date: Tue, 18 Nov 2014 13:24:40 +0100 Subject: [mpich-discuss] Error building mpich-master-v3.2a2 on Solariswith Sun C 5.12 Message-ID: <201411181224.sAICOeuU014366@tyr.informatik.hs-fulda.de> Hi Ken, > The attached patch (already in master) should fix this issue. Thanks for > reporting. Yes, it does. Thank you very much for your help. Kind regards Siegmar > Ken > > On 11/17/2014 02:26 AM, Siegmar Gross wrote: > > Hi, > > > > today I tried to build mpich-master-v3.2a2 on Solaris 10 Sparc > > and Solaris 10 x86_64 with Sun C 5.12. The process broke with > > the following errors on both machines. > > > > tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 611 tail -20 log.make.SunOS.x86_64.64_cc > > "/export2/src/mpich-3.2/mpich-master-v3.2a2/src/openpa/src/primitives/opa_gcc_in tel_32_64_ops.h", line 151: warning: parameter in inline asm statement unused: %1 > > "/export2/src/mpich-3.2/mpich-master-v3.2a2/src/openpa/src/primitives/opa_gcc_in tel_32_64_ops.h", line 159: warning: parameter in inline asm statement unused: %3 > > "/export2/src/mpich-3.2/mpich-master-v3.2a2/src/openpa/src/primitives/opa_gcc_in tel_32_64_ops.h", line 167: warning: parameter in inline asm statement unused: %3 > > "../mpich-master-v3.2a2/src/mpid/ch3/channels/nemesis/include/mpid_nem_datatypes .h", line 156: warning: syntax error: empty member declaration > > "../mpich-master-v3.2a2/src/mpid/ch3/channels/nemesis/include/mpid_nem_datatypes .h", line 161: warning: syntax error: empty member declaration > > "../mpich-master-v3.2a2/src/mpid/ch3/include/mpidi_recvq_statistics.h", line 13: warning: syntax error: empty declaration > > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 95: warning: syntax error: empty declaration > > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 96: warning: syntax error: empty declaration > > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 97: warning: syntax error: empty declaration > > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 98: warning: syntax error: empty declaration > > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 99: warning: syntax error: empty declaration > > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 100: warning: syntax error: empty declaration > > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 103: warning: syntax error: empty declaration > > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 598: undefined symbol: false > > cc: acomp failed for ../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c > > make[2]: *** [src/mpid/ch3/src/lib_libmpi_la-ch3u_recvq.lo] Error 1 > > make[2]: Leaving directory `/export2/src/mpich-3.2/mpich-master-v3.2a2-SunOS.x86_64.64_cc' > > make[1]: *** [all-recursive] Error 1 > > make[1]: Leaving directory `/export2/src/mpich-3.2/mpich-master-v3.2a2-SunOS.x86_64.64_cc' > > make: *** [all] Error 2 > > tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 612 > > > > I used the following configure command. 
> > > > tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 619 head config.log | grep mpich > > $ ../mpich-master-v3.2a2/configure --prefix=/usr/local/mpich-3.2_64_cc --libdir=/usr/local/mpich-3.2_64_cc/lib64 --includedir=/usr/local/mpich-3.2_64_cc/include64 CC=cc CXX=CC > > F77=f77 FC=f95 CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64 LDFLAGS=-m64 -L/usr/lib/amd64 -R/usr/lib/amd64 --enable-f77 --enable-fc --enable-cxx --enable-romio > > --enable-debuginfo --enable-smpcoll --enable-threads=runtime --with-thread-package=posix --enable-shared > > tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 620 > > > > > > > > I was able to build the package with gcc-4.9.2. Can somebody > > fix the errors for Sun C 5.12? Thank you very much for any > > help in advance. > > > > > > Kind regards > > > > Siegmar > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From thakur at mcs.anl.gov Mon Nov 17 15:39:10 2014 From: thakur at mcs.anl.gov (Rajeev Thakur) Date: Mon, 17 Nov 2014 15:39:10 -0600 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> Message-ID: <08621548-8BFB-4458-9936-97765341598D@mcs.anl.gov> You need to add the right number of underscores at the end of the C function depending on the Fortran compiler you are using. For gfortran I think it is two underscores. So define the C function as mpi_init__. If that doesn't work, use one underscore. MPICH detects all this automatically at configure time. Rajeev On Nov 17, 2014, at 3:28 PM, "Isaila, Florin D." wrote: > Hi , > > I am trying to use MPI profiling to make mpi_init from a F77 program call my MPI_Init (written in C), but I do not manage to achieve that. In this simple F77 program: > program main > include 'mpif.h' > integer error > call mpi_init(error) > call mpi_finalize(error) > end > > I try to make the mpi_init call: > int MPI_Init (int *argc, char ***argv){ int ret; > printf("My function!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"); > ret = PMPI_Init(argc, argv); > return ret; > } > > My MPI_Init belongs to a library libtarget.a I created. I use -profile for compiling and I created the target.conf containing: > PROFILE_PRELIB="-L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget" > in the right place. > > The library appears in the command line before the mpich library: > mpif77 -show -g -profile=target init_finalize.f -o init_finalize > gfortran -g init_finalize.f -o init_finalize -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -lmpich -lopa -lmpl -lrt -lpthread > > However, the program never gets into my MPI_Init. > > Any suggestion about what I am missing? 
> > Thanks > Florin > > > > > > > > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From raffenet at mcs.anl.gov Mon Nov 17 22:11:26 2014 From: raffenet at mcs.anl.gov (Kenneth Raffenetti) Date: Mon, 17 Nov 2014 22:11:26 -0600 Subject: [mpich-discuss] Error building mpich-master-v3.2a2 on Solaris with Sun C 5.12 In-Reply-To: <201411170826.sAH8QCbM020982@tyr.informatik.hs-fulda.de> References: <201411170826.sAH8QCbM020982@tyr.informatik.hs-fulda.de> Message-ID: <546AC6EE.9060303@mcs.anl.gov> Hi Siegmar, The attached patch (already in master) should fix this issue. Thanks for reporting. Ken On 11/17/2014 02:26 AM, Siegmar Gross wrote: > Hi, > > today I tried to build mpich-master-v3.2a2 on Solaris 10 Sparc > and Solaris 10 x86_64 with Sun C 5.12. The process broke with > the following errors on both machines. > > tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 611 tail -20 log.make.SunOS.x86_64.64_cc > "/export2/src/mpich-3.2/mpich-master-v3.2a2/src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h", line 151: warning: parameter in inline asm statement unused: %1 > "/export2/src/mpich-3.2/mpich-master-v3.2a2/src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h", line 159: warning: parameter in inline asm statement unused: %3 > "/export2/src/mpich-3.2/mpich-master-v3.2a2/src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h", line 167: warning: parameter in inline asm statement unused: %3 > "../mpich-master-v3.2a2/src/mpid/ch3/channels/nemesis/include/mpid_nem_datatypes.h", line 156: warning: syntax error: empty member declaration > "../mpich-master-v3.2a2/src/mpid/ch3/channels/nemesis/include/mpid_nem_datatypes.h", line 161: warning: syntax error: empty member declaration > "../mpich-master-v3.2a2/src/mpid/ch3/include/mpidi_recvq_statistics.h", line 13: warning: syntax error: empty declaration > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 95: warning: syntax error: empty declaration > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 96: warning: syntax error: empty declaration > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 97: warning: syntax error: empty declaration > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 98: warning: syntax error: empty declaration > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 99: warning: syntax error: empty declaration > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 100: warning: syntax error: empty declaration > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 103: warning: syntax error: empty declaration > "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 598: undefined symbol: false > cc: acomp failed for ../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c > make[2]: *** [src/mpid/ch3/src/lib_libmpi_la-ch3u_recvq.lo] Error 1 > make[2]: Leaving directory `/export2/src/mpich-3.2/mpich-master-v3.2a2-SunOS.x86_64.64_cc' > make[1]: *** [all-recursive] Error 1 > make[1]: Leaving directory `/export2/src/mpich-3.2/mpich-master-v3.2a2-SunOS.x86_64.64_cc' > make: *** [all] Error 2 > tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 612 > > I used the following configure command. 
> > tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 619 head config.log | grep mpich > $ ../mpich-master-v3.2a2/configure --prefix=/usr/local/mpich-3.2_64_cc --libdir=/usr/local/mpich-3.2_64_cc/lib64 --includedir=/usr/local/mpich-3.2_64_cc/include64 CC=cc CXX=CC > F77=f77 FC=f95 CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64 LDFLAGS=-m64 -L/usr/lib/amd64 -R/usr/lib/amd64 --enable-f77 --enable-fc --enable-cxx --enable-romio > --enable-debuginfo --enable-smpcoll --enable-threads=runtime --with-thread-package=posix --enable-shared > tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 620 > > > > I was able to build the package with gcc-4.9.2. Can somebody > fix the errors for Sun C 5.12? Thank you very much for any > help in advance. > > > Kind regards > > Siegmar > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > From raffenet at mcs.anl.gov Mon Nov 17 22:06:21 2014 From: raffenet at mcs.anl.gov (Ken Raffenetti) Date: Mon, 17 Nov 2014 22:06:21 -0600 Subject: [PATCH] use 0 to indicate false in while expression Message-ID: --- src/mpid/ch3/src/ch3u_recvq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/mpid/ch3/src/ch3u_recvq.c b/src/mpid/ch3/src/ch3u_recvq.c index 48ba6496..ff769f70 100644 --- a/src/mpid/ch3/src/ch3u_recvq.c +++ b/src/mpid/ch3/src/ch3u_recvq.c @@ -595,7 +595,7 @@ MPID_Request * MPIDI_CH3U_Recvq_FDU_or_AEP(int source, int tag, prev_rreq = rreq; rreq = rreq->dev.next; } while (rreq); - } while (false); + } while (0); } } MPIR_T_PVAR_TIMER_END(RECVQ, time_matching_unexpectedq); -- 1.9.1 --------------010402000901060402040207 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss --------------010402000901060402040207-- From wbland at anl.gov Mon Nov 17 10:57:09 2014 From: wbland at anl.gov (Bland, Wesley B.) Date: Mon, 17 Nov 2014 16:57:09 +0000 Subject: [mpich-discuss] nemesis busy wait and High CPU-load In-Reply-To: <8B38871795FD7042B826C1D1670246FBE2415086@EU-MBX-01.mgc.mentorg.com> References: <8B38871795FD7042B826C1D1670246FBE2415086@EU-MBX-01.mgc.mentorg.com> Message-ID: Yes. There?s work going on toward this at the moment. 1103 is actually closed, but I think you can track the progress of this issue with #79. This is on the roadmap for MPICH 3.3 since it requires a fairly large amount of work that wasn?t ready for MPICH 3.2. That?s roughly scheduled for a 2016 release, but there will undoubtedly be alphas and betas before that. If you keep an eye on the ticket, you?ll know when it?s fixed and you can try out some of the nightly builds. Thanks, Wesley On Nov 17, 2014, at 10:40 AM, Roy, Hirak > wrote: Hi All, It looks like the issue is still open described in the following two tickets in mpich-3.1.3 http://trac.mpich.org/projects/mpich/ticket/1103 http://trac.mpich.org/projects/mpich/ticket/79 Is there any plan to work on these? The issue is critical and stops us from using nemesis. Thanks, Hirak _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Mon Nov 17 15:35:56 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Mon, 17 Nov 2014 21:35:56 +0000 Subject: [mpich-discuss] f77 bindings and profiling In-Reply-To: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> Message-ID: <78F582CE-336E-4800-B3EC-FDC03589D529@anl.gov> Florin, What version of MPICH are you using? There should be an -lmpifort in the library list. Here?s what I get with mpich-3.2a2, which is correct: % mpif77 -show ./examples/f77/fpi.f /usr/local/bin/gfortran -Wl,-flat_namespace ./examples/f77/fpi.f -I/usr/local/Cellar/mpich2/3.1.3/include -I/usr/local/Cellar/mpich2/3.1.3/include -L/usr/local/Cellar/mpich2/3.1.3/lib -lmpifort -lmpi -lpmpi % PROFILE_PRELIB=-lfoo mpif77 -show ./examples/f77/fpi.f /usr/local/bin/gfortran -Wl,-flat_namespace ./examples/f77/fpi.f -I/usr/local/Cellar/mpich2/3.1.3/include -I/usr/local/Cellar/mpich2/3.1.3/include -L/usr/local/Cellar/mpich2/3.1.3/lib -lmpifort -lfoo -lmpi -lpmpi In the latter case, the Fortran symbols are in the libmpifort library, which uses libfoo when it tries to find the corresponding C symbols. ? Pavan > On Nov 17, 2014, at 3:28 PM, Isaila, Florin D. wrote: > > Hi , > > I am trying to use MPI profiling to make mpi_init from a F77 program call my MPI_Init (written in C), but I do not manage to achieve that. In this simple F77 program: > program main > include 'mpif.h' > integer error > call mpi_init(error) > call mpi_finalize(error) > end > > I try to make the mpi_init call: > int MPI_Init (int *argc, char ***argv){ int ret; > printf("My function!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"); > ret = PMPI_Init(argc, argv); > return ret; > } > > My MPI_Init belongs to a library libtarget.a I created. I use -profile for compiling and I created the target.conf containing: > PROFILE_PRELIB="-L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget" > in the right place. > > The library appears in the command line before the mpich library: > mpif77 -show -g -profile=target init_finalize.f -o init_finalize > gfortran -g init_finalize.f -o init_finalize -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -lmpich -lopa -lmpl -lrt -lpthread > > However, the program never gets into my MPI_Init. > > Any suggestion about what I am missing? > > Thanks > Florin > > > > > > > > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? 
http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Mon Nov 17 08:49:18 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Mon, 17 Nov 2014 14:49:18 +0000 Subject: [mpich-discuss] Client hangs if server dies in dynamic process management In-Reply-To: <54697D27.4000406@mentor.com> References: <54697D27.4000406@mentor.com> Message-ID: <0A37EA20-9DF6-4441-B7E0-632F3A025ACC@anl.gov> Could you try with the latest mpich-3.2a2? The client exit successfully on my Macbook with sock channel. ? Huiwei > On Nov 16, 2014, at 10:44 PM, Hirak Roy wrote: > > Hi All, > > Here is my sample program. I am using channel sock of mpich-3.0.4. > > I am running it as > > mpiexec -n 1 ./server.out > > mpiexec -n 1 ./client.out > > Here my client program (client.c) hangs in MPI_Finalize. > There is an assert in the server.c where server exits. > > There is no way to detect that in client. > Even if we detect that using some timeout strategy, the client hangs in the finalize step. > Could you please suggest what is going wrong here or is this a bug in sock channel? > > Thanks, > Hirak > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From Hirak_Roy at mentor.com Mon Nov 17 10:40:32 2014 From: Hirak_Roy at mentor.com (Roy, Hirak) Date: Mon, 17 Nov 2014 16:40:32 +0000 Subject: [mpich-discuss] nemesis busy wait and High CPU-load Message-ID: <8B38871795FD7042B826C1D1670246FBE2415086@EU-MBX-01.mgc.mentorg.com> Hi All, It looks like the issue is still open described in the following two tickets in mpich-3.1.3 http://trac.mpich.org/projects/mpich/ticket/1103 http://trac.mpich.org/projects/mpich/ticket/79 Is there any plan to work on these? The issue is critical and stops us from using nemesis. Thanks, Hirak -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From hirak_roy at mentor.com Sun Nov 16 22:44:23 2014 From: hirak_roy at mentor.com (Hirak Roy) Date: Mon, 17 Nov 2014 10:14:23 +0530 Subject: [mpich-discuss] Client hangs if server dies in dynamic process management Message-ID: <54697D27.4000406@mentor.com> Hi All, Here is my sample program. I am using channel sock of mpich-3.0.4. I am running it as > mpiexec -n 1 ./server.out > mpiexec -n 1 ./client.out Here my client program (client.c) hangs in MPI_Finalize. There is an assert in the server.c where server exits. There is no way to detect that in client. Even if we detect that using some timeout strategy, the client hangs in the finalize step. Could you please suggest what is going wrong here or is this a bug in sock channel? Thanks, Hirak -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: client.c Type: text/x-csrc Size: 1395 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: server.c Type: text/x-csrc Size: 1318 bytes Desc: not available URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From Siegmar.Gross at informatik.hs-fulda.de Mon Nov 17 02:26:12 2014 From: Siegmar.Gross at informatik.hs-fulda.de (Siegmar Gross) Date: Mon, 17 Nov 2014 09:26:12 +0100 Subject: [mpich-discuss] Error building mpich-master-v3.2a2 on Solaris with Sun C 5.12 Message-ID: <201411170826.sAH8QCbM020982@tyr.informatik.hs-fulda.de> Hi, today I tried to build mpich-master-v3.2a2 on Solaris 10 Sparc and Solaris 10 x86_64 with Sun C 5.12. The process broke with the following errors on both machines. tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 611 tail -20 log.make.SunOS.x86_64.64_cc "/export2/src/mpich-3.2/mpich-master-v3.2a2/src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h", line 151: warning: parameter in inline asm statement unused: %1 "/export2/src/mpich-3.2/mpich-master-v3.2a2/src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h", line 159: warning: parameter in inline asm statement unused: %3 "/export2/src/mpich-3.2/mpich-master-v3.2a2/src/openpa/src/primitives/opa_gcc_intel_32_64_ops.h", line 167: warning: parameter in inline asm statement unused: %3 "../mpich-master-v3.2a2/src/mpid/ch3/channels/nemesis/include/mpid_nem_datatypes.h", line 156: warning: syntax error: empty member declaration "../mpich-master-v3.2a2/src/mpid/ch3/channels/nemesis/include/mpid_nem_datatypes.h", line 161: warning: syntax error: empty member declaration "../mpich-master-v3.2a2/src/mpid/ch3/include/mpidi_recvq_statistics.h", line 13: warning: syntax error: empty declaration "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 95: warning: syntax error: empty declaration "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 96: warning: syntax error: empty declaration "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 97: warning: syntax error: empty declaration "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 98: warning: syntax error: empty declaration "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 99: warning: syntax error: empty declaration "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 100: warning: syntax error: empty declaration "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 103: warning: syntax error: empty declaration "../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c", line 598: undefined symbol: false cc: acomp failed for ../mpich-master-v3.2a2/src/mpid/ch3/src/ch3u_recvq.c make[2]: *** [src/mpid/ch3/src/lib_libmpi_la-ch3u_recvq.lo] Error 1 make[2]: Leaving directory `/export2/src/mpich-3.2/mpich-master-v3.2a2-SunOS.x86_64.64_cc' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/export2/src/mpich-3.2/mpich-master-v3.2a2-SunOS.x86_64.64_cc' make: *** [all] Error 2 tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 612 I used the following configure command. 
tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 619 head config.log | grep mpich $ ../mpich-master-v3.2a2/configure --prefix=/usr/local/mpich-3.2_64_cc --libdir=/usr/local/mpich-3.2_64_cc/lib64 --includedir=/usr/local/mpich-3.2_64_cc/include64 CC=cc CXX=CC F77=f77 FC=f95 CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64 LDFLAGS=-m64 -L/usr/lib/amd64 -R/usr/lib/amd64 --enable-f77 --enable-fc --enable-cxx --enable-romio --enable-debuginfo --enable-smpcoll --enable-threads=runtime --with-thread-package=posix --enable-shared tyr mpich-master-v3.2a2-SunOS.x86_64.64_cc 620 I was able to build the package with gcc-4.9.2. Can somebody fix the errors for Sun C 5.12? Thank you very much for any help in advance. Kind regards Siegmar _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From vtqanh at gmail.com Sun Nov 16 09:04:55 2014 From: vtqanh at gmail.com (Anh Vo) Date: Sun, 16 Nov 2014 07:04:55 -0800 Subject: [mpich-discuss] the function MPI_Scatterv In-Reply-To: References: Message-ID: Try init tab_indice and tab_taille to NULL (it does not matter that it is initialized for non-root, but the compiler likely isn't smart enough to tell) 2014-11-15 12:38 GMT-08:00 Chafik sanaa : > when i running the following program, i have this error (error C4703: > variable de pointeur locale potentiellement non initialis?e 'tab_indice' > utilis?e and error C4703: variable de pointeur locale potentiellement non > initialis?e 'tab_taille' utilis?e ): > program: > ///////////////////////////////////////////// > // MPI_Scatterv.c > // test de la fonction MPI_Scatterv > ///////////////////////////////////////////// > > #include > #include > #include > > #define DIMENSION 10 > > int main(int argc, char** argv) > { > int myrank, i, n_procs; > int * sendbuf; //buffer ? disperser > int * tab_indice; //indice de d?but de chaque subdivision > int * tab_taille; //nombre d'?l?ments ? envoyer > // pour chaque processus > int * rbuf; //buffer de reception > int taille; //taille de la partie re?ue > > //Initialisation de MPI > MPI_Init(&argc, &argv); > MPI_Comm_size(MPI_COMM_WORLD, &n_procs); > MPI_Comm_rank(MPI_COMM_WORLD, &myrank); > > if (myrank == 0) { > //allocation m?moire > sendbuf = (int *)malloc(n_procs*DIMENSION*sizeof(int)); > tab_indice = (int *)malloc(n_procs*sizeof(int)); > tab_taille = (int *)malloc(n_procs*sizeof(int)); > //remplissage du buffer ? disperser > for (i = 0; i //initialisation des subdivisions > for (i = 0; i //nombre d'?l?ments ? envoyer > tab_taille[i] = DIMENSION - i; > //indice de d?but du processus i > // = celui de i-1 + nombre d'?l?ments > // envoy?s par i-1 > if (i != 0) tab_indice[i] = tab_indice[i - 1] + tab_taille[i - 1]; > else tab_indice[i] = 0; > } > } > > //communication de la taille de la partie re?ue ? 
chaque processus > MPI_Scatter(tab_taille, 1, MPI_INT, &taille, 1, MPI_INT, 0, > MPI_COMM_WORLD); > //allocation du buffer de reception > rbuf = (int *)malloc(taille*sizeof(int)); > > //dispersion > MPI_Scatterv(&sendbuf, tab_taille, tab_indice, MPI_INT, rbuf, > DIMENSION, MPI_INT, 0, MPI_COMM_WORLD); > > //affichage > printf("processus %d : [ ", myrank); > for (i = 0; i printf("]\n"); > > //desallocation m?moire > free(rbuf); > if (myrank == 0) { > free(sendbuf); > free(tab_indice); > free(tab_taille); > } > > MPI_Finalize(); > exit(EXIT_SUCCESS); > } > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From san.chafik at gmail.com Sat Nov 15 14:38:35 2014 From: san.chafik at gmail.com (Chafik sanaa) Date: Sat, 15 Nov 2014 21:38:35 +0100 Subject: [mpich-discuss] the function MPI_Scatterv Message-ID: when i running the following program, i have this error (error C4703: variable de pointeur locale potentiellement non initialis?e 'tab_indice' utilis?e and error C4703: variable de pointeur locale potentiellement non initialis?e 'tab_taille' utilis?e ): program: ///////////////////////////////////////////// // MPI_Scatterv.c // test de la fonction MPI_Scatterv ///////////////////////////////////////////// #include #include #include #define DIMENSION 10 int main(int argc, char** argv) { int myrank, i, n_procs; int * sendbuf; //buffer ? disperser int * tab_indice; //indice de d?but de chaque subdivision int * tab_taille; //nombre d'?l?ments ? envoyer // pour chaque processus int * rbuf; //buffer de reception int taille; //taille de la partie re?ue //Initialisation de MPI MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &n_procs); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); if (myrank == 0) { //allocation m?moire sendbuf = (int *)malloc(n_procs*DIMENSION*sizeof(int)); tab_indice = (int *)malloc(n_procs*sizeof(int)); tab_taille = (int *)malloc(n_procs*sizeof(int)); //remplissage du buffer ? disperser for (i = 0; i -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Fri Nov 14 05:35:48 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Fri, 14 Nov 2014 11:35:48 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes)&In-Reply-To=<0A407957589BAB4F924824150C4293EF588072EC@CITESMBX2.ad.uillinois.edu> In-Reply-To: <8D58A4B5E6148C419C6AD6334962375DDD9A5A27@UWMBX04.uw.lu.se> References: <8D58A4B5E6148C419C6AD6334962375DDD9A5A27@UWMBX04.uw.lu.se> Message-ID: > On Nov 14, 2014, at 2:55 AM, Victor Vysotskiy wrote: > finally I got reply from admins about the status of MXM on our cluster: > > >MXM is part of the Mellanox OFED stack, which is not used on Triolith. > >Please note that NSC recommends using Intel MPI when possible. You can install MXM on the vanilla OFED, even without Mellanox OFED. I don?t know why it?s failing to detect the network adapter, though. 
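(Returning to the MPI_Scatterv program quoted above: its text was damaged when the HTML mail was converted to plain text, so everything after a '<' in the #include names, the for-loop headers and the final print loop was dropped, and the MSVC diagnostic C4703 translates to "potentially uninitialized local pointer variable used". The listing below is a reconstruction, not the poster's exact code: the lost loop bounds are guessed from context, the comments are translated from French, Anh Vo's advice is applied by initializing the root-only pointers to NULL, and two likely bugs are adjusted, namely passing sendbuf rather than &sendbuf to MPI_Scatterv and using taille as the receive count so it matches what the root sends.)

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define DIMENSION 10

int main(int argc, char **argv)
{
    int myrank, i, n_procs;
    int *sendbuf = NULL;     /* buffer to scatter (allocated on root only) */
    int *tab_indice = NULL;  /* start index of each piece */
    int *tab_taille = NULL;  /* number of elements sent to each process */
    int *rbuf;               /* receive buffer */
    int taille;              /* size of the piece received locally */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &n_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        sendbuf    = (int *)malloc(n_procs * DIMENSION * sizeof(int));
        tab_indice = (int *)malloc(n_procs * sizeof(int));
        tab_taille = (int *)malloc(n_procs * sizeof(int));
        /* fill the buffer to scatter */
        for (i = 0; i < n_procs * DIMENSION; i++)
            sendbuf[i] = i;
        /* set up the subdivisions */
        for (i = 0; i < n_procs; i++) {
            tab_taille[i] = DIMENSION - i;
            if (i != 0)
                tab_indice[i] = tab_indice[i - 1] + tab_taille[i - 1];
            else
                tab_indice[i] = 0;
        }
    }

    /* tell each process how many elements it will receive */
    MPI_Scatter(tab_taille, 1, MPI_INT, &taille, 1, MPI_INT, 0, MPI_COMM_WORLD);
    rbuf = (int *)malloc(taille * sizeof(int));

    /* scatter the pieces */
    MPI_Scatterv(sendbuf, tab_taille, tab_indice, MPI_INT,
                 rbuf, taille, MPI_INT, 0, MPI_COMM_WORLD);

    printf("process %d : [ ", myrank);
    for (i = 0; i < taille; i++)
        printf("%d ", rbuf[i]);
    printf("]\n");

    free(rbuf);
    if (myrank == 0) {
        free(sendbuf);
        free(tab_indice);
        free(tab_taille);
    }

    MPI_Finalize();
    exit(EXIT_SUCCESS);
}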
I?ve pinged the Mellanox folks to see if they can help. > Of course, the Intel MPI is installed and available. However, since Intel MPI is based on MPICH, it suffers from the same problems as MPICH does. Right. Intel folks occassionally do release updates. For the next update, we?ll ask them to make sure to pick up this patch. But it?s going to be a few months out. Regards, ? Pavan -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Fri Nov 14 13:41:06 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Fri, 14 Nov 2014 19:41:06 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes)&In-Reply-To=<0A407957589BAB4F924824150C4293EF588072EC@CITESMBX2.ad.uillinois.edu> In-Reply-To: References: <8D58A4B5E6148C419C6AD6334962375DDD9A5A27@UWMBX04.uw.lu.se> Message-ID: <321F1A54-10D2-44DE-B451-F263C3D020DB@anl.gov> FYI, this is what we heard back from the Mellanox folks: I think, issue could be because of older MXM (part of MOFED) being used. We can ask him to try latest MXM from HPCX (http://bgate.mellanox.com/products/hpcx) ? Pavan > On Nov 14, 2014, at 5:35 AM, Balaji, Pavan wrote: > > >> On Nov 14, 2014, at 2:55 AM, Victor Vysotskiy wrote: >> finally I got reply from admins about the status of MXM on our cluster: >> >>> MXM is part of the Mellanox OFED stack, which is not used on Triolith. >>> Please note that NSC recommends using Intel MPI when possible. > > You can install MXM on the vanilla OFED, even without Mellanox OFED. I don?t know why it?s failing to detect the network adapter, though. I?ve pinged the Mellanox folks to see if they can help. > >> Of course, the Intel MPI is installed and available. However, since Intel MPI is based on MPICH, it suffers from the same problems as MPICH does. > > Right. Intel folks occassionally do release updates. For the next update, we?ll ask them to make sure to pick up this patch. But it?s going to be a few months out. > > Regards, > > ? Pavan > > -- > Pavan Balaji ?? > http://www.mcs.anl.gov/~balaji > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Wed Nov 12 09:20:12 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Wed, 12 Nov 2014 15:20:12 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes)&In-Reply-To=<0A407957589BAB4F924824150C4293EF588072EC@CITESMBX2.ad.uillinois.edu> In-Reply-To: <8D58A4B5E6148C419C6AD6334962375DDD9A590B@UWMBX04.uw.lu.se> References: <8D58A4B5E6148C419C6AD6334962375DDD9A590B@UWMBX04.uw.lu.se> Message-ID: Victor, Hmm. I?m not sure. I?ve cc?ed Devendar @ Mellanox. Devendar, can you help? Please see http://lists.mpich.org/pipermail/discuss/2014-November/003396.html for the error Victor is seeing. Regards, ? 
Pavan > On Nov 12, 2014, at 4:42 AM, Victor Vysotskiy wrote: > > Dear Pavan, > > Yes, our cluster is equipped with the Mellanox FDR IB (56 Gb/s): > > %ibv_devinfo > hca_id: mlx4_0 > transport: InfiniBand (0) > fw_ver: 2.30.3200 > node_guid: 0002:c903:0034:5980 > sys_image_guid: 0002:c903:0034:5983 > vendor_id: 0x02c9 > vendor_part_id: 4099 > hw_ver: 0x0 > board_id: HP_0230240019 > phys_port_cnt: 2 > port: 1 > state: PORT_ACTIVE (4) > max_mtu: 4096 (5) > active_mtu: 4096 (5) > sm_lid: 1 > port_lid: 202 > port_lmc: 0x00 > link_layer: InfiniBand > > port: 2 > state: PORT_DOWN (1) > max_mtu: 4096 (5) > active_mtu: 4096 (5) > sm_lid: 0 > port_lid: 0 > port_lmc: 0x00 > link_layer: InfiniBand > > With best regards, > Victor. > > P.s. I am sorry for a mess with message title. Could you please guide me how to properly reply a message posted on the MPICH mailing list. I have tried to use In-Reply-To tag without any success. > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Fri Nov 14 02:55:03 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Fri, 14 Nov 2014 08:55:03 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes)&In-Reply-To=<0A407957589BAB4F924824150C4293EF588072EC@CITESMBX2.ad.uillinois.edu> Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9A5A27@UWMBX04.uw.lu.se> Hi Pavan, finally I got reply from admins about the status of MXM on our cluster: >MXM is part of the Mellanox OFED stack, which is not used on Triolith. >Please note that NSC recommends using Intel MPI when possible. Of course, the Intel MPI is installed and available. However, since Intel MPI is based on MPICH, it suffers from the same problems as MPICH does. Best, Victor. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From fisaila at mcs.anl.gov Wed Nov 12 08:48:44 2014 From: fisaila at mcs.anl.gov (Isaila, Florin D.) Date: Wed, 12 Nov 2014 14:48:44 +0000 Subject: [mpich-discuss] mpiexec and DISPLAY for remote hosts In-Reply-To: <82D2D1D7-063C-4DB0-A3D5-58EAE64CFDA6@anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F85D3@HALAS.anl.gov> <,> <6F4D5A685397B940825208C64CF853A7477F85EA@HALAS.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8601@HALAS.anl.gov>, <82D2D1D7-063C-4DB0-A3D5-58EAE64CFDA6@anl.gov> Message-ID: <6F4D5A685397B940825208C64CF853A7477F8650@HALAS.anl.gov> -enable-x works. Thanks Pavan ________________________________________ From: Balaji, Pavan [balaji at anl.gov] Sent: Tuesday, November 11, 2014 6:44 PM To: discuss at mpich.org Subject: Re: [mpich-discuss] mpiexec and DISPLAY for remote hosts Did you try passing -enable-x to mpiexec? ? Pavan > On Nov 11, 2014, at 5:00 PM, Isaila, Florin D. wrote: > > Hi, > > I am having troubles running mpiexec (from MPICH 3.1) on remote hosts with xterm (for debugging purposes) from an Ubuntu box. 
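For reference, the fix confirmed at the top of this message ("-enable-x works") is Pavan's flag added to the command line from the quoted report, which continues below:

    mpiexec -enable-x -f macfile -n 1 xterm

The -enable-x option asks the Hydra launcher to set up X forwarding for the launched processes, so xterm can find a display without a separate ssh -X session.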
> > The following command works perfectly in local: > mpiexec -n 1 xterm > > However, if I use a machine file: > mpiexec -f macfile -n 1 xterm > xterm Xt error: Can't open display: > xterm: DISPLAY is not set > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 19225 RUNNING AT thwomp.mcs.anl.gov > = EXIT CODE: 1 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > =================================================================================== > > X forwarding is activated: > fisaila at howard:bin$ grep X11Forwarding /etc/ssh/sshd_config > X11Forwarding yes > > The following command works: > fisaila at howard:bin$ ssh thwomp.mcs.anl.gov xterm > > Any suggestion how to solve this? > > Thanks > Florin > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From san.chafik at gmail.com Wed Nov 12 08:57:40 2014 From: san.chafik at gmail.com (Chafik sanaa) Date: Wed, 12 Nov 2014 15:57:40 +0100 Subject: [mpich-discuss] (mpich2nemesis.dl In-Reply-To: References: Message-ID: thank you very much 2014-11-12 14:11 GMT+01:00 Junchao Zhang : > You need to use disp[] > < MPI_Scatterv(&a, rcounts, 0, MPI_INT, &b, 100, MPI_INT, 0, > MPI_COMM_WORLD); > --- > > MPI_Scatterv(&a, rcounts, disp, MPI_INT, &b, 100, MPI_INT, 0, > MPI_COMM_WORLD); > > --Junchao Zhang > > On Wed, Nov 12, 2014 at 5:23 AM, Chafik sanaa > wrote: > >> when i running the following program, i have this error (Exception non >> g?r?e ? 0x00B54BE2 (mpich2nemesis.dll) dans samedi1.exe : 0xC0000005 : >> Violation d'acc?s lors de la lecture de l'emplacement 0x00000000.) >> program : >> #include >> #include "mpi.h" >> >> int main(int argc, char *argv[]) >> { >> >> int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 }; >> int rcounts[4] = { 2, 2, 2, 2 }; >> int disp[4] = { 0, 2, 4, 6 }; >> int b[8] = { 0, 0, 0, 0, 0, 0, 0, 0 }; >> int procid; >> MPI_Init(&argc, &argv); >> MPI_Comm_rank(MPI_COMM_WORLD, &procid); >> MPI_Scatterv(&a, rcounts, 0, MPI_INT, &b, 100, MPI_INT, 0, >> MPI_COMM_WORLD); >> printf("%d", b[0]); >> printf("sanaa"); >> MPI_Finalize(); >> return 0; >> } >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Wed Nov 12 07:05:13 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Wed, 12 Nov 2014 07:05:13 -0600 Subject: [mpich-discuss] the function MPI_Scatterv In-Reply-To: References: Message-ID: See bottom of http://mpi.deino.net/mpi_functions/MPI_Scatterv.html --Junchao Zhang On Wed, Nov 12, 2014 at 3:17 AM, Chafik sanaa wrote: > I want a simple example with the MPI_Scatterv() function, because most of > the programs in internet does not work > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Wed Nov 12 07:11:50 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Wed, 12 Nov 2014 07:11:50 -0600 Subject: [mpich-discuss] (mpich2nemesis.dl In-Reply-To: References: Message-ID: You need to use disp[] < MPI_Scatterv(&a, rcounts, 0, MPI_INT, &b, 100, MPI_INT, 0, MPI_COMM_WORLD); --- > MPI_Scatterv(&a, rcounts, disp, MPI_INT, &b, 100, MPI_INT, 0, MPI_COMM_WORLD); --Junchao Zhang On Wed, Nov 12, 2014 at 5:23 AM, Chafik sanaa wrote: > when i running the following program, i have this error (Exception non > g?r?e ? 0x00B54BE2 (mpich2nemesis.dll) dans samedi1.exe : 0xC0000005 : > Violation d'acc?s lors de la lecture de l'emplacement 0x00000000.) > program : > #include > #include "mpi.h" > > int main(int argc, char *argv[]) > { > > int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 }; > int rcounts[4] = { 2, 2, 2, 2 }; > int disp[4] = { 0, 2, 4, 6 }; > int b[8] = { 0, 0, 0, 0, 0, 0, 0, 0 }; > int procid; > MPI_Init(&argc, &argv); > MPI_Comm_rank(MPI_COMM_WORLD, &procid); > MPI_Scatterv(&a, rcounts, 0, MPI_INT, &b, 100, MPI_INT, 0, > MPI_COMM_WORLD); > printf("%d", b[0]); > printf("sanaa"); > MPI_Finalize(); > return 0; > } > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Wed Nov 12 04:42:30 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Wed, 12 Nov 2014 10:42:30 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes)&In-Reply-To=<0A407957589BAB4F924824150C4293EF588072EC@CITESMBX2.ad.uillinois.edu> Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9A590B@UWMBX04.uw.lu.se> Dear Pavan, Yes, our cluster is equipped with the Mellanox FDR IB (56 Gb/s): %ibv_devinfo hca_id: mlx4_0 transport: InfiniBand (0) fw_ver: 2.30.3200 node_guid: 0002:c903:0034:5980 sys_image_guid: 0002:c903:0034:5983 vendor_id: 0x02c9 vendor_part_id: 4099 hw_ver: 0x0 board_id: HP_0230240019 phys_port_cnt: 2 port: 1 state: PORT_ACTIVE (4) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 1 port_lid: 202 port_lmc: 0x00 link_layer: InfiniBand port: 2 state: PORT_DOWN (1) max_mtu: 4096 (5) active_mtu: 4096 (5) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: InfiniBand With best regards, Victor. P.s. I am sorry for a mess with message title. Could you please guide me how to properly reply a message posted on the MPICH mailing list. I have tried to use In-Reply-To tag without any success. _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From san.chafik at gmail.com Wed Nov 12 05:23:50 2014 From: san.chafik at gmail.com (Chafik sanaa) Date: Wed, 12 Nov 2014 12:23:50 +0100 Subject: [mpich-discuss] (mpich2nemesis.dl Message-ID: when i running the following program, i have this error (Exception non g?r?e ? 0x00B54BE2 (mpich2nemesis.dll) dans samedi1.exe : 0xC0000005 : Violation d'acc?s lors de la lecture de l'emplacement 0x00000000.) program : #include #include "mpi.h" int main(int argc, char *argv[]) { int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 }; int rcounts[4] = { 2, 2, 2, 2 }; int disp[4] = { 0, 2, 4, 6 }; int b[8] = { 0, 0, 0, 0, 0, 0, 0, 0 }; int procid; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &procid); MPI_Scatterv(&a, rcounts, 0, MPI_INT, &b, 100, MPI_INT, 0, MPI_COMM_WORLD); printf("%d", b[0]); printf("sanaa"); MPI_Finalize(); return 0; } -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From san.chafik at gmail.com Wed Nov 12 03:17:28 2014 From: san.chafik at gmail.com (Chafik sanaa) Date: Wed, 12 Nov 2014 10:17:28 +0100 Subject: [mpich-discuss] the function MPI_Scatterv Message-ID: I want a simple example with the MPI_Scatterv() function, because most of the programs in internet does not work -------------- next part -------------- An HTML attachment was scrubbed... 
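Such a minimal example, assembled from the program and the disp[] correction quoted above, could look like the following sketch; using a receive count of 2 instead of 100 (so that it matches rcounts[]) and running on exactly 4 processes are additional assumptions not spelled out in the thread:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int rcounts[4] = { 2, 2, 2, 2 };  /* number of elements sent to each rank */
    int disp[4] = { 0, 2, 4, 6 };     /* offset of each rank's piece within a[] */
    int b[2] = { 0, 0 };              /* each rank receives two elements */
    int procid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &procid);
    /* rank 0 sends a[disp[i]] .. a[disp[i] + rcounts[i] - 1] to rank i */
    MPI_Scatterv(a, rcounts, disp, MPI_INT, b, 2, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d received %d %d\n", procid, b[0], b[1]);
    MPI_Finalize();
    return 0;
}

Run it as mpiexec -n 4 ./a.out; the key difference from the failing version is the third argument, disp instead of 0, so MPI_Scatterv no longer reads through a null displacement array.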
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Wed Nov 12 04:28:51 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Wed, 12 Nov 2014 10:28:51 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9A5901@UWMBX04.uw.lu.se> Dear Pavan, Dear Xin, I just downloaded the latest nightly MPICH3 tarball ('mpich-master-v3.1.3-174-gb0f5772f') and compiled it on our IB cluster using the Intel's compilers v15.0: %mpichversion MPICH Version: 3.1.3 MPICH Release date: Wed Nov 12 00:00:34 CST 2014 MPICH Device: ch3:nemesis MPICH configure: --prefix=/nobackup/global/x_vicvy/mpich3-dev.ib --with-device=ch3:nemesis:ib CC=icc CXX=icpc FC=ifort F77=ifort MPICH CC: icc -O2 MPICH CXX: icpc -O2 MPICH F77: ifort -O2 MPICH FC: ifort -O2 Unfortunately, the test-bed code still crashes with the same error message: %mpirun -np 8 ./mpi_tvec2_rma 64 400000 Allocating memory: win_buf=195 (Mb), loc_buf=24 (Mb) Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 71: win_ptr->at_completion_counter >= 0 internal ABORT - process 0 =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 65516 RUNNING AT n3 = EXIT CODE: 1 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== Here I should mention an observed change: the code always crashes with 8 processes, but sometimes it works out with 4 processes (!). However, even with 4 processes you can reproduce the problem by running test-bed several times in a row; i.e.; for i in {1..5}; do mpirun -np 4 ./mpi_tvec2_rma 64 400000; done In order to be sure that the problem is generic, I also fetched the latest commits from mpich/master repo and recompiled MPICH3 on my laptop by using GCC v4.9: %git log --pretty=oneline --abbrev-commit -5 b0f5772 Revert RMA ADI change for req-based RMA operations. eedd51e Delete unused variable. 9107068 Delete no longer needed file. c235c75 Delete no longer used epoch states. f695c96 Bug-fixing: set window state to MPIDI_RMA_NONE when UNLOCK finishes %mpichversion MPICH Version: 3.1.3 MPICH Release date: unreleased development copy MPICH Device: ch3:nemesis MPICH configure: --prefix=/opt/mpi/mpich3-dev/ --no-create --no-recursion MPICH CC: gcc -O2 MPICH CXX: g++ -O2 MPICH F77: gfortran -O2 MPICH FC: gfortran -O2 Even on my laptop, the problem still remains: mpirun -np 4 ./mpi_tvec2_rma 64 400000 Allocating memory: win_buf=195 (Mb), loc_buf=48 (Mb) Allocating memory: win_buf=195 (Mb), loc_buf=48 (Mb) Allocating memory: win_buf=195 (Mb), loc_buf=48 (Mb) Allocating memory: win_buf=195 (Mb), loc_buf=48 (Mb) Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 71: win_ptr->at_completion_counter >= 0 internal ABORT - process 1 ... It would be great if you can check and fix the issue, if needed. With best regards, Victor. P.s. 
Just in case, please see the first message for test-bed: http://lists.mpich.org/pipermail/discuss/attachments/20141110/b5b34433/attachment.obj _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Tue Nov 11 18:44:22 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Wed, 12 Nov 2014 00:44:22 +0000 Subject: [mpich-discuss] mpiexec and DISPLAY for remote hosts In-Reply-To: <6F4D5A685397B940825208C64CF853A7477F8601@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F85D3@HALAS.anl.gov> <,> <6F4D5A685397B940825208C64CF853A7477F85EA@HALAS.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8601@HALAS.anl.gov> Message-ID: <82D2D1D7-063C-4DB0-A3D5-58EAE64CFDA6@anl.gov> Did you try passing -enable-x to mpiexec? ? Pavan > On Nov 11, 2014, at 5:00 PM, Isaila, Florin D. wrote: > > Hi, > > I am having troubles running mpiexec (from MPICH 3.1) on remote hosts with xterm (for debugging purposes) from an Ubuntu box. > > The following command works perfectly in local: > mpiexec -n 1 xterm > > However, if I use a machine file: > mpiexec -f macfile -n 1 xterm > xterm Xt error: Can't open display: > xterm: DISPLAY is not set > > =================================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 19225 RUNNING AT thwomp.mcs.anl.gov > = EXIT CODE: 1 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > =================================================================================== > > X forwarding is activated: > fisaila at howard:bin$ grep X11Forwarding /etc/ssh/sshd_config > X11Forwarding yes > > The following command works: > fisaila at howard:bin$ ssh thwomp.mcs.anl.gov xterm > > Any suggestion how to solve this? > > Thanks > Florin > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Tue Nov 11 18:57:18 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Wed, 12 Nov 2014 00:57:18 +0000 Subject: [mpich-discuss] mpiexec and DISPLAY for remote hosts In-Reply-To: <82D2D1D7-063C-4DB0-A3D5-58EAE64CFDA6@anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F85D3@HALAS.anl.gov> <,> <6F4D5A685397B940825208C64CF853A7477F85EA@HALAS.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8601@HALAS.anl.gov> <82D2D1D7-063C-4DB0-A3D5-58EAE64CFDA6@anl.gov> Message-ID: <49710369-0258-4974-ADD8-F7EE0C535601@anl.gov> You can also try with 'ssh -X? to connect. ? Huiwei > On Nov 11, 2014, at 6:44 PM, Balaji, Pavan wrote: > > > Did you try passing -enable-x to mpiexec? > > ? Pavan > >> On Nov 11, 2014, at 5:00 PM, Isaila, Florin D. wrote: >> >> Hi, >> >> I am having troubles running mpiexec (from MPICH 3.1) on remote hosts with xterm (for debugging purposes) from an Ubuntu box. 
>> >> The following command works perfectly in local: >> mpiexec -n 1 xterm >> >> However, if I use a machine file: >> mpiexec -f macfile -n 1 xterm >> xterm Xt error: Can't open display: >> xterm: DISPLAY is not set >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 19225 RUNNING AT thwomp.mcs.anl.gov >> = EXIT CODE: 1 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> =================================================================================== >> >> X forwarding is activated: >> fisaila at howard:bin$ grep X11Forwarding /etc/ssh/sshd_config >> X11Forwarding yes >> >> The following command works: >> fisaila at howard:bin$ ssh thwomp.mcs.anl.gov xterm >> >> Any suggestion how to solve this? >> >> Thanks >> Florin >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss > > -- > Pavan Balaji ?? > http://www.mcs.anl.gov/~balaji > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Tue Nov 11 10:35:12 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Tue, 11 Nov 2014 16:35:12 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes)&In-Reply-To=<0A407957589BAB4F924824150C4293EF588072EC@CITESMBX2.ad.uillinois.edu> In-Reply-To: <8D58A4B5E6148C419C6AD6334962375DDD9A5878@UWMBX04.uw.lu.se> References: <8D58A4B5E6148C419C6AD6334962375DDD9A5878@UWMBX04.uw.lu.se> Message-ID: <62389563-7400-468E-B6C8-4CE7141FC3BC@anl.gov> > On Nov 11, 2014, at 10:31 AM, Victor Vysotskiy wrote: > %mpirun -np 2 ./a.out > [1415719636.245549] [n1:37817:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem. > > [1415719636.245549] [n1:37818:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem. You can get rid of this by setting the environment variable MXM_LOG_LEVEL=error > [1415719636.263914] [n1:37817:0] ib_dev.c:443 MXM ERROR ibv_query_device() returned 38: No such file or directory > > [1415719636.264134] [n1:37818:0] ib_dev.c:443 MXM ERROR ibv_query_device() returned 38: No such file or directory These are weird. I?m assuming you have a Mellanox IB network? ? Pavan -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From reuti at staff.uni-marburg.de Tue Nov 11 18:12:02 2014 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed, 12 Nov 2014 01:12:02 +0100 Subject: [mpich-discuss] mpiexec and DISPLAY for remote hosts In-Reply-To: <6F4D5A685397B940825208C64CF853A7477F8601@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F85D3@HALAS.anl.gov>, <6F4D5A685397B940825208C64CF853A7477F85EA@HALAS.anl.gov> <6F4D5A685397B940825208C64CF853A7477F8601@HALAS.anl.gov> Message-ID: ssh_config or sshd_config? 
It should go to the first one. Von meinem iPod gesendet Am 12.11.2014 um 00:00 schrieb "Isaila, Florin D." : > Hi, > > I am having troubles running mpiexec (from MPICH 3.1) on remote > hosts with xterm (for debugging purposes) from an Ubuntu box. > > The following command works perfectly in local: > mpiexec -n 1 xterm > > However, if I use a machine file: > mpiexec -f macfile -n 1 xterm > xterm Xt error: Can't open display: > xterm: DISPLAY is not set > > === > === > === > === > === > ==================================================================== > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > = PID 19225 RUNNING AT thwomp.mcs.anl.gov > = EXIT CODE: 1 > = CLEANING UP REMAINING PROCESSES > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > === > === > === > === > === > ==================================================================== > > X forwarding is activated: > fisaila at howard:bin$ grep X11Forwarding /etc/ssh/sshd_config > X11Forwarding yes > > The following command works: > fisaila at howard:bin$ ssh thwomp.mcs.anl.gov xterm > > Any suggestion how to solve this? > > Thanks > Florin > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Tue Nov 11 10:31:01 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Tue, 11 Nov 2014 16:31:01 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes)&In-Reply-To=<0A407957589BAB4F924824150C4293EF588072EC@CITESMBX2.ad.uillinois.edu> Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9A5878@UWMBX04.uw.lu.se> Dear Pavan, thank you for hints! Ok, I was able to install MXM locally without root privileges. Then I have configured and compiled MPICH3 with MXM. Everything went smoothly and finally I got a working binaries and libs. Unfortunately, root is still needed because a simple MPI HELLO WORLD code now crashes with the following error message: %mpirun -np 2 ./a.out [1415719636.245549] [n1:37817:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem. [1415719636.245549] [n1:37818:0] shm.c:65 MXM WARN Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem. 
[1415719636.263914] [n1:37817:0] ib_dev.c:443 MXM ERROR ibv_query_device() returned 38: No such file or directory [1415719636.264134] [n1:37818:0] ib_dev.c:443 MXM ERROR ibv_query_device() returned 38: No such file or directory Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(498).........: MPID_Init(187)................: channel initialization failed MPIDI_CH3_Init(89)............: MPID_nem_init(320)............: MPID_nem_mxm_init(157)........: MPID_nem_mxm_vc_terminate(451): mxm_init failed (Input/output error) Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(498).........: MPID_Init(187)................: channel initialization failed MPIDI_CH3_Init(89)............: MPID_nem_init(320)............: MPID_nem_mxm_init(157)........: MPID_nem_mxm_vc_terminate(451): mxm_init failed (Input/output error) %ldd a.out linux-vdso.so.1 => (0x00007ffff29ff000) libmpi.so.12 => /nobackup/global/x_vicvy/mpich3-dev-bin.mxm/lib/libmpi.so.12 (0x00007f9fd4fa1000) libm.so.6 => /lib64/libm.so.6 (0x00007f9fd4cfe000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f9fd4ae8000) libc.so.6 => /lib64/libc.so.6 (0x00007f9fd4754000) libdl.so.2 => /lib64/libdl.so.2 (0x00007f9fd454f000) libmxm.so.2 => /nobackup/global/x_vicvy/hpcx-v1.2.0-255-icc-MLNX_OFED_LINUX-2.3-1.5.0-redhat6.5/mxm/lib/libmxm.so.2 (0x00007f9fd41f5000) libz.so.1 => /lib64/libz.so.1 (0x00007f9fd3fdf000) libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f9fd3dd1000) librt.so.1 => /lib64/librt.so.1 (0x00007f9fd3bc9000) libgpfs.so => /usr/lib64/libgpfs.so (0x00007f9fd39ba000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9fd379c000) libifport.so.5 => /software/apps/intel/composer_xe_2015.0.090/compiler/lib/intel64/libifport.so.5 (0x00007f9fd356f000) libifcore.so.5 => /software/apps/intel/composer_xe_2015.0.090/compiler/lib/intel64/libifcore.so.5 (0x00007f9fd3239000) libimf.so => /software/apps/intel/composer_xe_2015.0.090/compiler/lib/intel64/libimf.so (0x00007f9fd2d7e000) libsvml.so => /software/apps/intel/composer_xe_2015.0.090/compiler/lib/intel64/libsvml.so (0x00007f9fd212f000) libintlc.so.5 => /software/apps/intel/composer_xe_2015.0.090/compiler/lib/intel64/libintlc.so.5 (0x00007f9fd1ed5000) /lib64/ld-linux-x86-64.so.2 (0x00007f9fd5499000) libirng.so => /software/apps/intel/composer_xe_2015.0.090/compiler/lib/intel64/libirng.so (0x00007f9fd1ccd000) Since ?nemesis:ch3:ib? is fixed, I will check it out. With best regards, Victor. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From san.chafik at gmail.com Tue Nov 11 09:42:52 2014 From: san.chafik at gmail.com (Chafik sanaa) Date: Tue, 11 Nov 2014 16:42:52 +0100 Subject: [mpich-discuss] (no subject) Message-ID: Hi, What should I do? i have the folowing error (error C4703: variable de pointeur locale potentiellement non initialis?e 'sendbuf' utilis?e and error C4703: variable de pointeur locale potentiellement non initialis?e 'tab_indice' utilis?e ) when I execute this program: #include #include #include #define DIMENSION 10 int main(int argc, char** argv) { int myrank, i, n_procs; int * sendbuf; //buffer ? disperser int * tab_indice; //indice de d?but de chaque subdivision int * tab_taille; //nombre d'?l?ments ? 
envoyer // pour chaque processus int * rbuf; //buffer de reception int taille; //taille de la partie re?ue //Initialisation de MPI MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &n_procs); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); if (myrank == 0) { //allocation m?moire sendbuf = (int *)malloc(n_procs*DIMENSION*sizeof(int)); tab_indice = (int *)malloc(n_procs*sizeof(int)); tab_taille = (int *)malloc(n_procs*sizeof(int)); //remplissage du buffer ? disperser printf("la taille de la boucle %d \n", n_procs*DIMENSION); for (i = 0; i -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Tue Nov 11 10:29:53 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Tue, 11 Nov 2014 16:29:53 +0000 Subject: [mpich-discuss] (no subject) In-Reply-To: References: Message-ID: Please only use this list for MPICH-related questions, not general programming questions. ? Pavan > On Nov 11, 2014, at 9:42 AM, Chafik sanaa wrote: > > Hi, > What should I do? i have the folowing error (error C4703: variable de pointeur locale potentiellement non initialis?e 'sendbuf' utilis?e and error C4703: variable de pointeur locale potentiellement non initialis?e 'tab_indice' utilis?e ) when I execute this program: > #include > #include > #include > > #define DIMENSION 10 > > int main(int argc, char** argv) > { > int myrank, i, n_procs; > int * sendbuf; //buffer ? disperser > int * tab_indice; //indice de d?but de chaque subdivision > int * tab_taille; //nombre d'?l?ments ? envoyer > // pour chaque processus > int * rbuf; //buffer de reception > int taille; //taille de la partie re?ue > > //Initialisation de MPI > MPI_Init(&argc, &argv); > MPI_Comm_size(MPI_COMM_WORLD, &n_procs); > MPI_Comm_rank(MPI_COMM_WORLD, &myrank); > > if (myrank == 0) > { > //allocation m?moire > sendbuf = (int *)malloc(n_procs*DIMENSION*sizeof(int)); > tab_indice = (int *)malloc(n_procs*sizeof(int)); > tab_taille = (int *)malloc(n_procs*sizeof(int)); > //remplissage du buffer ? disperser > printf("la taille de la boucle %d \n", n_procs*DIMENSION); > for (i = 0; i //initialisation des subdivisions > for (i = 0; i { > //nombre d'?l?ments ? envoyer > tab_taille[i] = DIMENSION - i; > printf("tab_taille[%d]=%d \n", i, tab_taille[i]); > //indice de d?but du processus i > // = celui de i-1 + nombre d'?l?ments > // envoy?s par i-1 > if (i != 0) tab_indice[i] = tab_indice[i - 1] + tab_taille[i - 1]; > else tab_indice[i] = 0; > printf("tab_indice[%d]=%d \n", i, tab_indice[i]); > } > } > //communication de la taille de la partie re?ue ? chaque processus > MPI_Scatter(tab_taille, 1, MPI_INT, &taille, 1, MPI_INT, 0, MPI_COMM_WORLD); > //allocation du buffer de reception > rbuf = (int *)malloc(taille*sizeof(int)); > //dispersion > MPI_Scatterv(sendbuf, tab_taille, tab_indice, MPI_INT, rbuf,DIMENSION, MPI_INT, 0, MPI_COMM_WORLD); > //affichage > printf("processus %d : [ ", myrank); > for (i = 0; i printf("]\n"); > > //desallocation m?moire > free(rbuf); > if (myrank == 0) { > free(sendbuf); > free(tab_indice); > free(tab_taille); > } > > MPI_Finalize(); > exit(EXIT_SUCCESS); > } > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? 
http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Tue Nov 11 07:18:53 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Tue, 11 Nov 2014 13:18:53 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes)&In-Reply-To=<0A407957589BAB4F924824150C4293EF588072EC@CITESMBX2.ad.uillinois.edu> Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9A4CC7@UWMBX04.uw.lu.se> Dear Xin, unfortunately, MXM is not avaiable on our cluster and admins won't install it. So, should I wait untill you repair 'ch3:nemesis:ib'? Actually, I would be happy even if my test-code starts working on my laptop :) Best regards, Victor -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Tue Nov 11 07:53:04 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Tue, 11 Nov 2014 13:53:04 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes)&In-Reply-To=<0A407957589BAB4F924824150C4293EF588072EC@CITESMBX2.ad.uillinois.edu> In-Reply-To: <8D58A4B5E6148C419C6AD6334962375DDD9A4CC7@UWMBX04.uw.lu.se> References: <8D58A4B5E6148C419C6AD6334962375DDD9A4CC7@UWMBX04.uw.lu.se> Message-ID: <86200D04-EB06-4BC5-BE1E-36F184817972@anl.gov> Victor, You don?t need admin access for installing MXM. You can just download it from here: http://www.mellanox.com/page/products_dyn?product_family=189&mtag=hpc-x Simply untar it in your home directory, and you can build mpich by pointing to that location with --with-mxm=/path/to/mxm FWIW, the ch3:nemesis:ib code is now fixed in mpich/master. You should be able to use it from tonight?s tarball as well. ? Pavan > On Nov 11, 2014, at 7:18 AM, Victor Vysotskiy wrote: > > Dear Xin, > > unfortunately, MXM is not avaiable on our cluster and admins won't install it. So, should I wait untill you repair 'ch3:nemesis:ib'? Actually, I would be happy even if my test-code starts working on my laptop :) > > Best regards, > Victor > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From san.chafik at gmail.com Mon Nov 10 16:11:28 2014 From: san.chafik at gmail.com (Chafik sanaa) Date: Mon, 10 Nov 2014 23:11:28 +0100 Subject: [mpich-discuss] (no subject) Message-ID: if I want to distribute data (as vectors) with different sizes to each process, which function I should use? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From thakur at mcs.anl.gov Mon Nov 10 16:17:01 2014 From: thakur at mcs.anl.gov (Rajeev Thakur) Date: Mon, 10 Nov 2014 16:17:01 -0600 Subject: [mpich-discuss] (no subject) In-Reply-To: References: Message-ID: <4781EC90-3ADC-4B68-95FC-EFDC6AECEEA8@mcs.anl.gov> Try MPI_Scatterv. On Nov 10, 2014, at 4:11 PM, Chafik sanaa wrote: > if I want to distribute data (as vectors) with different sizes to each process, which function I should use? > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Mon Nov 10 13:02:12 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Mon, 10 Nov 2014 19:02:12 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9A4C54@UWMBX04.uw.lu.se> Dear Pavan, well, the problem still remains with the latest ('e3e7d765e') mpich nightly snapshot cloned from git://git.mpich.org/mpich.git. There is an another compiling problem when MPICH is configure with the '--with-device=ch3:nemesis:ib' option: src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(37): error: expected a declaration } ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(89): warning #12: parsing restarts here after previous syntax error goto fn_exit; ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(90): error: expected a declaration } ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(97): warning #12: parsing restarts here after previous syntax error ibv_poll_cq(MPID_nem_ib_rc_shared_scq, /*3 */ MPID_NEM_IB_COM_MAX_CQ_HEIGHT_DRAIN, &cqe[0]); ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(100): error: expected a declaration MPIU_ERR_CHKANDJUMP(result < 0, mpi_errno, MPI_ERR_OTHER, "**netmod,ib,ibv_poll_cq"); ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(102): error: expected a declaration if (result > 0) { ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(126): warning #12: parsing restarts here after previous syntax error msg_type = MPIDI_Request_get_msg_type(req); Could you please have a look at the issues? With best regards, Victor. _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From xinzhao3 at illinois.edu Mon Nov 10 14:25:18 2014 From: xinzhao3 at illinois.edu (Zhao, Xin) Date: Mon, 10 Nov 2014 20:25:18 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) In-Reply-To: <8D58A4B5E6148C419C6AD6334962375DDD9A4C54@UWMBX04.uw.lu.se> References: <8D58A4B5E6148C419C6AD6334962375DDD9A4C54@UWMBX04.uw.lu.se> Message-ID: <0A407957589BAB4F924824150C4293EF588072EC@CITESMBX2.ad.uillinois.edu> Hi Victor, I will look into your assertion failure code. The RMA part in ch3:nemesis:ib is broken. Could you try to use MXM netmod by ch3:nemesis:mxm? 
Xin ________________________________________ From: Victor Vysotskiy [victor.vysotskiy at teokem.lu.se] Sent: Monday, November 10, 2014 1:02 PM To: discuss at mpich.org Subject: Re: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) Dear Pavan, well, the problem still remains with the latest ('e3e7d765e') mpich nightly snapshot cloned from git://git.mpich.org/mpich.git. There is an another compiling problem when MPICH is configure with the '--with-device=ch3:nemesis:ib' option: src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(37): error: expected a declaration } ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(89): warning #12: parsing restarts here after previous syntax error goto fn_exit; ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(90): error: expected a declaration } ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(97): warning #12: parsing restarts here after previous syntax error ibv_poll_cq(MPID_nem_ib_rc_shared_scq, /*3 */ MPID_NEM_IB_COM_MAX_CQ_HEIGHT_DRAIN, &cqe[0]); ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(100): error: expected a declaration MPIU_ERR_CHKANDJUMP(result < 0, mpi_errno, MPI_ERR_OTHER, "**netmod,ib,ibv_poll_cq"); ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(102): error: expected a declaration if (result > 0) { ^ src/mpid/ch3/channels/nemesis/netmod/ib/ib_poll.c(126): warning #12: parsing restarts here after previous syntax error msg_type = MPIDI_Request_get_msg_type(req); Could you please have a look at the issues? With best regards, Victor. _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Mon Nov 10 07:53:12 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Mon, 10 Nov 2014 13:53:12 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) In-Reply-To: <8D58A4B5E6148C419C6AD6334962375DDD9A4C26@UWMBX04.uw.lu.se> References: <8D58A4B5E6148C419C6AD6334962375DDD9A4C26@UWMBX04.uw.lu.se> Message-ID: Victor, We believe this has been fixed in mpich/master. Please download the latest mpich nightly snapshot and give it a try to see if it fixes the issue for you. FYI, the mpich website is down for maintenance at the moment. It should be back up in a couple of hours (or sooner). ? Pavan > On Nov 10, 2014, at 6:57 AM, Victor Vysotskiy wrote: > > Dear Developers, > > recently I have mentioned a problem with assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c: > > > http://lists.mpich.org/pipermail/discuss/2014-October/003354.html > > Finally, I was able to narrow problem to a small piece of code. Enclosed please find the test-bed code. 
In order to reproduce the problem, please compile and execute it with the following arguments: > > mpicc mpi_tvec2_rma.c -o mpi_tvec2_rma > > mpirun -np 8 ./mpi_tvec2_rma 80 400000 > > Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61: win_ptr->at_completion_counter >= 0 > > internal ABORT - process 0 > > =================================================================================== > > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES > > = PID 12900 RUNNING AT n6 > > = EXIT CODE: 1 > > = CLEANING UP REMAINING PROCESSES > > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES > > > =================================================================================== > > > > However, everything works out with "8 400000", and with "16 400000". This problem is reproducible on my 4-core laptop as well as on the 16-core HP SL203S GEN8 compute node. The GCC v4.7.3 and Intel v15.0.0 were used to compile MPICH v3.1.3 on my laptop and on HP SL203S, respectively. Moreover, the MPICH v3.1.3 used includes the '920661c3931' commit by Xin Zhao. > > > > Could you please comment on this issue? Is it a bug in MPICH, or something is wrong with the test-bed code attached? > > > > With best regards, > > Victor. > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From huiweilu at mcs.anl.gov Sun Nov 9 16:55:58 2014 From: huiweilu at mcs.anl.gov (Lu, Huiwei) Date: Sun, 9 Nov 2014 22:55:58 +0000 Subject: [mpich-discuss] error of running the MPICH2 In-Reply-To: References: Message-ID: <6770C81E-903B-4E41-832A-7FE6D65E8074@anl.gov> Hi, Chafik, Due to lack of developer resources, MPICH no longer supports the Windows platform. Please refer to our FAQ for more information: http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_can.27t_I_build_MPICH_on_Windows_anymore.3F We recommend you use Microsoft MPI, which can be found here: http://msdn.microsoft.com/en-us/library/bb524831(v=vs.85).aspx ? 
Huiwei > On Nov 9, 2014, at 2:29 PM, Chafik sanaa wrote: > >> I am trying to run MPICH2 from the Windows 7 environment but I can't get the wmpiexec.exe working and get the following error message: Error: No smpd passphrase specified through the registry or .smpd file, exiting.i have already registered user account in mpiexec.exe > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Mon Nov 10 06:57:52 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Mon, 10 Nov 2014 12:57:52 +0000 Subject: [mpich-discuss] Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61 (RMA && Derived datatypes) Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9A4C26@UWMBX04.uw.lu.se> Dear Developers, recently I have mentioned a problem with assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c: http://lists.mpich.org/pipermail/discuss/2014-October/003354.html Finally, I was able to narrow problem to a small piece of code. Enclosed please find the test-bed code. In order to reproduce the problem, please compile and execute it with the following arguments: mpicc mpi_tvec2_rma.c -o mpi_tvec2_rma mpirun -np 8 ./mpi_tvec2_rma 80 400000 Assertion failed in file src/mpid/ch3/src/ch3u_handle_send_req.c at line 61: win_ptr->at_completion_counter >= 0 internal ABORT - process 0 =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 12900 RUNNING AT n6 = EXIT CODE: 1 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== However, everything works out with "8 400000", and with "16 400000". This problem is reproducible on my 4-core laptop as well as on the 16-core HP SL203S GEN8 compute node. The GCC v4.7.3 and Intel v15.0.0 were used to compile MPICH v3.1.3 on my laptop and on HP SL203S, respectively. Moreover, the MPICH v3.1.3 used includes the '920661c3931' commit by Xin Zhao. Could you please comment on this issue? Is it a bug in MPICH, or something is wrong with the test-bed code attached? With best regards, Victor. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: mpi_tvec2_rma.c Type: application/octet-stream Size: 2164 bytes Desc: mpi_tvec2_rma.c URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From yang.shuchi at gmail.com Sat Nov 8 11:26:11 2014 From: yang.shuchi at gmail.com (Shuchi Yang) Date: Sat, 8 Nov 2014 10:26:11 -0700 Subject: [mpich-discuss] Compile Fortran code with -static In-Reply-To: <066879CE-F4BA-4961-88E0-DB4EE84B9A18@mcs.anl.gov> References: <066879CE-F4BA-4961-88E0-DB4EE84B9A18@mcs.anl.gov> Message-ID: Thanks, I am trying it now. Best, Shuchi On Sat, Nov 8, 2014 at 9:39 AM, Rajeev Thakur wrote: > If you don't need to use the MPI-IO functions, try configuring with > --disable-romio. 
> > Rajeev > > On Nov 8, 2014, at 9:58 AM, Shuchi Yang wrote: > > > 1) At Ubuntu 14.04, Intel Fortran 2015 xe > > Download mpich-3.13 > > It works well with general purpose > > 2) I need compile the fortran code with -static to run on machine > without Intel Fortran environment so I use -static option to compile both > MPICH and FORTRAN code. I met the following error > > > > /mpich/lib/libmpi.a(lib_libmpi_la-tcp_init.o): In function > `MPID_nem_tcp_get_business_card': > > src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_init.c:(.text+0x319): > warning: Using 'gethostbyname' in statically linked applications requires > at runtime the shared libraries from the glibc version used for linking > > /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_IwriteContig': > > adio/common/ad_iwrite.c:(.text+0xb7): undefined reference to > `aio_write64' > > /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_aio': > > adio/common/ad_iwrite.c:(.text+0x2dc): undefined reference to > `aio_write64' > > adio/common/ad_iwrite.c:(.text+0x2e6): undefined reference to > `aio_read64' > > /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_aio_poll_fn': > > adio/common/ad_iwrite.c:(.text+0x4d0): undefined reference to > `aio_error64' > > adio/common/ad_iwrite.c:(.text+0x4f0): undefined reference to > `aio_return64' > > /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_aio_wait_fn': > > adio/common/ad_iwrite.c:(.text+0x65d): undefined reference to > `aio_suspend64' > > adio/common/ad_iwrite.c:(.text+0x6a0): undefined reference to > `aio_error64' > > adio/common/ad_iwrite.c:(.text+0x6b5): undefined reference to > `aio_return64' > > > > I tried to link librt.a and libaio.a, it does not work with the > following option: > > -L/usr/lib/x86_64-linux-gnu -lrt -laio > > > > Any comments and suggestion will be welcome. > > > > Shuchi > > > > _______________________________________________ > > discuss mailing list discuss at mpich.org > > To manage subscription options or unsubscribe: > > https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From san.chafik at gmail.com Sun Nov 9 14:29:05 2014 From: san.chafik at gmail.com (Chafik sanaa) Date: Sun, 9 Nov 2014 21:29:05 +0100 Subject: [mpich-discuss] error of running the MPICH2 Message-ID: *I am trying to run MPICH2 from the Windows 7 environment but I can't get **the wmpiexec.exe working and get the following error message:** Error: No smpd passphrase specified through the registry or .smpd file, **exiting.**i have already registered user account in mpiexec.exe* -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From yang.shuchi at gmail.com Sat Nov 8 09:58:15 2014 From: yang.shuchi at gmail.com (Shuchi Yang) Date: Sat, 8 Nov 2014 08:58:15 -0700 Subject: [mpich-discuss] Compile Fortran code with -static Message-ID: 1) At Ubuntu 14.04, Intel Fortran 2015 xe Download mpich-3.13 It works well with general purpose 2) I need compile the fortran code with -static to run on machine without Intel Fortran environment so I use -static option to compile both MPICH and FORTRAN code. I met the following error /mpich/lib/libmpi.a(lib_libmpi_la-tcp_init.o): In function `MPID_nem_tcp_get_business_card': src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_init.c:(.text+0x319): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_IwriteContig': adio/common/ad_iwrite.c:(.text+0xb7): undefined reference to `aio_write64' /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_aio': adio/common/ad_iwrite.c:(.text+0x2dc): undefined reference to `aio_write64' adio/common/ad_iwrite.c:(.text+0x2e6): undefined reference to `aio_read64' /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_aio_poll_fn': adio/common/ad_iwrite.c:(.text+0x4d0): undefined reference to `aio_error64' adio/common/ad_iwrite.c:(.text+0x4f0): undefined reference to `aio_return64' /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_aio_wait_fn': adio/common/ad_iwrite.c:(.text+0x65d): undefined reference to `aio_suspend64' adio/common/ad_iwrite.c:(.text+0x6a0): undefined reference to `aio_error64' adio/common/ad_iwrite.c:(.text+0x6b5): undefined reference to `aio_return64' I tried to link librt.a and libaio.a, it does not work with the following option: -L/usr/lib/x86_64-linux-gnu -lrt -laio Any comments and suggestion will be welcome. Shuchi -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From thakur at mcs.anl.gov Sat Nov 8 10:39:55 2014 From: thakur at mcs.anl.gov (Rajeev Thakur) Date: Sat, 8 Nov 2014 10:39:55 -0600 Subject: [mpich-discuss] Compile Fortran code with -static In-Reply-To: References: Message-ID: <066879CE-F4BA-4961-88E0-DB4EE84B9A18@mcs.anl.gov> If you don't need to use the MPI-IO functions, try configuring with --disable-romio. Rajeev On Nov 8, 2014, at 9:58 AM, Shuchi Yang wrote: > 1) At Ubuntu 14.04, Intel Fortran 2015 xe > Download mpich-3.13 > It works well with general purpose > 2) I need compile the fortran code with -static to run on machine without Intel Fortran environment so I use -static option to compile both MPICH and FORTRAN code. 
I met the following error > > /mpich/lib/libmpi.a(lib_libmpi_la-tcp_init.o): In function `MPID_nem_tcp_get_business_card': > src/mpid/ch3/channels/nemesis/netmod/tcp/tcp_init.c:(.text+0x319): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking > /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_IwriteContig': > adio/common/ad_iwrite.c:(.text+0xb7): undefined reference to `aio_write64' > /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_aio': > adio/common/ad_iwrite.c:(.text+0x2dc): undefined reference to `aio_write64' > adio/common/ad_iwrite.c:(.text+0x2e6): undefined reference to `aio_read64' > /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_aio_poll_fn': > adio/common/ad_iwrite.c:(.text+0x4d0): undefined reference to `aio_error64' > adio/common/ad_iwrite.c:(.text+0x4f0): undefined reference to `aio_return64' > /mpich/lib/libmpi.a(ad_iwrite.o): In function `ADIOI_GEN_aio_wait_fn': > adio/common/ad_iwrite.c:(.text+0x65d): undefined reference to `aio_suspend64' > adio/common/ad_iwrite.c:(.text+0x6a0): undefined reference to `aio_error64' > adio/common/ad_iwrite.c:(.text+0x6b5): undefined reference to `aio_return64' > > I tried to link librt.a and libaio.a, it does not work with the following option: > -L/usr/lib/x86_64-linux-gnu -lrt -laio > > Any comments and suggestion will be welcome. > > Shuchi > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Fri Nov 7 16:28:54 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Fri, 7 Nov 2014 16:28:54 -0600 Subject: [mpich-discuss] environment settings to select the preferred TCP netmask In-Reply-To: References: Message-ID: On Fri, Nov 7, 2014 at 4:07 PM, Ebalunode, Jerry O wrote: > Hello all: > I am currently using mpich3.1.3. I need help in tuning the environment > settings to select the prefferred TCP netmask or network. The variables > > MPICH_TCP_NETMASK > MPICH_NETMASK > MPIR_CVAR_NEMESIS_TCP_NETMASK > > don?t seem to the working > MPICHers, do we have variables for this purpose? "mpivars | grep MASK" finds none. > > Also, how can I tune the eager threshhold. Intel MPI uses I_MPI_THRESHOLD > for doing this. Thanks for all the help advance. > mpivars | grep EAGER > > -Jerry > > University of Houston > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Fri Nov 7 17:14:06 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Fri, 7 Nov 2014 23:14:06 +0000 Subject: [mpich-discuss] Force MPICH2 to not use network adapter In-Reply-To: <77c338bc8b1b4f42a9eafcf9c7fc2b6d@CO1PR02MB111.namprd02.prod.outlook.com> References: <2324c0fc6a6e4cf4a20227b491f0998c@BY2PR02MB106.namprd02.prod.outlook.com> <77c338bc8b1b4f42a9eafcf9c7fc2b6d@CO1PR02MB111.namprd02.prod.outlook.com> Message-ID: <50389A74-5E35-4F11-A899-D86CBAE846F2@anl.gov> Hi Igor, Unfortunately, Windows is not supported anymore in MPICH. http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_Why_can.27t_I_build_MPICH_on_Windows_anymore.3F But 1.4.1p1 should be automatically detecting and using shared memory inside the node, and not using the network. Sorry, I know that doesn?t help you. But this is an unsupported platform and version of MPICH. Regards, ? Pavan > On Nov 7, 2014, at 10:07 AM, Igor Raskin wrote: > > Hi Pavan, > > Thank you for your response. > > I should have given more details about the platform and MPICH2 I am using. > > The platform is Windows 7 Professional. > > The version of MPICH2 is 1.4.1 (32-bit) which is latest version of MPICH2 that I can find. > > To what version of MPICH2 should I upgrade? > > Thank you and best regards, > Igor > > -----Original Message----- > From: Balaji, Pavan [mailto:balaji at anl.gov] > Sent: Friday, November 07, 2014 12:50 AM > To: discuss at mpich.org > Subject: Re: [mpich-discuss] Force MPICH2 to not use network adapter > > Igor, > > Based on the fact that you are asking about smpd and -localonly options, I?m assuming you are using a super-ancient vesion of MPICH that dinosaurs used to compute with. Perhaps you can upgrade to the latest version? For the past 10 years or so, we have been automatically detecting shared memory and using it internally without going over the network. > > ? Pavan > >> On Nov 6, 2014, at 8:28 PM, Igor Raskin wrote: >> >> Hello, >> >> I am running a multi-process calculations using MPICH2 from Argonne National Laboratory. The run is on a single machine, so -localonly option of mpiexec is used. Usually everything works. >> If the network adapter is enabled when the run starts, and if I disable it during the run, the run fails with error stating: >> >> op_read error on left context: Error = -1 >> >> op_read error on parent context: Error = -1 >> >> unable to read the cmd header on the left context, Error = -1 . >> unable to read the cmd header on the parent context, Error = -1 . >> Error posting readv, An existing connection was forcibly closed by the >> remote host.(10054) connection to my parent broken, aborting. >> state machine failed. >> However, if the network adapter is disabled when the run is started, I can enable/disable the adapter as many times as I want, and the run still proceeds to the end. >> >> Is there a way to run mpiexec or modify smpd configuration such that MPICH2 is not using network adapter for inter-process communication for local runs even if the adapter is available when the run starts? 
>> >> Thank you, >> >> Igor >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss > > -- > Pavan Balaji ?? > http://www.mcs.anl.gov/~balaji > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From igor.raskin at weblakes.com Fri Nov 7 10:07:38 2014 From: igor.raskin at weblakes.com (Igor Raskin) Date: Fri, 7 Nov 2014 16:07:38 +0000 Subject: [mpich-discuss] Force MPICH2 to not use network adapter In-Reply-To: References: <2324c0fc6a6e4cf4a20227b491f0998c@BY2PR02MB106.namprd02.prod.outlook.com> Message-ID: <77c338bc8b1b4f42a9eafcf9c7fc2b6d@CO1PR02MB111.namprd02.prod.outlook.com> Hi Pavan, Thank you for your response. I should have given more details about the platform and MPICH2 I am using. The platform is Windows 7 Professional. The version of MPICH2 is 1.4.1 (32-bit) which is latest version of MPICH2 that I can find. To what version of MPICH2 should I upgrade? Thank you and best regards, Igor -----Original Message----- From: Balaji, Pavan [mailto:balaji at anl.gov] Sent: Friday, November 07, 2014 12:50 AM To: discuss at mpich.org Subject: Re: [mpich-discuss] Force MPICH2 to not use network adapter Igor, Based on the fact that you are asking about smpd and -localonly options, I?m assuming you are using a super-ancient vesion of MPICH that dinosaurs used to compute with. Perhaps you can upgrade to the latest version? For the past 10 years or so, we have been automatically detecting shared memory and using it internally without going over the network. ? Pavan > On Nov 6, 2014, at 8:28 PM, Igor Raskin wrote: > > Hello, > > I am running a multi-process calculations using MPICH2 from Argonne National Laboratory. The run is on a single machine, so -localonly option of mpiexec is used. Usually everything works. > If the network adapter is enabled when the run starts, and if I disable it during the run, the run fails with error stating: > > op_read error on left context: Error = -1 > > op_read error on parent context: Error = -1 > > unable to read the cmd header on the left context, Error = -1 . > unable to read the cmd header on the parent context, Error = -1 . > Error posting readv, An existing connection was forcibly closed by the > remote host.(10054) connection to my parent broken, aborting. > state machine failed. > However, if the network adapter is disabled when the run is started, I can enable/disable the adapter as many times as I want, and the run still proceeds to the end. > > Is there a way to run mpiexec or modify smpd configuration such that MPICH2 is not using network adapter for inter-process communication for local runs even if the adapter is available when the run starts? 
> > Thank you, > > Igor > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jebaluno at Central.UH.EDU Fri Nov 7 16:07:20 2014 From: jebaluno at Central.UH.EDU (Ebalunode, Jerry O) Date: Fri, 7 Nov 2014 16:07:20 -0600 Subject: [mpich-discuss] environment settings to select the preferred TCP netmask Message-ID: Hello all: I am currently using mpich3.1.3. I need help in tuning the environment settings to select the prefferred TCP netmask or network. The variables MPICH_TCP_NETMASK MPICH_NETMASK MPIR_CVAR_NEMESIS_TCP_NETMASK don?t seem to the working Also, how can I tune the eager threshhold. Intel MPI uses I_MPI_THRESHOLD for doing this. Thanks for all the help advance. -Jerry University of Houston -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From toddwz at gmail.com Fri Nov 7 08:14:44 2014 From: toddwz at gmail.com (Zhen Wang) Date: Fri, 7 Nov 2014 09:14:44 -0500 Subject: [mpich-discuss] Isend and Recv In-Reply-To: References: Message-ID: I did more experiments, and results agrees with this. Thanks a lot for your help! Best regards, Zhen On Thu, Nov 6, 2014 at 11:55 AM, Junchao Zhang wrote: > My understanding is that: In eager mode, sender does not need a Ack > message from receiver (to know recv buf is ready), but in rendezvous > mode, it does. > In your case, if it is in rendezvous mode, Isend() issues the request. But > since it is a nonblocking call, before sender gets the Ack message, it > goes into sleep. The asynchronous progress engine does not progress until > MPI_Wait() is called. At that time, the Ack message is got, and message > passing happens. Therefore, MPI_Recv() finishes after MPI_Wait(). > > --Junchao Zhang > > On Thu, Nov 6, 2014 at 10:33 AM, Zhen Wang wrote: > >> Junchao, >> >> Thanks for your reply. It works! I digged deeper into the eager and >> rendezvous modes, and got confused.. >> >> The docs I read were >> https://computing.llnl.gov/tutorials/mpi_performance/#Protocols and >> http://www-01.ibm.com/support/knowledgecenter/SSFK3V_1.3.0/com.ibm.cluster.pe.v1r3.pe400.doc/am106_eagermess.htm >> . >> >> My understanding is: >> >> In eager mode, the receiver allocates a buffer and an additional copy is >> required to copy the data from the buffer to user allocated space. This is >> good for small messages, not for large ones. >> >> In rendezvous mode, the user allocated space on the receiver must be >> ready to receive the data. But if the space is ready, the MPI_Recv() should >> be completed almost immediately (there's a handshake between the send and >> receiver, and a copy operation). >> >> If I was right, in my code example, MPI_Recv() should finish after >> MPI_Isend() no matter in eager or rendezvous modes. 
The memory has been >> allocated, and the handshake should take like no time. Am I missing >> something? >> >> Thanks a lot. >> >> >> Best regards, >> Zhen >> >> On Wed, Nov 5, 2014 at 2:28 PM, Junchao Zhang >> wrote: >> >>> I think it is because MPICH just crosses over the eager to rendezvous >>> mode threshold, when n goes from 9999 to 99999. OpenMPI certainly uses a >>> different threshold than MPICH. >>> When you install MPICH, a utility program mpivars is also installed. >>> Type 'mpivars | grep EAGER', you will get default values for various eager >>> thresholds. >>> >>> In your case, export MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE=5000000 and you >>> will get the same result as OpenMPI. >>> >>> --Junchao Zhang >>> >>> On Wed, Nov 5, 2014 at 11:20 AM, Zhen Wang wrote: >>> >>>> Hi MPIers, >>>> >>>> I have some questions regarding MPI_Isend() and MPI_Recv(). MPICH-3.1.3 >>>> is used to compile and run the attached code on Red Hat Enterprise Linux >>>> 6.3 (a shared memory machine). While n = 9999, the MPI_Recv() finishes >>>> immediately after MPI_Isend(): (This is what I understand and expect) >>>> >>>> MPI 1: Recv started at 09:53:53. >>>> MPI 0: Isend started at 09:53:53. >>>> MPI 1: Recv finished at 09:53:53. >>>> MPI 0: Isend finished at 09:53:58. >>>> >>>> When n = 99999, I get the following. The MPI_Recv() finishes after >>>> MPI_Wait(): >>>> >>>> MPI 1: Recv started at 09:47:56. >>>> MPI 0: Isend started at 09:47:56. >>>> MPI 0: Isend finished at 09:48:01. >>>> MPI 1: Recv finished at 09:48:01. >>>> >>>> But with OpenMPI 1.8 and n = 99999, MPI_Recv() finishes immediately >>>> after MPI_Isend(): >>>> >>>> MPI 0: Isend started at 09:55:28. >>>> MPI 1: Recv started at 09:55:28. >>>> MPI 1: Recv finished at 09:55:28. >>>> MPI 0: Isend finished at 09:55:33. >>>> >>>> Am I misunderstanding something here? In case the attached code is >>>> dropped, the code is included. Thanks in advance. 
>>>> >>>> >>>> #include "mpi.h" >>>> #include >>>> #include >>>> #include "vector" >>>> #include >>>> >>>> int main(int argc, char* argv[]) >>>> { >>>> MPI_Init(&argc, &argv); >>>> >>>> int rank; >>>> MPI_Comm_rank(MPI_COMM_WORLD, &rank); >>>> >>>> int n = 9999; >>>> std::vector vec(n); >>>> MPI_Request mpiRequest; >>>> MPI_Status mpiStatus; >>>> char tt[9] = {0}; >>>> >>>> MPI_Barrier(MPI_COMM_WORLD); >>>> >>>> if (rank == 0) >>>> { >>>> MPI_Isend(&vec[0], n, MPI_INT, 1, 0, MPI_COMM_WORLD, &mpiRequest); >>>> time_t t = time(0); >>>> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >>>> printf("MPI %d: Isend started at %s.\n", rank, tt); >>>> >>>> //int done = 0; >>>> //while (done == 0) >>>> //{ >>>> // MPI_Test(&mpiRequest, &done, &mpiStatus); >>>> //} >>>> sleep(5); >>>> MPI_Wait(&mpiRequest, &mpiStatus); >>>> >>>> t = time(0); >>>> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >>>> printf("MPI %d: Isend finished at %s.\n", rank, tt); >>>> } >>>> else >>>> { >>>> time_t t = time(0); >>>> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >>>> printf("MPI %d: Recv started at %s.\n", rank, tt); >>>> >>>> MPI_Recv(&vec[0], n, MPI_INT, 0, 0, MPI_COMM_WORLD, &mpiStatus); >>>> >>>> t = time(0); >>>> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >>>> printf("MPI %d: Recv finished at %s.\n", rank, tt); >>>> } >>>> >>>> MPI_Finalize(); >>>> >>>> return 0; >>>> } >>>> >>>> >>>> >>>> Best regards, >>>> Zhen >>>> >>>> _______________________________________________ >>>> discuss mailing list discuss at mpich.org >>>> To manage subscription options or unsubscribe: >>>> https://lists.mpich.org/mailman/listinfo/discuss >>>> >>> >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From balaji at anl.gov Thu Nov 6 23:50:01 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Fri, 7 Nov 2014 05:50:01 +0000 Subject: [mpich-discuss] Force MPICH2 to not use network adapter In-Reply-To: <2324c0fc6a6e4cf4a20227b491f0998c@BY2PR02MB106.namprd02.prod.outlook.com> References: <2324c0fc6a6e4cf4a20227b491f0998c@BY2PR02MB106.namprd02.prod.outlook.com> Message-ID: Igor, Based on the fact that you are asking about smpd and -localonly options, I?m assuming you are using a super-ancient vesion of MPICH that dinosaurs used to compute with. Perhaps you can upgrade to the latest version? For the past 10 years or so, we have been automatically detecting shared memory and using it internally without going over the network. ? Pavan > On Nov 6, 2014, at 8:28 PM, Igor Raskin wrote: > > Hello, > > I am running a multi-process calculations using MPICH2 from Argonne National Laboratory. The run is on a single machine, so -localonly option of mpiexec is used. Usually everything works. 
> If the network adapter is enabled when the run starts, and if I disable it during the run, the run fails with error stating: > > op_read error on left context: Error = -1 > > op_read error on parent context: Error = -1 > > unable to read the cmd header on the left context, Error = -1 > . > unable to read the cmd header on the parent context, Error = -1 > . > Error posting readv, An existing connection was forcibly closed by the remote host.(10054) > connection to my parent broken, aborting. > state machine failed. > However, if the network adapter is disabled when the run is started, I can enable/disable the adapter as many times as I want, and the run still proceeds to the end. > > Is there a way to run mpiexec or modify smpd configuration such that MPICH2 is not using network adapter for inter-process communication for local runs even if the adapter is available when the run starts? > > Thank you, > > Igor > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From igor.raskin at weblakes.com Thu Nov 6 20:28:30 2014 From: igor.raskin at weblakes.com (Igor Raskin) Date: Fri, 7 Nov 2014 02:28:30 +0000 Subject: [mpich-discuss] Force MPICH2 to not use network adapter Message-ID: <2324c0fc6a6e4cf4a20227b491f0998c@BY2PR02MB106.namprd02.prod.outlook.com> Hello, I am running a multi-process calculations using MPICH2 from Argonne National Laboratory. The run is on a single machine, so -localonly option of mpiexec is used. Usually everything works. If the network adapter is enabled when the run starts, and if I disable it during the run, the run fails with error stating: op_read error on left context: Error = -1 op_read error on parent context: Error = -1 unable to read the cmd header on the left context, Error = -1 . unable to read the cmd header on the parent context, Error = -1 . Error posting readv, An existing connection was forcibly closed by the remote host.(10054) connection to my parent broken, aborting. state machine failed. However, if the network adapter is disabled when the run is started, I can enable/disable the adapter as many times as I want, and the run still proceeds to the end. Is there a way to run mpiexec or modify smpd configuration such that MPICH2 is not using network adapter for inter-process communication for local runs even if the adapter is available when the run starts? Thank you, Igor -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Thu Nov 6 10:55:38 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Thu, 6 Nov 2014 10:55:38 -0600 Subject: [mpich-discuss] Isend and Recv In-Reply-To: References: Message-ID: My understanding is that: In eager mode, sender does not need a Ack message from receiver (to know recv buf is ready), but in rendezvous mode, it does. In your case, if it is in rendezvous mode, Isend() issues the request. 
But since it is a nonblocking call, before sender gets the Ack message, it goes into sleep. The asynchronous progress engine does not progress until MPI_Wait() is called. At that time, the Ack message is got, and message passing happens. Therefore, MPI_Recv() finishes after MPI_Wait(). --Junchao Zhang On Thu, Nov 6, 2014 at 10:33 AM, Zhen Wang wrote: > Junchao, > > Thanks for your reply. It works! I digged deeper into the eager and > rendezvous modes, and got confused.. > > The docs I read were > https://computing.llnl.gov/tutorials/mpi_performance/#Protocols and > http://www-01.ibm.com/support/knowledgecenter/SSFK3V_1.3.0/com.ibm.cluster.pe.v1r3.pe400.doc/am106_eagermess.htm > . > > My understanding is: > > In eager mode, the receiver allocates a buffer and an additional copy is > required to copy the data from the buffer to user allocated space. This is > good for small messages, not for large ones. > > In rendezvous mode, the user allocated space on the receiver must be ready > to receive the data. But if the space is ready, the MPI_Recv() should be > completed almost immediately (there's a handshake between the send and > receiver, and a copy operation). > > If I was right, in my code example, MPI_Recv() should finish after > MPI_Isend() no matter in eager or rendezvous modes. The memory has been > allocated, and the handshake should take like no time. Am I missing > something? > > Thanks a lot. > > > Best regards, > Zhen > > On Wed, Nov 5, 2014 at 2:28 PM, Junchao Zhang wrote: > >> I think it is because MPICH just crosses over the eager to rendezvous >> mode threshold, when n goes from 9999 to 99999. OpenMPI certainly uses a >> different threshold than MPICH. >> When you install MPICH, a utility program mpivars is also installed. Type >> 'mpivars | grep EAGER', you will get default values for various eager >> thresholds. >> >> In your case, export MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE=5000000 and you >> will get the same result as OpenMPI. >> >> --Junchao Zhang >> >> On Wed, Nov 5, 2014 at 11:20 AM, Zhen Wang wrote: >> >>> Hi MPIers, >>> >>> I have some questions regarding MPI_Isend() and MPI_Recv(). MPICH-3.1.3 >>> is used to compile and run the attached code on Red Hat Enterprise Linux >>> 6.3 (a shared memory machine). While n = 9999, the MPI_Recv() finishes >>> immediately after MPI_Isend(): (This is what I understand and expect) >>> >>> MPI 1: Recv started at 09:53:53. >>> MPI 0: Isend started at 09:53:53. >>> MPI 1: Recv finished at 09:53:53. >>> MPI 0: Isend finished at 09:53:58. >>> >>> When n = 99999, I get the following. The MPI_Recv() finishes after >>> MPI_Wait(): >>> >>> MPI 1: Recv started at 09:47:56. >>> MPI 0: Isend started at 09:47:56. >>> MPI 0: Isend finished at 09:48:01. >>> MPI 1: Recv finished at 09:48:01. >>> >>> But with OpenMPI 1.8 and n = 99999, MPI_Recv() finishes immediately >>> after MPI_Isend(): >>> >>> MPI 0: Isend started at 09:55:28. >>> MPI 1: Recv started at 09:55:28. >>> MPI 1: Recv finished at 09:55:28. >>> MPI 0: Isend finished at 09:55:33. >>> >>> Am I misunderstanding something here? In case the attached code is >>> dropped, the code is included. Thanks in advance. 
>>> >>> >>> #include "mpi.h" >>> #include >>> #include >>> #include "vector" >>> #include >>> >>> int main(int argc, char* argv[]) >>> { >>> MPI_Init(&argc, &argv); >>> >>> int rank; >>> MPI_Comm_rank(MPI_COMM_WORLD, &rank); >>> >>> int n = 9999; >>> std::vector vec(n); >>> MPI_Request mpiRequest; >>> MPI_Status mpiStatus; >>> char tt[9] = {0}; >>> >>> MPI_Barrier(MPI_COMM_WORLD); >>> >>> if (rank == 0) >>> { >>> MPI_Isend(&vec[0], n, MPI_INT, 1, 0, MPI_COMM_WORLD, &mpiRequest); >>> time_t t = time(0); >>> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >>> printf("MPI %d: Isend started at %s.\n", rank, tt); >>> >>> //int done = 0; >>> //while (done == 0) >>> //{ >>> // MPI_Test(&mpiRequest, &done, &mpiStatus); >>> //} >>> sleep(5); >>> MPI_Wait(&mpiRequest, &mpiStatus); >>> >>> t = time(0); >>> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >>> printf("MPI %d: Isend finished at %s.\n", rank, tt); >>> } >>> else >>> { >>> time_t t = time(0); >>> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >>> printf("MPI %d: Recv started at %s.\n", rank, tt); >>> >>> MPI_Recv(&vec[0], n, MPI_INT, 0, 0, MPI_COMM_WORLD, &mpiStatus); >>> >>> t = time(0); >>> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >>> printf("MPI %d: Recv finished at %s.\n", rank, tt); >>> } >>> >>> MPI_Finalize(); >>> >>> return 0; >>> } >>> >>> >>> >>> Best regards, >>> Zhen >>> >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >>> >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From toddwz at gmail.com Thu Nov 6 10:33:42 2014 From: toddwz at gmail.com (Zhen Wang) Date: Thu, 6 Nov 2014 11:33:42 -0500 Subject: [mpich-discuss] Isend and Recv In-Reply-To: References: Message-ID: Junchao, Thanks for your reply. It works! I digged deeper into the eager and rendezvous modes, and got confused.. The docs I read were https://computing.llnl.gov/tutorials/mpi_performance/#Protocols and http://www-01.ibm.com/support/knowledgecenter/SSFK3V_1.3.0/com.ibm.cluster.pe.v1r3.pe400.doc/am106_eagermess.htm . My understanding is: In eager mode, the receiver allocates a buffer and an additional copy is required to copy the data from the buffer to user allocated space. This is good for small messages, not for large ones. In rendezvous mode, the user allocated space on the receiver must be ready to receive the data. But if the space is ready, the MPI_Recv() should be completed almost immediately (there's a handshake between the send and receiver, and a copy operation). If I was right, in my code example, MPI_Recv() should finish after MPI_Isend() no matter in eager or rendezvous modes. The memory has been allocated, and the handshake should take like no time. Am I missing something? Thanks a lot. 
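One way to see the behaviour discussed in this thread is to keep the sender's progress engine moving with the MPI_Test loop that is commented out in the posted code. Below is a minimal sketch (the 99999-int buffer and the roughly 5-second cutoff are illustrative values, not taken from the original program). Built with mpicc and launched with mpiexec -n 2, the receive should then complete as soon as the rendezvous transfer finishes, instead of only after the sender reaches MPI_Wait().

/* Sketch only: the sender polls MPI_Test while it is "busy" so that a
 * rendezvous-mode message can complete early on the receiver. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, done = 0;
    int n = 99999;              /* large enough to exceed the default eager threshold */
    int *buf;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = (int *) calloc(n, sizeof(int));

    if (rank == 0) {
        double t0;
        MPI_Isend(buf, n, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        t0 = MPI_Wtime();
        /* Stand-in for up to ~5 s of other work: calling MPI_Test each
         * iteration drives the handshake, so rank 1 does not have to wait
         * until rank 0 calls MPI_Wait. */
        while (!done && MPI_Wtime() - t0 < 5.0)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* no-op if the send already completed */
        printf("rank 0: send done after %.1f s\n", MPI_Wtime() - t0);
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: receive done\n");
    }

    free(buf);
    MPI_Finalize();
    return 0;
}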
Best regards, Zhen On Wed, Nov 5, 2014 at 2:28 PM, Junchao Zhang wrote: > I think it is because MPICH just crosses over the eager to rendezvous mode > threshold, when n goes from 9999 to 99999. OpenMPI certainly uses a > different threshold than MPICH. > When you install MPICH, a utility program mpivars is also installed. Type > 'mpivars | grep EAGER', you will get default values for various eager > thresholds. > > In your case, export MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE=5000000 and you will > get the same result as OpenMPI. > > --Junchao Zhang > > On Wed, Nov 5, 2014 at 11:20 AM, Zhen Wang wrote: > >> Hi MPIers, >> >> I have some questions regarding MPI_Isend() and MPI_Recv(). MPICH-3.1.3 >> is used to compile and run the attached code on Red Hat Enterprise Linux >> 6.3 (a shared memory machine). While n = 9999, the MPI_Recv() finishes >> immediately after MPI_Isend(): (This is what I understand and expect) >> >> MPI 1: Recv started at 09:53:53. >> MPI 0: Isend started at 09:53:53. >> MPI 1: Recv finished at 09:53:53. >> MPI 0: Isend finished at 09:53:58. >> >> When n = 99999, I get the following. The MPI_Recv() finishes after >> MPI_Wait(): >> >> MPI 1: Recv started at 09:47:56. >> MPI 0: Isend started at 09:47:56. >> MPI 0: Isend finished at 09:48:01. >> MPI 1: Recv finished at 09:48:01. >> >> But with OpenMPI 1.8 and n = 99999, MPI_Recv() finishes immediately after >> MPI_Isend(): >> >> MPI 0: Isend started at 09:55:28. >> MPI 1: Recv started at 09:55:28. >> MPI 1: Recv finished at 09:55:28. >> MPI 0: Isend finished at 09:55:33. >> >> Am I misunderstanding something here? In case the attached code is >> dropped, the code is included. Thanks in advance. >> >> >> #include "mpi.h" >> #include >> #include >> #include "vector" >> #include >> >> int main(int argc, char* argv[]) >> { >> MPI_Init(&argc, &argv); >> >> int rank; >> MPI_Comm_rank(MPI_COMM_WORLD, &rank); >> >> int n = 9999; >> std::vector vec(n); >> MPI_Request mpiRequest; >> MPI_Status mpiStatus; >> char tt[9] = {0}; >> >> MPI_Barrier(MPI_COMM_WORLD); >> >> if (rank == 0) >> { >> MPI_Isend(&vec[0], n, MPI_INT, 1, 0, MPI_COMM_WORLD, &mpiRequest); >> time_t t = time(0); >> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >> printf("MPI %d: Isend started at %s.\n", rank, tt); >> >> //int done = 0; >> //while (done == 0) >> //{ >> // MPI_Test(&mpiRequest, &done, &mpiStatus); >> //} >> sleep(5); >> MPI_Wait(&mpiRequest, &mpiStatus); >> >> t = time(0); >> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >> printf("MPI %d: Isend finished at %s.\n", rank, tt); >> } >> else >> { >> time_t t = time(0); >> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >> printf("MPI %d: Recv started at %s.\n", rank, tt); >> >> MPI_Recv(&vec[0], n, MPI_INT, 0, 0, MPI_COMM_WORLD, &mpiStatus); >> >> t = time(0); >> strftime(tt, 9, "%H:%M:%S", localtime(&t)); >> printf("MPI %d: Recv finished at %s.\n", rank, tt); >> } >> >> MPI_Finalize(); >> >> return 0; >> } >> >> >> >> Best regards, >> Zhen >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From joni.kurronen at gmail.com Thu Nov 6 07:17:45 2014 From: joni.kurronen at gmail.com (Joni-Pekka Kurronen) Date: Thu, 6 Nov 2014 15:17:45 +0200 Subject: [mpich-discuss] Installation problem Ubuntu 14.04 LTS Message-ID: <1415279865.26307.5.camel@mpi1.kurrola.dy.fi> CAN NOT FIND -lblcr,... I have blcr version 0.8.6_b4 and it compiles whitout problem's whit version 3.0.4,... 3.1.3 ./config options have changes,... I assume options have changed,... Here is error message at end of make: make[2]: Poistutaan hakemistosta "/mpi3/S4/mpich-3.1.3/src/pm/hydra" Making install in . make[2]: Siirryt??n hakemistoon "/mpi3/S4/mpich-3.1.3" GEN lib/libmpi.la /usr/bin/ld: cannot find -lblcr collect2: error: ld returned 1 exit status make[2]: *** [lib/libmpi.la] Virhe 1 make[2]: Poistutaan hakemistosta "/mpi3/S4/mpich-3.1.3" make[1]: *** [install-recursive] Virhe 1 make[1]: Poistutaan hakemistosta "/mpi3/S4/mpich-3.1.3" make: *** [install] Virhe 2 joni at mpi1:~$ ========= build script used: export JPK_THREADS="-L/usr/lib/x86_64-linux-gnu/libevent_pthreads-2.0.so.5 -L/usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5" export LD_FLAGS="$JPK_THREADS" export CFLAGS="-Wl,--no-as-needed -fPIC -m64 -pthread -O3 -fopenmp -lgomp -L$JPK_BLCR/lib/libcr.so" export CC=gcc export CXX=g++ export FC=gfortran export F77=gfortran export MPICH2LIB_CFLAGS="$CFLAGS $JPK_THREADS" export MPICH2LIB_CXXFLAGS="$CFLAGS $JPK_THREADS" export MPICH2LIB_CPPFLAGS="$CFLAGS $JPK_THREADS" export CXXFLAGS=$CFLAGS export FCFLAGS="$CFLAGS" # -fdefault-real-8 -fdefault-double-8" export FFLAGS="$CFLAGS" # -fdefault-real-8 -fdefault-double-8" export F90FLAGS="" export F77FLAGS="" if [ "$1" == "c" ]; then cd $JPK_MPICH2_S make distclean fi cd $JPK_MPICH2_S ./configure --prefix=$JPK_MPICH2 --enable-fast=all,O3 --with-thread-package=pthreads --with-device=ch3:nemesis --with-gnu-ld --enable-checkpointing --with-blcr=$JPK_BLCR --with-blcr-include= $JPK_BLCR/include --with-blcr-lib=$JPK_BLCR/lib --with-pm=hydra --with-hydrabss=rsh --enable-threads=runtime --disable-smpcoll --enable-static --enable-shared 2>&1 | tee $JPK_MPI_DIR/$JPK_VER_B/mes/mpich2.config make -j$JPK_JOBS 2>&1 | tee $JPK_MPI_DIR/$JPK_VER_B/mes/mpich2.make sudo make install 2>&1 | tee $JPK_MPI_DIR/$JPK_VER_B/mes/mpich2.install _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From wbland at anl.gov Thu Nov 6 09:52:38 2014 From: wbland at anl.gov (Bland, Wesley B.) Date: Thu, 6 Nov 2014 15:52:38 +0000 Subject: [mpich-discuss] Installation problem Ubuntu 14.04 LTS In-Reply-To: <1415279865.26307.5.camel@mpi1.kurrola.dy.fi> References: <1415279865.26307.5.camel@mpi1.kurrola.dy.fi> Message-ID: <973B0764-53FC-4A62-9D8D-AA4D84C32BA1@anl.gov> Unfortunately, BLCR hasn?t been working for quite some time now. At some point, we may revive support for it, but at the moment, you?ll have to use an old version if you want to use it. Thanks, Wesley > On Nov 6, 2014, at 7:17 AM, Joni-Pekka Kurronen wrote: > > > CAN NOT FIND -lblcr,... I have blcr version 0.8.6_b4 and > it compiles whitout problem's whit version 3.0.4,... > > 3.1.3 ./config options have changes,... I assume options > have changed,... 
> > Here is error message at end of make: > > make[2]: Poistutaan hakemistosta "/mpi3/S4/mpich-3.1.3/src/pm/hydra" > Making install in . > make[2]: Siirryt??n hakemistoon "/mpi3/S4/mpich-3.1.3" > GEN lib/libmpi.la > /usr/bin/ld: cannot find -lblcr > collect2: error: ld returned 1 exit status > make[2]: *** [lib/libmpi.la] Virhe 1 > make[2]: Poistutaan hakemistosta "/mpi3/S4/mpich-3.1.3" > make[1]: *** [install-recursive] Virhe 1 > make[1]: Poistutaan hakemistosta "/mpi3/S4/mpich-3.1.3" > make: *** [install] Virhe 2 > joni at mpi1:~$ > > > ========= > > build script used: > > export > JPK_THREADS="-L/usr/lib/x86_64-linux-gnu/libevent_pthreads-2.0.so.5 > -L/usr/lib/x86_64-linux-gnu/libevent_core-2.0.so.5" > > export LD_FLAGS="$JPK_THREADS" > > export CFLAGS="-Wl,--no-as-needed -fPIC -m64 -pthread -O3 -fopenmp > -lgomp -L$JPK_BLCR/lib/libcr.so" > > export CC=gcc > export CXX=g++ > export FC=gfortran > export F77=gfortran > export MPICH2LIB_CFLAGS="$CFLAGS $JPK_THREADS" > export MPICH2LIB_CXXFLAGS="$CFLAGS $JPK_THREADS" > export MPICH2LIB_CPPFLAGS="$CFLAGS $JPK_THREADS" > > export CXXFLAGS=$CFLAGS > export FCFLAGS="$CFLAGS" # -fdefault-real-8 -fdefault-double-8" > export FFLAGS="$CFLAGS" # -fdefault-real-8 -fdefault-double-8" > export F90FLAGS="" > export F77FLAGS="" > > > if [ "$1" == "c" ]; then > cd $JPK_MPICH2_S > make distclean > fi > > cd $JPK_MPICH2_S > > ./configure --prefix=$JPK_MPICH2 --enable-fast=all,O3 > --with-thread-package=pthreads --with-device=ch3:nemesis --with-gnu-ld > --enable-checkpointing --with-blcr=$JPK_BLCR --with-blcr-include= > $JPK_BLCR/include --with-blcr-lib=$JPK_BLCR/lib --with-pm=hydra > --with-hydrabss=rsh --enable-threads=runtime --disable-smpcoll > --enable-static --enable-shared 2>&1 | tee > $JPK_MPI_DIR/$JPK_VER_B/mes/mpich2.config > > > make -j$JPK_JOBS 2>&1 | tee $JPK_MPI_DIR/$JPK_VER_B/mes/mpich2.make > sudo make install 2>&1 | tee $JPK_MPI_DIR/$JPK_VER_B/mes/mpich2.install > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From xinzhao3 at illinois.edu Wed Nov 5 12:59:52 2014 From: xinzhao3 at illinois.edu (Zhao, Xin) Date: Wed, 5 Nov 2014 18:59:52 +0000 Subject: [mpich-discuss] MPICH examples In-Reply-To: <545A5C33.1020807@uah.edu> References: <545A5C33.1020807@uah.edu> Message-ID: <0A407957589BAB4F924824150C4293EF4C4A95CA@CITESMBX3.ad.uillinois.edu> Hi Mike, I think you mean test/mpi/rma, not example/rma. Under test/mpi/rma, there are two tests that are not in testlist, one is ircpi.c, which is an interactive test. We will move it to examples. Another is nb_test.c, which is related to this bug: https://trac.mpich.org/projects/mpich/ticket/1910. The bug exists in mpich-3.1.3 but test code is not correct, that's why we comment out it for now. You can also find explanation about it in test/mpi/rma/testlist.in. Thanks, Xin ________________________________________ From: Michael L. Stokes [Michael.Stokes at uah.edu] Sent: Wednesday, November 05, 2014 11:19 AM To: discuss at mpich.org Subject: [mpich-discuss] MPICH examples Under examples/rma there are several .c files that are not listed in testlist (for example, nb_test.c). 
When I look at the source for nb_test.c it appears it should require at least 2 processes to run correctly, but when I run it with n>=2 it hangs. Do 2 questions. 1) What what is going on with applications not listed in testlist, and 2) for is going on with nb_test.c? Thanks Mike _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From jczhang at mcs.anl.gov Wed Nov 5 13:28:28 2014 From: jczhang at mcs.anl.gov (Junchao Zhang) Date: Wed, 5 Nov 2014 13:28:28 -0600 Subject: [mpich-discuss] Isend and Recv In-Reply-To: References: Message-ID: I think it is because MPICH just crosses over the eager to rendezvous mode threshold, when n goes from 9999 to 99999. OpenMPI certainly uses a different threshold than MPICH. When you install MPICH, a utility program mpivars is also installed. Type 'mpivars | grep EAGER', you will get default values for various eager thresholds. In your case, export MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE=5000000 and you will get the same result as OpenMPI. --Junchao Zhang On Wed, Nov 5, 2014 at 11:20 AM, Zhen Wang wrote: > Hi MPIers, > > I have some questions regarding MPI_Isend() and MPI_Recv(). MPICH-3.1.3 is > used to compile and run the attached code on Red Hat Enterprise Linux 6.3 > (a shared memory machine). While n = 9999, the MPI_Recv() finishes > immediately after MPI_Isend(): (This is what I understand and expect) > > MPI 1: Recv started at 09:53:53. > MPI 0: Isend started at 09:53:53. > MPI 1: Recv finished at 09:53:53. > MPI 0: Isend finished at 09:53:58. > > When n = 99999, I get the following. The MPI_Recv() finishes after > MPI_Wait(): > > MPI 1: Recv started at 09:47:56. > MPI 0: Isend started at 09:47:56. > MPI 0: Isend finished at 09:48:01. > MPI 1: Recv finished at 09:48:01. > > But with OpenMPI 1.8 and n = 99999, MPI_Recv() finishes immediately after > MPI_Isend(): > > MPI 0: Isend started at 09:55:28. > MPI 1: Recv started at 09:55:28. > MPI 1: Recv finished at 09:55:28. > MPI 0: Isend finished at 09:55:33. > > Am I misunderstanding something here? In case the attached code is > dropped, the code is included. Thanks in advance. 
> > > #include "mpi.h" > #include > #include > #include "vector" > #include > > int main(int argc, char* argv[]) > { > MPI_Init(&argc, &argv); > > int rank; > MPI_Comm_rank(MPI_COMM_WORLD, &rank); > > int n = 9999; > std::vector vec(n); > MPI_Request mpiRequest; > MPI_Status mpiStatus; > char tt[9] = {0}; > > MPI_Barrier(MPI_COMM_WORLD); > > if (rank == 0) > { > MPI_Isend(&vec[0], n, MPI_INT, 1, 0, MPI_COMM_WORLD, &mpiRequest); > time_t t = time(0); > strftime(tt, 9, "%H:%M:%S", localtime(&t)); > printf("MPI %d: Isend started at %s.\n", rank, tt); > > //int done = 0; > //while (done == 0) > //{ > // MPI_Test(&mpiRequest, &done, &mpiStatus); > //} > sleep(5); > MPI_Wait(&mpiRequest, &mpiStatus); > > t = time(0); > strftime(tt, 9, "%H:%M:%S", localtime(&t)); > printf("MPI %d: Isend finished at %s.\n", rank, tt); > } > else > { > time_t t = time(0); > strftime(tt, 9, "%H:%M:%S", localtime(&t)); > printf("MPI %d: Recv started at %s.\n", rank, tt); > > MPI_Recv(&vec[0], n, MPI_INT, 0, 0, MPI_COMM_WORLD, &mpiStatus); > > t = time(0); > strftime(tt, 9, "%H:%M:%S", localtime(&t)); > printf("MPI %d: Recv finished at %s.\n", rank, tt); > } > > MPI_Finalize(); > > return 0; > } > > > > Best regards, > Zhen > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From Michael.Stokes at uah.edu Wed Nov 5 11:19:47 2014 From: Michael.Stokes at uah.edu (Michael L. Stokes) Date: Wed, 5 Nov 2014 11:19:47 -0600 Subject: [mpich-discuss] MPICH examples Message-ID: <545A5C33.1020807@uah.edu> Under examples/rma there are several .c files that are not listed in testlist (for example, nb_test.c). When I look at the source for nb_test.c it appears it should require at least 2 processes to run correctly, but when I run it with n>=2 it hangs. Do 2 questions. 1) What what is going on with applications not listed in testlist, and 2) for is going on with nb_test.c? Thanks Mike _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From toddwz at gmail.com Wed Nov 5 11:20:47 2014 From: toddwz at gmail.com (Zhen Wang) Date: Wed, 5 Nov 2014 12:20:47 -0500 Subject: [mpich-discuss] Isend and Recv Message-ID: Hi MPIers, I have some questions regarding MPI_Isend() and MPI_Recv(). MPICH-3.1.3 is used to compile and run the attached code on Red Hat Enterprise Linux 6.3 (a shared memory machine). While n = 9999, the MPI_Recv() finishes immediately after MPI_Isend(): (This is what I understand and expect) MPI 1: Recv started at 09:53:53. MPI 0: Isend started at 09:53:53. MPI 1: Recv finished at 09:53:53. MPI 0: Isend finished at 09:53:58. When n = 99999, I get the following. The MPI_Recv() finishes after MPI_Wait(): MPI 1: Recv started at 09:47:56. MPI 0: Isend started at 09:47:56. MPI 0: Isend finished at 09:48:01. MPI 1: Recv finished at 09:48:01. But with OpenMPI 1.8 and n = 99999, MPI_Recv() finishes immediately after MPI_Isend(): MPI 0: Isend started at 09:55:28. MPI 1: Recv started at 09:55:28. MPI 1: Recv finished at 09:55:28. 
MPI 0: Isend finished at 09:55:33. Am I misunderstanding something here? In case the attached code is dropped, the code is included. Thanks in advance. #include "mpi.h" #include #include #include "vector" #include int main(int argc, char* argv[]) { MPI_Init(&argc, &argv); int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank); int n = 9999; std::vector vec(n); MPI_Request mpiRequest; MPI_Status mpiStatus; char tt[9] = {0}; MPI_Barrier(MPI_COMM_WORLD); if (rank == 0) { MPI_Isend(&vec[0], n, MPI_INT, 1, 0, MPI_COMM_WORLD, &mpiRequest); time_t t = time(0); strftime(tt, 9, "%H:%M:%S", localtime(&t)); printf("MPI %d: Isend started at %s.\n", rank, tt); //int done = 0; //while (done == 0) //{ // MPI_Test(&mpiRequest, &done, &mpiStatus); //} sleep(5); MPI_Wait(&mpiRequest, &mpiStatus); t = time(0); strftime(tt, 9, "%H:%M:%S", localtime(&t)); printf("MPI %d: Isend finished at %s.\n", rank, tt); } else { time_t t = time(0); strftime(tt, 9, "%H:%M:%S", localtime(&t)); printf("MPI %d: Recv started at %s.\n", rank, tt); MPI_Recv(&vec[0], n, MPI_INT, 0, 0, MPI_COMM_WORLD, &mpiStatus); t = time(0); strftime(tt, 9, "%H:%M:%S", localtime(&t)); printf("MPI %d: Recv finished at %s.\n", rank, tt); } MPI_Finalize(); return 0; } Best regards, Zhen -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: a.cpp Type: text/x-c++src Size: 1233 bytes Desc: not available URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From raffenet at mcs.anl.gov Tue Nov 4 14:12:30 2014 From: raffenet at mcs.anl.gov (Kenneth Raffenetti) Date: Tue, 4 Nov 2014 14:12:30 -0600 Subject: [mpich-discuss] Question regarding MPICH Examples In-Reply-To: <54593178.80803@uah.edu> References: <54467F8B.9080400@uah.edu> <54591590.7050103@uah.edu> <54591715.70900@mcs.anl.gov> <54593178.80803@uah.edu> Message-ID: <5459332E.5010500@mcs.anl.gov> This wiki page has info for Hydra's (mpiexec) envvars. https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Environment_Settings For all others, see README.envvar in the base MPICH source directory. Ken On 11/04/2014 02:05 PM, Michael L. Stokes wrote: > Kenneth, > > If I were to run these examples outside of the provided scripts, how > would I used > these environment variables? There seems to be a multitude of ways > environment > variables can be specified. > > Thanks > Mike > > On 11/04/2014 12:12 PM, Kenneth Raffenetti wrote: >> Hi Mike, >> >> Those testlist files are interpreted by the mpich/test/mpi/runtests >> script to set actual MPICH/Hydra envvars. >> >> Ken >> >> On 11/04/2014 12:06 PM, Michael L. Stokes wrote: >>> In the file testlist (i.e rma directory) you have what looks like >>> environment variables like timeLimit=600, mpiversion=3.0 >>> listed by the names of the tests (as well as the number of processors to >>> pass to mpirun). >>> How are these values used? I looked in the mtest.c routines and there >>> are environment variables (i.e. MPITEST_DEBUG, MPITEST_VERBOSE) that you >>> read >>> to control execution of the tests, but I don't see the ones listed here. 
>>> >>> Thanks >>> Mike >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From wgropp at illinois.edu Tue Nov 4 14:30:51 2014 From: wgropp at illinois.edu (William Gropp) Date: Tue, 4 Nov 2014 14:30:51 -0600 Subject: [mpich-discuss] Question regarding MPICH Examples In-Reply-To: <54593178.80803@uah.edu> References: <54467F8B.9080400@uah.edu> <54591590.7050103@uah.edu> <54591715.70900@mcs.anl.gov> <54593178.80803@uah.edu> Message-ID: Note that some of these are not environment variables. The mpiversion number is compared to the test level, and controls whether that test is run. The timelimit option is really meaningful only for a script - its intended to terminate hangs; people are pretty good at that when running these manually. Bill On Nov 4, 2014, at 2:05 PM, Michael L. Stokes wrote: > Kenneth, > > If I were to run these examples outside of the provided scripts, how would I used > these environment variables? There seems to be a multitude of ways environment > variables can be specified. > > Thanks > Mike > > On 11/04/2014 12:12 PM, Kenneth Raffenetti wrote: >> Hi Mike, >> >> Those testlist files are interpreted by the mpich/test/mpi/runtests script to set actual MPICH/Hydra envvars. >> >> Ken >> >> On 11/04/2014 12:06 PM, Michael L. Stokes wrote: >>> In the file testlist (i.e rma directory) you have what looks like >>> environment variables like timeLimit=600, mpiversion=3.0 >>> listed by the names of the tests (as well as the number of processors to >>> pass to mpirun). >>> How are these values used? I looked in the mtest.c routines and there >>> are environment variables (i.e. MPITEST_DEBUG, MPITEST_VERBOSE) that you >>> read >>> to control execution of the tests, but I don't see the ones listed here. 
>>> >>> Thanks >>> Mike >>> _______________________________________________ >>> discuss mailing list discuss at mpich.org >>> To manage subscription options or unsubscribe: >>> https://lists.mpich.org/mailman/listinfo/discuss >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From raffenet at mcs.anl.gov Tue Nov 4 12:12:37 2014 From: raffenet at mcs.anl.gov (Kenneth Raffenetti) Date: Tue, 4 Nov 2014 12:12:37 -0600 Subject: [mpich-discuss] Question regarding MPICH Examples In-Reply-To: <54591590.7050103@uah.edu> References: <54467F8B.9080400@uah.edu> <54591590.7050103@uah.edu> Message-ID: <54591715.70900@mcs.anl.gov> Hi Mike, Those testlist files are interpreted by the mpich/test/mpi/runtests script to set actual MPICH/Hydra envvars. Ken On 11/04/2014 12:06 PM, Michael L. Stokes wrote: > In the file testlist (i.e rma directory) you have what looks like > environment variables like timeLimit=600, mpiversion=3.0 > listed by the names of the tests (as well as the number of processors to > pass to mpirun). > How are these values used? I looked in the mtest.c routines and there > are environment variables (i.e. MPITEST_DEBUG, MPITEST_VERBOSE) that you > read > to control execution of the tests, but I don't see the ones listed here. > > Thanks > Mike > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From Michael.Stokes at uah.edu Tue Nov 4 14:05:12 2014 From: Michael.Stokes at uah.edu (Michael L. Stokes) Date: Tue, 4 Nov 2014 14:05:12 -0600 Subject: [mpich-discuss] Question regarding MPICH Examples In-Reply-To: <54591715.70900@mcs.anl.gov> References: <54467F8B.9080400@uah.edu> <54591590.7050103@uah.edu> <54591715.70900@mcs.anl.gov> Message-ID: <54593178.80803@uah.edu> Kenneth, If I were to run these examples outside of the provided scripts, how would I used these environment variables? There seems to be a multitude of ways environment variables can be specified. Thanks Mike On 11/04/2014 12:12 PM, Kenneth Raffenetti wrote: > Hi Mike, > > Those testlist files are interpreted by the mpich/test/mpi/runtests > script to set actual MPICH/Hydra envvars. > > Ken > > On 11/04/2014 12:06 PM, Michael L. Stokes wrote: >> In the file testlist (i.e rma directory) you have what looks like >> environment variables like timeLimit=600, mpiversion=3.0 >> listed by the names of the tests (as well as the number of processors to >> pass to mpirun). >> How are these values used? I looked in the mtest.c routines and there >> are environment variables (i.e. MPITEST_DEBUG, MPITEST_VERBOSE) that you >> read >> to control execution of the tests, but I don't see the ones listed here. 
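For running a test by hand, the MPITEST_* settings are ordinary environment variables that the test harness reads with getenv(), so they can simply be exported in the shell before launching the binary with mpiexec (for example, setting MPITEST_VERBOSE=1 before running a hypothetical ./sometest). The following is only a rough sketch of how such variables are typically honoured, not the actual mtest.c source:

/* Illustrative sketch: honour MPITEST_VERBOSE / MPITEST_DEBUG when a
 * test program is launched outside the runtests script. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int verbose = 0, debug = 0;

static void read_test_env(void)
{
    const char *s;
    if ((s = getenv("MPITEST_VERBOSE")) != NULL && atoi(s) != 0)
        verbose = 1;
    if ((s = getenv("MPITEST_DEBUG")) != NULL && atoi(s) != 0)
        debug = 1;
}

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    read_test_env();
    if (verbose && rank == 0)
        printf("verbose output enabled\n");
    if (debug)
        printf("rank %d: debug output enabled\n", rank);
    /* ... actual test body would go here ... */
    MPI_Finalize();
    return 0;
}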
>> >> Thanks >> Mike >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From wbland at anl.gov Mon Nov 3 15:35:38 2014 From: wbland at anl.gov (Wesley Bland) Date: Mon, 3 Nov 2014 15:35:38 -0600 Subject: [mpich-discuss] Assertion Failure using MPICH3 RMA In-Reply-To: References: Message-ID: <62DDE5E6-778D-414F-A5A7-DA09C6EF9ADC@anl.gov> Have you tried updating to use the latest version of MPICH? You said you haven?t seen the issue when you run with 3.1. We?re now on 3.1.3 and I believe a number of RMA bug fixes have gone in recently. Thanks, Wesley > On Nov 3, 2014, at 3:14 PM, Corey A. Henderson wrote: > > About 1 in 100 runs on my local desktop, while developing an MPI code that uses MPI 3-0 RMA shared-lock features, I see the following assertion failure. The assertion does not fail at the same point in a run, or following any pattern that I can see. It has not happened on a cluster I use that is running MPICH3.1, but I haven't run my code there very often yet. > > Failure text: > > Assertion failed in file /mpich-3.0.4/src/mpid/ch3/src/ch3u_rma_sync.c at line 2803: win_ptr->targets[target_rank].remote_lock_state == MPIDI_CH3_WIN_LOCK_REQUESTED || win_ptr->targets[target_rank].remote_lock_state == MPIDI_CH3_WIN_LOCK_GRANTED > internal ABORT - process 0 > > I have not attempted to recreate the issue with a smaller code snippet because I am not sure where to even begin to do so. Can anyone suggest to me where I might start to look for the cause of this? > > Some notes on what the code does: > > - One window per node of fixed size opened (MPI_Win_allocate) at program start. > - All windows locked (shared) after creation with MPI_Win_lock_all > - Code uses MPI_Fetch_and_op and MPI_Compare_and_swap on a few concurrently-accessed MPI_INT locations in a node's window > - Code uses GET/PUT for data access to the rest of the window on any node (those portions that are not accessed concurrently) > > MPICH is v3.0.4 on 64-bit Ubuntu 12.04 LTS in a single-machine configuration. > > The error occurs regardless of how many MPI processes I may be testing with at any given time. I have not nailed down where in the code to trace to see why the error occurs because I don't know what could cause this (which is why I'm asking). The MPI messaging portion of the code hasn't changed in a couple of months, but I'm starting to run my code more often and for longer periods now as it nears completion. > > Any help on where to begin tracing to fix this would be great. > > > -- > Corey A. Henderson > PhD Candidate and NSF Graduate Fellow > Dept. of Engineering Physics > Univ. 
of Wisconsin - Madison > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From Michael.Stokes at uah.edu Tue Nov 4 12:06:08 2014 From: Michael.Stokes at uah.edu (Michael L. Stokes) Date: Tue, 4 Nov 2014 12:06:08 -0600 Subject: [mpich-discuss] Question regarding MPICH Examples In-Reply-To: <54467F8B.9080400@uah.edu> References: <54467F8B.9080400@uah.edu> Message-ID: <54591590.7050103@uah.edu> In the file testlist (i.e rma directory) you have what looks like environment variables like timeLimit=600, mpiversion=3.0 listed by the names of the tests (as well as the number of processors to pass to mpirun). How are these values used? I looked in the mtest.c routines and there are environment variables (i.e. MPITEST_DEBUG, MPITEST_VERBOSE) that you read to control execution of the tests, but I don't see the ones listed here. Thanks Mike _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From cahenderson at wisc.edu Mon Nov 3 15:14:44 2014 From: cahenderson at wisc.edu (Corey A. Henderson) Date: Mon, 3 Nov 2014 15:14:44 -0600 Subject: [mpich-discuss] Assertion Failure using MPICH3 RMA Message-ID: About 1 in 100 runs on my local desktop, while developing an MPI code that uses MPI 3-0 RMA shared-lock features, I see the following assertion failure. The assertion does not fail at the same point in a run, or following any pattern that I can see. It has not happened on a cluster I use that is running MPICH3.1, but I haven't run my code there very often yet. Failure text: Assertion failed in file /mpich-3.0.4/src/mpid/ch3/src/ch3u_rma_sync.c at line 2803: win_ptr->targets[target_rank].remote_lock_state == MPIDI_CH3_WIN_LOCK_REQUESTED || win_ptr->targets[target_rank].remote_lock_state == MPIDI_CH3_WIN_LOCK_GRANTED internal ABORT - process 0 I have not attempted to recreate the issue with a smaller code snippet because I am not sure where to even begin to do so. Can anyone suggest to me where I might start to look for the cause of this? Some notes on what the code does: - One window per node of fixed size opened (MPI_Win_allocate) at program start. - All windows locked (shared) after creation with MPI_Win_lock_all - Code uses MPI_Fetch_and_op and MPI_Compare_and_swap on a few concurrently-accessed MPI_INT locations in a node's window - Code uses GET/PUT for data access to the rest of the window on any node (those portions that are not accessed concurrently) MPICH is v3.0.4 on 64-bit Ubuntu 12.04 LTS in a single-machine configuration. The error occurs regardless of how many MPI processes I may be testing with at any given time. I have not nailed down where in the code to trace to see why the error occurs because I don't know what could cause this (which is why I'm asking). The MPI messaging portion of the code hasn't changed in a couple of months, but I'm starting to run my code more often and for longer periods now as it nears completion. Any help on where to begin tracing to fix this would be great. -- Corey A. Henderson PhD Candidate and NSF Graduate Fellow Dept. of Engineering Physics Univ. 
of Wisconsin - Madison -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From sseo at anl.gov Sun Nov 2 20:05:46 2014 From: sseo at anl.gov (Seo, Sangmin) Date: Mon, 3 Nov 2014 02:05:46 +0000 Subject: [mpich-discuss] Build Problem mpich-3.1.3 In-Reply-To: <469E36D0-19E3-4BDF-BA15-1DAD5BA773C2@anl.gov> References: <8B38871795FD7042B826C1D1670246FBE240BC2D@EU-MBX-01.mgc.mentorg.com> <469E36D0-19E3-4BDF-BA15-1DAD5BA773C2@anl.gov> Message-ID: <41307990-0A71-4CC7-B9AC-FF3EDDDF8AB9@anl.gov> Hi Hirak, I?d like to say again this is a bug of configure because it does nothing to correct the situation when the valgrind headers are broken, even though it causes make to fail. I have created a ticket for this problem, https://trac.mpich.org/projects/mpich/ticket/2195, and cc?ed you to the ticket. Thanks, Sangmin On Nov 2, 2014, at 9:53 AM, Seo, Sangmin > wrote: Hi Hirak, It may not be a bug. It happens when the valgrind headers are broken or too old. Can you send us c.txt (configure logs) that you got on the compilation error? Regards, Sangmin On Nov 2, 2014, at 8:52 AM, Roy, Hirak > wrote: Thanks Sangmin. ??without-valgrind? worked for me. However, is this a bug ? Thanks, Hirak Hi Hirak, Can you send us c.txt as described in http://www.mpich.org/static/downloads/3.1.3/mpich-3.1.3-installguide.pdf? In the meantime, you can also try configure with '--without-valgrind?. I think it can be a problem of valgrind version. Regards, Sangmin On Oct 31, 2014, at 5:44 AM, Hirak Roy >> wrote: Hi Team, I could not complete the build. Could you please let me know if it is config issue or anything else. 
Thanks, Hirak make all-recursive make[1]: Entering directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3' Making all in /net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3/src/mpl make[2]: Entering directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3/src/mpl' CC src/mpltrmem.lo src/mpltrmem.c: In function 'MPL_trdump': src/mpltrmem.c:657:9: error: 'old_head' undeclared (first use in this function) src/mpltrmem.c:657:9: note: each undeclared identifier is reported only once for each function it appears in make[2]: *** [src/mpltrmem.lo] Error 1 make[2]: Leaving directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3/src/mpl' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3' make: *** [all] Error 2 My linux distribution : Linux zinblade62 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux Config settings: Configuring with : mpich2 version = 3.1.3 platform = linux_x86_64 Source directory = /net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3 Install directory = /home/hroy/local//mpich-3.1.3/linux_x86_64 CC = /u/prod/gnu/gcc/20121129/gcc-4.5.0-linux_x86_64/bin/gcc CFLAGS = -O3 -fPIC CXXFLAGS = -O3 -fPIC LDFLAGS = CONFIG_OPTIONS = --disable-f77 --disable-fc --disable-f90modules --disable-cxx --enable-fast=nochkmsg --enable-fast=notiming --enable-fast=ndebug --enable-fast=O3 --with-device=ch3:sock --enable-g=dbg --disable-fortran _______________________________________________ discuss mailing list discuss at mpich.org> To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From sseo at anl.gov Sun Nov 2 09:53:42 2014 From: sseo at anl.gov (Seo, Sangmin) Date: Sun, 2 Nov 2014 15:53:42 +0000 Subject: [mpich-discuss] Build Problem mpich-3.1.3 In-Reply-To: <8B38871795FD7042B826C1D1670246FBE240BC2D@EU-MBX-01.mgc.mentorg.com> References: <8B38871795FD7042B826C1D1670246FBE240BC2D@EU-MBX-01.mgc.mentorg.com> Message-ID: <469E36D0-19E3-4BDF-BA15-1DAD5BA773C2@anl.gov> Hi Hirak, It may not be a bug. It happens when the valgrind headers are broken or too old. Can you send us c.txt (configure logs) that you got on the compilation error? Regards, Sangmin On Nov 2, 2014, at 8:52 AM, Roy, Hirak > wrote: Thanks Sangmin. ??without-valgrind? worked for me. However, is this a bug ? Thanks, Hirak Hi Hirak, Can you send us c.txt as described in http://www.mpich.org/static/downloads/3.1.3/mpich-3.1.3-installguide.pdf? In the meantime, you can also try configure with '--without-valgrind?. I think it can be a problem of valgrind version. Regards, Sangmin On Oct 31, 2014, at 5:44 AM, Hirak Roy >> wrote: Hi Team, I could not complete the build. Could you please let me know if it is config issue or anything else. 
Thanks, Hirak make all-recursive make[1]: Entering directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3' Making all in /net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3/src/mpl make[2]: Entering directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3/src/mpl' CC src/mpltrmem.lo src/mpltrmem.c: In function 'MPL_trdump': src/mpltrmem.c:657:9: error: 'old_head' undeclared (first use in this function) src/mpltrmem.c:657:9: note: each undeclared identifier is reported only once for each function it appears in make[2]: *** [src/mpltrmem.lo] Error 1 make[2]: Leaving directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3/src/mpl' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3' make: *** [all] Error 2 My linux distribution : Linux zinblade62 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux Config settings: Configuring with : mpich2 version = 3.1.3 platform = linux_x86_64 Source directory = /net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3 Install directory = /home/hroy/local//mpich-3.1.3/linux_x86_64 CC = /u/prod/gnu/gcc/20121129/gcc-4.5.0-linux_x86_64/bin/gcc CFLAGS = -O3 -fPIC CXXFLAGS = -O3 -fPIC LDFLAGS = CONFIG_OPTIONS = --disable-f77 --disable-fc --disable-f90modules --disable-cxx --enable-fast=nochkmsg --enable-fast=notiming --enable-fast=ndebug --enable-fast=O3 --with-device=ch3:sock --enable-g=dbg --disable-fortran _______________________________________________ discuss mailing list discuss at mpich.org> To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From Hirak_Roy at mentor.com Sun Nov 2 08:52:25 2014 From: Hirak_Roy at mentor.com (Roy, Hirak) Date: Sun, 2 Nov 2014 14:52:25 +0000 Subject: [mpich-discuss] Build Problem mpich-3.1.3 Message-ID: <8B38871795FD7042B826C1D1670246FBE240BC2D@EU-MBX-01.mgc.mentorg.com> Thanks Sangmin. '-without-valgrind' worked for me. However, is this a bug ? Thanks, Hirak Hi Hirak, Can you send us c.txt as described in http://www.mpich.org/static/downloads/3.1.3/mpich-3.1.3-installguide.pdf? In the meantime, you can also try configure with '--without-valgrind'. I think it can be a problem of valgrind version. Regards, Sangmin On Oct 31, 2014, at 5:44 AM, Hirak Roy >> wrote: Hi Team, I could not complete the build. Could you please let me know if it is config issue or anything else. 
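For anyone hitting the same src/mpltrmem.c "'old_head' undeclared" failure shown in the log below, the workaround discussed in this thread is simply to reconfigure with valgrind support disabled. A sketch reusing the configure options quoted in this thread (the install prefix is illustrative; --without-valgrind is the only addition):

    ./configure --prefix=$HOME/local/mpich-3.1.3/linux_x86_64 \
        --disable-fortran --disable-cxx --with-device=ch3:sock \
        --enable-fast=O3 --enable-g=dbg --without-valgrind
    make && make install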
Thanks, Hirak make all-recursive make[1]: Entering directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3' Making all in /net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3/src/mpl make[2]: Entering directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3/src/mpl' CC src/mpltrmem.lo src/mpltrmem.c: In function 'MPL_trdump': src/mpltrmem.c:657:9: error: 'old_head' undeclared (first use in this function) src/mpltrmem.c:657:9: note: each undeclared identifier is reported only once for each function it appears in make[2]: *** [src/mpltrmem.lo] Error 1 make[2]: Leaving directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3/src/mpl' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3' make: *** [all] Error 2 My linux distribution : Linux zinblade62 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux Config settings: Configuring with : mpich2 version = 3.1.3 platform = linux_x86_64 Source directory = /net/zinzeng17/export/hroy/software/MPI/mpich-3.1.3 Install directory = /home/hroy/local//mpich-3.1.3/linux_x86_64 CC = /u/prod/gnu/gcc/20121129/gcc-4.5.0-linux_x86_64/bin/gcc CFLAGS = -O3 -fPIC CXXFLAGS = -O3 -fPIC LDFLAGS = CONFIG_OPTIONS = --disable-f77 --disable-fc --disable-f90modules --disable-cxx --enable-fast=nochkmsg --enable-fast=notiming --enable-fast=ndebug --enable-fast=O3 --with-device=ch3:sock --enable-g=dbg --disable-fortran _______________________________________________ discuss mailing list discuss at mpich.org> To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 22:02:24 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 22:02:24 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: References: <5474EAA0.9090800@mcs.anl.gov> <78B3A627-D829-40A5-98E6-632502B10F74@anl.gov> Message-ID: Same type of problem. seems some problem with the network, but as I mentioned I run openmpi on it perfectly, both TCP and infiniband. machines are not behind a firewall. Same problem even if I run mpirun on one of the nodes. 
(not headnode) Fatal error in MPI_Send: Unknown error class, error stack: MPI_Send(174)..............: MPI_Send(buf=0x7fff9cb16128, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed MPID_nem_tcp_connpoll(1832): Communication error with rank 1: Connection refused =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 1438 RUNNING AT oakmnt-0-a = EXIT CODE: 1 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== [proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed [proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status [proxy:0:1 at oakmnt-0-b] main (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event [mpiexec at vulcan13] HYDT_bscu_wait_for_completion (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting [mpiexec at vulcan13] HYDT_bsci_wait_for_completion (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion [mpiexec at vulcan13] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion [mpiexec at vulcan13] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 9:53 PM, Junchao Zhang wrote: > Is the failure specific to MPI_Allreduce? Did other tests (like simple > send/recv) work? > > --Junchao Zhang > > On Tue, Nov 25, 2014 at 9:41 PM, Amin Hassani > wrote: > >> Is there any debugging flag that I can turn on to figure out problems? >> >> Thanks. >> >> Amin Hassani, >> CIS department at UAB, >> Birmingham, AL, USA. >> >> On Tue, Nov 25, 2014 at 9:31 PM, Amin Hassani >> wrote: >> >>> Now I'm getting this error with MPICH-3.2a2 >>> Any thought? >>> >>> ?$ mpirun -hostfile hosts-hydra -np 2 test_dup >>> Fatal error in MPI_Allreduce: Unknown error class, error stack: >>> MPI_Allreduce(912)....................: >>> MPI_Allreduce(sbuf=0x7fffa5240e60, rbuf=0x7fffa5240e68, count=1, >>> MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed >>> MPIR_Allreduce_impl(769)..............: >>> MPIR_Allreduce_intra(419).............: >>> MPIDU_Complete_posted_with_error(1192): Process failed >>> Fatal error in MPI_Allreduce: Unknown error class, error stack: >>> MPI_Allreduce(912)....................: >>> MPI_Allreduce(sbuf=0x7fffaf6ef070, rbuf=0x7fffaf6ef078, count=1, >>> MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD) failed >>> MPIR_Allreduce_impl(769)..............: >>> MPIR_Allreduce_intra(419).............: >>> MPIDU_Complete_posted_with_error(1192): Process failed >>> >>> >>> =================================================================================== >>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >>> = PID 451 RUNNING AT oakmnt-0-a >>> = EXIT CODE: 1 >>> = CLEANING UP REMAINING PROCESSES >>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >>> >>> =================================================================================== >>> ? >>> >>> Thanks. >>> >>> Amin Hassani, >>> CIS department at UAB, >>> Birmingham, AL, USA. 
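Two quick checks that are often useful for this kind of "Connection refused" failure, using standard Hydra options and the hostfile and program names quoted in this thread:

    # have Hydra describe how it launches the proxies on each host
    mpirun -verbose -hostfile hosts-hydra -np 2 ./test_dup
    # confirm that plain remote process launch works on both nodes before involving MPI communication
    mpirun -hostfile hosts-hydra -np 2 hostname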
>>> >>> On Tue, Nov 25, 2014 at 9:25 PM, Amin Hassani >>> wrote: >>> >>>> Ok, I'll try to test the alpha version. I'll let you know the results. >>>> >>>> Thank you. >>>> >>>> Amin Hassani, >>>> CIS department at UAB, >>>> Birmingham, AL, USA. >>>> >>>> On Tue, Nov 25, 2014 at 9:21 PM, Bland, Wesley B. >>>> wrote: >>>> >>>>> It?s hard to tell then. Other than some problems compiling (not >>>>> declaring all of your variables), everything seems ok. Can you try running >>>>> with the most recent alpha. I have no idea what bug we could have fixed >>>>> here to make things work, but it?d be good to eliminate the possibility. >>>>> >>>>> Thanks, >>>>> Wesley >>>>> >>>>> On Nov 25, 2014, at 10:11 PM, Amin Hassani >>>>> wrote: >>>>> >>>>> Here I attached config.log exits in the root folder where it is >>>>> compiled. I'm not too familiar with MPICH but, there are other config.logs >>>>> in other directories also but not sure if you needed them too. >>>>> I don't have any specific environment variable that can relate to >>>>> MPICH. Also tried with >>>>> export HYDRA_HOST_FILE=
, >>>>> but have the same problem. >>>>> I don't do anything FT related in MPICH, I don't think this version of >>>>> MPICH has anything related to FT in it. >>>>> >>>>> Thanks. >>>>> >>>>> Amin Hassani, >>>>> CIS department at UAB, >>>>> Birmingham, AL, USA. >>>>> >>>>> On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. >>>>> wrote: >>>>> >>>>>> Can you also provide your config.log and any CVARs or other relevant >>>>>> environment variables that you might be setting (for instance, in relation >>>>>> to fault tolerance)? >>>>>> >>>>>> Thanks, >>>>>> Wesley >>>>>> >>>>>> >>>>>> On Nov 25, 2014, at 3:58 PM, Amin Hassani >>>>>> wrote: >>>>>> >>>>>> This is the simplest code I have that doesn't run. >>>>>> >>>>>> >>>>>> #include >>>>>> #include >>>>>> #include >>>>>> #include >>>>>> #include >>>>>> >>>>>> int main(int argc, char** argv) >>>>>> { >>>>>> int rank, size; >>>>>> int i, j, k; >>>>>> double t1, t2; >>>>>> int rc; >>>>>> >>>>>> MPI_Init(&argc, &argv); >>>>>> MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; >>>>>> MPI_Comm_rank(world, &rank); >>>>>> MPI_Comm_size(world, &size); >>>>>> >>>>>> t2 = 1; >>>>>> MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); >>>>>> t_avg = t_avg / size; >>>>>> >>>>>> MPI_Finalize(); >>>>>> >>>>>> return 0; >>>>>> }? >>>>>> >>>>>> Amin Hassani, >>>>>> CIS department at UAB, >>>>>> Birmingham, AL, USA. >>>>>> >>>>>> On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" < >>>>>> apenya at mcs.anl.gov> wrote: >>>>>> >>>>>>> >>>>>>> Hi Amin, >>>>>>> >>>>>>> Can you share with us a minimal piece of code with which you can >>>>>>> reproduce this issue? >>>>>>> >>>>>>> Thanks, >>>>>>> Antonio >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 11/25/2014 12:52 PM, Amin Hassani wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> I am having problem running MPICH, on multiple nodes. When I run >>>>>>> an multiple MPI processes on one node, it totally works, but when I try to >>>>>>> run on multiple nodes, it fails with the error below. >>>>>>> My machines have Debian OS, Both infiniband and TCP interconnects. >>>>>>> I'm guessing it has something do to with the TCP network, but I can run >>>>>>> openmpi on these machines with no problem. But for some reason I cannot run >>>>>>> MPICH on multiple nodes. Please let me know if more info is needed from my >>>>>>> side. I'm guessing there are some configuration that I am missing. I used >>>>>>> MPICH 3.1.3 for this test. I googled this problem but couldn't find any >>>>>>> solution. >>>>>>> >>>>>>> ?In my MPI program, I am doing a simple allreduce over >>>>>>> MPI_COMM_WORLD?. >>>>>>> >>>>>>> ?my host file (hosts-hydra) is something like this: >>>>>>> oakmnt-0-a:1 >>>>>>> oakmnt-0-b:1 ? >>>>>>> >>>>>>> ?I get this error:? 
>>>>>>> >>>>>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup >>>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >>>>>>> status->MPI_TAG == recvtag >>>>>>> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >>>>>>> status->MPI_TAG == recvtag >>>>>>> internal ABORT - process 1 >>>>>>> internal ABORT - process 0 >>>>>>> >>>>>>> >>>>>>> =================================================================================== >>>>>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >>>>>>> = PID 30744 RUNNING AT oakmnt-0-b >>>>>>> = EXIT CODE: 1 >>>>>>> = CLEANING UP REMAINING PROCESSES >>>>>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >>>>>>> >>>>>>> =================================================================================== >>>>>>> [mpiexec at vulcan13] HYDU_sock_read >>>>>>> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file >>>>>>> descriptor) >>>>>>> [mpiexec at vulcan13] control_cb >>>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read >>>>>>> command from proxy >>>>>>> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event >>>>>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned >>>>>>> error status >>>>>>> [mpiexec at vulcan13] HYD_pmci_wait_for_completion >>>>>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for >>>>>>> event >>>>>>> [mpiexec at vulcan13] main >>>>>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error >>>>>>> waiting for completion >>>>>>> >>>>>>> Thanks. >>>>>>> Amin Hassani, >>>>>>> CIS department at UAB, >>>>>>> Birmingham, AL, USA. >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> discuss mailing list discuss at mpich.org >>>>>>> To manage subscription options or unsubscribe:https://lists.mpich.org/mailman/listinfo/discuss >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Antonio J. Pe?a >>>>>>> Postdoctoral Appointee >>>>>>> Mathematics and Computer Science Division >>>>>>> Argonne National Laboratory >>>>>>> 9700 South Cass Avenue, Bldg. 240, Of. 
3148 >>>>>>> Argonne, IL 60439-4847apenya at mcs.anl.govwww.mcs.anl.gov/~apenya >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> discuss mailing list discuss at mpich.org >>>>>>> To manage subscription options or unsubscribe: >>>>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> discuss mailing list discuss at mpich.org >>>>>> To manage subscription options or unsubscribe: >>>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> discuss mailing list discuss at mpich.org >>>>>> To manage subscription options or unsubscribe: >>>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>>> >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list discuss at mpich.org >>>>> To manage subscription options or unsubscribe: >>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> discuss mailing list discuss at mpich.org >>>>> To manage subscription options or unsubscribe: >>>>> https://lists.mpich.org/mailman/listinfo/discuss >>>>> >>>> >>>> >>> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From victor.vysotskiy at teokem.lu.se Fri Nov 28 03:57:41 2014 From: victor.vysotskiy at teokem.lu.se (Victor Vysotskiy) Date: Fri, 28 Nov 2014 09:57:41 +0000 Subject: [mpich-discuss] Problem with MPICH3/OpenPA on IBM P755 Message-ID: <8D58A4B5E6148C419C6AD6334962375DDD9AE3F6@UWMBX04.uw.lu.se> Hi Pavan, enclosed please find the requested config.log file. BTW, I am trying to compile MPICH3 on P775 (Power7 running under AIX) rather than on P755. However, I can check the issue on P755 (Power7 running under Linux) too, if it is needed. With best regards, Victor. -------------- next part -------------- A non-text attachment was scrubbed... Name: config.log Type: text/x-log Size: 78264 bytes Desc: config.log URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From ahassani at cis.uab.edu Tue Nov 25 21:11:18 2014 From: ahassani at cis.uab.edu (Amin Hassani) Date: Tue, 25 Nov 2014 21:11:18 -0600 Subject: [mpich-discuss] having problem running MPICH on multiple nodes In-Reply-To: <78B3A627-D829-40A5-98E6-632502B10F74@anl.gov> References: <5474EAA0.9090800@mcs.anl.gov> <78B3A627-D829-40A5-98E6-632502B10F74@anl.gov> Message-ID: Here I attached config.log exits in the root folder where it is compiled. I'm not too familiar with MPICH but, there are other config.logs in other directories also but not sure if you needed them too. I don't have any specific environment variable that can relate to MPICH. 
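For completeness, a self-contained version of the small test program quoted in this thread. The list archiver stripped the header names out of the #include lines, and t_avg is used without being declared (the compile issue Wesley mentioned), so both are filled in here as a best guess; the MPI calls themselves are unchanged:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double t2 = 1.0, t_avg = 0.0;   /* t_avg declared here; it was missing in the original */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t_avg = t_avg / size;
        printf("rank %d of %d: t_avg = %f\n", rank, size, t_avg);

        MPI_Finalize();
        return 0;
    }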
Also tried with export HYDRA_HOST_FILE=
, but have the same problem. I don't do anything FT related in MPICH, I don't think this version of MPICH has anything related to FT in it. Thanks. Amin Hassani, CIS department at UAB, Birmingham, AL, USA. On Tue, Nov 25, 2014 at 9:02 PM, Bland, Wesley B. wrote: > Can you also provide your config.log and any CVARs or other relevant > environment variables that you might be setting (for instance, in relation > to fault tolerance)? > > Thanks, > Wesley > > > On Nov 25, 2014, at 3:58 PM, Amin Hassani wrote: > > This is the simplest code I have that doesn't run. > > > #include > #include > #include > #include > #include > > int main(int argc, char** argv) > { > int rank, size; > int i, j, k; > double t1, t2; > int rc; > > MPI_Init(&argc, &argv); > MPI_Comm world = MPI_COMM_WORLD, newworld, newworld2; > MPI_Comm_rank(world, &rank); > MPI_Comm_size(world, &size); > > t2 = 1; > MPI_Allreduce(&t2, &t_avg, 1, MPI_DOUBLE, MPI_SUM, world); > t_avg = t_avg / size; > > MPI_Finalize(); > > return 0; > }? > > Amin Hassani, > CIS department at UAB, > Birmingham, AL, USA. > > On Tue, Nov 25, 2014 at 2:46 PM, "Antonio J. Pe?a" > wrote: > >> >> Hi Amin, >> >> Can you share with us a minimal piece of code with which you can >> reproduce this issue? >> >> Thanks, >> Antonio >> >> >> >> On 11/25/2014 12:52 PM, Amin Hassani wrote: >> >> Hi, >> >> I am having problem running MPICH, on multiple nodes. When I run an >> multiple MPI processes on one node, it totally works, but when I try to run >> on multiple nodes, it fails with the error below. >> My machines have Debian OS, Both infiniband and TCP interconnects. I'm >> guessing it has something do to with the TCP network, but I can run openmpi >> on these machines with no problem. But for some reason I cannot run MPICH >> on multiple nodes. Please let me know if more info is needed from my side. >> I'm guessing there are some configuration that I am missing. I used MPICH >> 3.1.3 for this test. I googled this problem but couldn't find any solution. >> >> ?In my MPI program, I am doing a simple allreduce over MPI_COMM_WORLD?. >> >> ?my host file (hosts-hydra) is something like this: >> oakmnt-0-a:1 >> oakmnt-0-b:1 ? >> >> ?I get this error:? 
>> >> $ mpirun -hostfile hosts-hydra -np 2 test_dup >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >> status->MPI_TAG == recvtag >> Assertion failed in file ../src/mpi/coll/helper_fns.c at line 490: >> status->MPI_TAG == recvtag >> internal ABORT - process 1 >> internal ABORT - process 0 >> >> >> =================================================================================== >> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES >> = PID 30744 RUNNING AT oakmnt-0-b >> = EXIT CODE: 1 >> = CLEANING UP REMAINING PROCESSES >> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES >> >> =================================================================================== >> [mpiexec at vulcan13] HYDU_sock_read >> (../../../../src/pm/hydra/utils/sock/sock.c:239): read error (Bad file >> descriptor) >> [mpiexec at vulcan13] control_cb >> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:199): unable to read >> command from proxy >> [mpiexec at vulcan13] HYDT_dmxu_poll_wait_for_event >> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned >> error status >> [mpiexec at vulcan13] HYD_pmci_wait_for_completion >> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for >> event >> [mpiexec at vulcan13] main >> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error >> waiting for completion >> >> Thanks. >> Amin Hassani, >> CIS department at UAB, >> Birmingham, AL, USA. >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe:https://lists.mpich.org/mailman/listinfo/discuss >> >> >> >> -- >> Antonio J. Pe?a >> Postdoctoral Appointee >> Mathematics and Computer Science Division >> Argonne National Laboratory >> 9700 South Cass Avenue, Bldg. 240, Of. 3148 >> Argonne, IL 60439-4847apenya at mcs.anl.govwww.mcs.anl.gov/~apenya >> >> >> _______________________________________________ >> discuss mailing list discuss at mpich.org >> To manage subscription options or unsubscribe: >> https://lists.mpich.org/mailman/listinfo/discuss >> > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > > > > _______________________________________________ > discuss mailing list discuss at mpich.org > To manage subscription options or unsubscribe: > https://lists.mpich.org/mailman/listinfo/discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: config.log Type: text/x-log Size: 495437 bytes Desc: not available URL: -------------- next part -------------- _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss From fisaila at mcs.anl.gov Tue Nov 11 17:00:50 2014 From: fisaila at mcs.anl.gov (Isaila, Florin D.) 
Date: Tue, 11 Nov 2014 23:00:50 +0000 Subject: [mpich-discuss] mpiexec and DISPLAY for remote hosts In-Reply-To: <6F4D5A685397B940825208C64CF853A7477F85EA@HALAS.anl.gov> References: <6F4D5A685397B940825208C64CF853A7477F85D3@HALAS.anl.gov>, <6F4D5A685397B940825208C64CF853A7477F85EA@HALAS.anl.gov> Message-ID: <6F4D5A685397B940825208C64CF853A7477F8601@HALAS.anl.gov> Hi, I am having troubles running mpiexec (from MPICH 3.1) on remote hosts with xterm (for debugging purposes) from an Ubuntu box. The following command works perfectly in local: mpiexec -n 1 xterm However, if I use a machine file: mpiexec -f macfile -n 1 xterm xterm Xt error: Can't open display: xterm: DISPLAY is not set =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = PID 19225 RUNNING AT thwomp.mcs.anl.gov = EXIT CODE: 1 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== X forwarding is activated: fisaila at howard:bin$ grep X11Forwarding /etc/ssh/sshd_config X11Forwarding yes The following command works: fisaila at howard:bin$ ssh thwomp.mcs.anl.gov xterm Any suggestion how to solve this? Thanks Florin -------------- next part -------------- An HTML attachment was scrubbed... URL: From balaji at anl.gov Sun Nov 16 11:35:44 2014 From: balaji at anl.gov (Balaji, Pavan) Date: Sun, 16 Nov 2014 17:35:44 +0000 Subject: [mpich-discuss] Announcing the availability of mpich-3.2a2 Message-ID: <80E2961E-3285-492E-A3C9-21E95A02363C@anl.gov> The MPICH team is pleased to announce the availability of a new preview release (mpich-3.2a2). This preview release adds several capabilities including support for the proposed MPI-3.1 standard (contains nonblocking collective I/O), full Fortran 2008 support (enabled by default), support for the Mellanox MXM interface for InfiniBand, support for the Mellanox HCOLL interface for collective communication, support for OFED InfiniBand for Xeon and Xeon Phi architectures, and significant improvements to the MPICH/portals4 implementation. These features represent a subset of those planned for the 3.2.x release series. Complete list of features planned for mpich-3.2: https://trac.mpich.org/projects/mpich/roadmap MPICH website: http://www.mpich.org A list of changes in this release is included at the end of this email. Regards, The MPICH Team =============================================================================== Changes in 3.2a2 =============================================================================== # Added support for proposed MPI-3.1 features including nonblocking collective I/O, address manipulation routines, thread-safety for MPI initialization and pre-init functionality. Proposed MPI-3.1 features are implemented as MPIX_ functions and the MPI version number is still set to 3.0. # Fortran 2008 bindings are enabled by default and fully supported. # Added support for the Mellanox MXM InfiniBand interface. (thanks to Mellanox for the code contribution). # Added support for the Mellanox HCOLL interface for collectives. (thanks to Mellanox for the code contribution). # Added support for OFED IB on Xeon and Xeon Phi. (thanks to RIKEN and University of Tokyo for the contribution). # Significant stability improvements to the MPICH/portals4 implementation. # Several other minor bug fixes, memory leak fixes, and code cleanup. 
A full list of changes is available at the following link: http://git.mpich.org/mpich.git/shortlog/v3.1.2..v3.2a2 A full list of bugs that have been fixed is available at the following link: https://trac.mpich.org/projects/mpich/query?status=closed&group=resolution&milestone=mpich-3.2 -- Pavan Balaji ?? http://www.mcs.anl.gov/~balaji From fisaila at mcs.anl.gov Mon Nov 17 15:28:43 2014 From: fisaila at mcs.anl.gov (Isaila, Florin D.) Date: Mon, 17 Nov 2014 21:28:43 +0000 Subject: [mpich-discuss] f77 bindings and profiling Message-ID: <6F4D5A685397B940825208C64CF853A7477F881F@HALAS.anl.gov> Hi , I am trying to use MPI profiling to make mpi_init from a F77 program call my MPI_Init (written in C), but I do not manage to achieve that. In this simple F77 program: program main include 'mpif.h' integer error call mpi_init(error) call mpi_finalize(error) end I try to make the mpi_init call: int MPI_Init (int *argc, char ***argv){ int ret; printf("My function!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"); ret = PMPI_Init(argc, argv); return ret; } My MPI_Init belongs to a library libtarget.a I created. I use -profile for compiling and I created the target.conf containing: PROFILE_PRELIB="-L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget" in the right place. The library appears in the command line before the mpich library: mpif77 -show -g -profile=target init_finalize.f -o init_finalize gfortran -g init_finalize.f -o init_finalize -I/homes/fisaila/software/mpich/include -L/homes/fisaila/software/mpich/lib -L/homes/fisaila/benches/try/staticlib_mpif77 -ltarget -Wl,-rpath -Wl,/homes/fisaila/software/mpich/lib -lmpich -lopa -lmpl -lrt -lpthread However, the program never gets into my MPI_Init. Any suggestion about what I am missing? Thanks Florin -------------- next part -------------- An HTML attachment was scrubbed... URL: From fisaila at mcs.anl.gov Thu Nov 20 16:20:18 2014 From: fisaila at mcs.anl.gov (Isaila, Florin D.) Date: Thu, 20 Nov 2014 22:20:18 +0000 Subject: [mpich-discuss] MPI_Iprobe bug in MPICH for BGQ? Message-ID: <6F4D5A685397B940825208C64CF853A747800895@HALAS.anl.gov> Hi, when I run the program from below on 1 node on BGQ (Vesta), the message is not received (flag is 0). However on a Ubuntu, the message is received (flag is non-zero). If I add another Iprobe (uncomment the Iprobe in the code below) the message is received on both BGQ and Ubuntu. Note that the program sleeps for 1 second after the Isend. Is it a bug? This happens for both MPICH-3.1.3 and MPICH-3.1. #include "mpi.h" #include #include int main(int argc, char **argv) { int send_int, recv_int, tag, flag; MPI_Status status; MPI_Request req; MPI_Init(&argc, &argv); tag = 0; send_int = 100; MPI_Isend(&send_int, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &req ); sleep(1); MPI_Iprobe(MPI_ANY_SOURCE , MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status ); //MPI_Iprobe(MPI_ANY_SOURCE , tag, MPI_COMM_WORLD, &flag, &status ); if (flag) { MPI_Recv( &recv_int, 1, MPI_INT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status); printf("Received = %d\n", recv_int); } else printf("Message not received yet"); MPI_Waitall(1, &req, MPI_STATUSES_IGNORE); MPI_Finalize(); return 0; } Thanks Florin -------------- next part -------------- An HTML attachment was scrubbed... URL:
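A note on the MPI_Iprobe program above: the MPI standard only guarantees that a matching message is eventually reported if MPI_Iprobe is called repeatedly; a single call after a sleep may legitimately return flag == 0 on implementations that make communication progress only inside MPI calls, which would explain why adding a second Iprobe changes the behavior. A sketch of the usual probe loop, with the same arguments as in the program above:

    int flag = 0;
    MPI_Status status;
    do {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
    } while (!flag);

And on the earlier f77 profiling question in this digest: a commonly used approach when a C-only MPI_Init wrapper is not triggered from Fortran is to interpose the Fortran entry point as well and forward to the Fortran PMPI symbol. The sketch below is an illustration built on assumptions, not MPICH-specific advice: it assumes gfortran's default name mangling (lowercase name plus a trailing underscore) and that the MPI library exports a pmpi_init_ profiling symbol.

    /* fortran_wrap.c -- link ahead of the MPI library, e.g. via PROFILE_PRELIB */
    #include <stdio.h>
    #include <mpi.h>

    void pmpi_init_(MPI_Fint *ierr);    /* Fortran-callable profiling entry point (assumed name) */

    void mpi_init_(MPI_Fint *ierr)
    {
        printf("profiling wrapper: mpi_init called from Fortran\n");
        pmpi_init_(ierr);               /* forward to the real Fortran binding */
    }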