[mpich-discuss] Fatal error in PMPI_Reduce

Michael Colonno mcolonno at stanford.edu
Fri Jan 11 17:02:39 CST 2013


	With the configure options I used, mpiexec / mpirun are not built
or installed (presumably because --with-pm=no hands process management
off to the SLURM interface). 
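
	For reference, a minimal sketch of a rebuild that would produce
Hydra's mpiexec (assuming the same install prefix and the default Hydra
process manager, which detects SLURM at run time):

$ # default build: Hydra process manager provides mpiexec
$ ./configure --prefix=/usr/local/apps/MPICH2
$ make && make install
$ /usr/local/apps/MPICH2/bin/mpiexec -n 32 /usr/local/apps/cxxcpi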

	Thanks,
	~Mike C. 

-----Original Message-----
From: discuss-bounces at mpich.org [mailto:discuss-bounces at mpich.org] On Behalf
Of Pavan Balaji
Sent: Friday, January 11, 2013 2:20 PM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] Fatal error in PMPI_Reduce


FYI, I suggested this because mpiexec will automatically detect and use
SLURM internally.
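
For example, from inside an existing SLURM allocation (a sketch; the
allocation sizes below are only placeholders):

$ salloc -N 4 --ntasks-per-node=8       # get an allocation from SLURM
$ mpiexec -n 32 /usr/local/apps/cxxcpi  # Hydra finds the allocated nodes

No host file is needed in that case.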

 -- Pavan

On 01/11/2013 04:19 PM US Central Time, Pavan Balaji wrote:
> Michael,
> 
> Did you try just using mpiexec?
> 
> mpiexec -n 32 /usr/local/apps/cxxcpi
> 
>  -- Pavan
> 
> On 01/11/2013 03:31 PM US Central Time, Michael Colonno wrote:
>>             Hi All ~
>>
>>             I've compiled MPICH2 3.0 with the Intel compiler (v. 13)
>> on a CentOS 6.3 x64 system using SLURM as the process manager. My
>> configure was simply:
>>
>> ./configure --with-pmi=slurm --with-pm=no
>> --prefix=/usr/local/apps/MPICH2
>>
>> No errors during build or install. When I compile and run the example
>> program cxxcpi I get (truncated):
>>
>> $ srun -n32 /usr/local/apps/cxxcpi
>>
>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4ad18120,
>> rbuf=0x7fff4ad18128, count=1, MPI_DOUBLE, MPI_SUM, root=0,
>> MPI_COMM_WORLD) failed
>> MPIR_Reduce_impl(1029)..........:
>> MPIR_Reduce_intra(779)..........:
>> MPIR_Reduce_impl(1029)..........:
>> MPIR_Reduce_intra(835)..........:
>> MPIR_Reduce_binomial(144).......:
>> MPIDI_CH3U_Recvq_FDU_or_AEP(612): Communication error with rank 16
>> MPIR_Reduce_intra(799)..........:
>> MPIR_Reduce_impl(1029)..........:
>> MPIR_Reduce_intra(835)..........:
>> MPIR_Reduce_binomial(206).......: Failure during collective
>> srun: error: task 0: Exited with exit code 1
>>
>>             This error is experienced with many of my MPI programs. A
>> different application yields:
>>
>> PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7fff545be5fc, count=1,
>> MPI_INT, root=0, MPI_COMM_WORLD) failed
>> MPIR_Bcast_impl(1369).:
>> MPIR_Bcast_intra(1160):
>> MPIR_SMP_Bcast(1077)..: Failure during collective
>>
>>             Can anyone point me in the right direction?
>>
>>             Thanks,
>>             ~Mike C.
> 

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss



