[mpich-discuss] Fatal error in PMPI_Bcast

James Long jlong at iarc.uaf.edu
Fri Jul 11 13:25:42 CDT 2014


With mpich-3.1.1 compiled for use with slurm as shown in the thread below (config.log at http://pastebin.com/aB8CZsr0), I get the following error:

$ srun -n 64 -l ./mpi_wrapper -dl cfile
00: Fatal error in PMPI_Bcast: Other MPI error, error stack:
00: PMPI_Bcast(1607)......: MPI_Bcast(buf=0x6104e0, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
00: MPIR_Bcast_impl(1459).: 
00: MPIR_Bcast(1483)......: 
00: MPIR_Bcast_intra(1254): 
00: MPIR_SMP_Bcast(1168)..: Failure during collective
.
.
.
srun: error: 10_4_5_11: tasks 0-31: Exited with exit code 1


The fatal error is the same for tasks 0-31 on the first node (10_4_5_11); tasks 32-63 on the second node report no error. The code runs fine if I use only one node for srun (-n 32), and it works fine across two nodes if I compile mpich without the slurm flags (PGI compiler in all cases) and launch with mpirun. openmpi compiled with slurm support also works OK across the two nodes, so is the problem something to do with the slurm pmi library and mpich? I'm using slurm 2.6.7-1 and libpmi0 2.6.7-1 under debian.
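
To double-check which tasks actually reported the fatal error, the task labels that 'srun -l' prepends to each line can be tallied. This is only a minimal sketch: it assumes the job output was saved to a file, and the function name and the file name srun.log are hypothetical.

```shell
#!/bin/sh
# failed_tasks: print the distinct task labels (the "NN:" prefix added by
# `srun -l`) of every line that reports an MPI fatal error, one per line.
# $1 is the saved job output; the file name is chosen by the caller.
failed_tasks() {
    grep 'Fatal error in' "$1" \
        | sed -n 's/^ *\([0-9][0-9]*\): .*/\1/p' \
        | sort -n -u
}
```

Something like 'srun -n 64 -l ./mpi_wrapper -dl cfile > srun.log 2>&1' followed by 'failed_tasks srun.log' should then list the failing labels; it only summarizes the log and says nothing about the cause.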

Jim

On Jul 9, 2014, at 4:39 PM, James Long wrote:

> Commenting out the -pthread flag in /usr/lib/libpmi.la allows it to compile OK. But while 'srun' successfully runs mpi jobs on one 64-core node, the following error appears when trying to run on 2 nodes, and the job hangs:
> 
> srun: error: slurm_send_recv_rc_msg_only_one: Connection timed out
> 
> No help from google...
> 
> Thanks,
> 
> Jim
> 
> On Jul 9, 2014, at 3:37 PM, Kenneth Raffenetti wrote:
> 
>> Libtool is adding the -pthread flag because it exists in the "inherited_linker_flags" in libpmi.la. However, PGI clearly doesn't understand it. The libtool devs called this issue a bug when it was reported here:
>> 
>> http://lists.gnu.org/archive/html/libtool/2010-11/msg00034.html
>> 
>> You could either remove the flag from the /usr/lib/libpmi.la file, or use the workaround suggested by Gus Correa to finish your build.
>> 
>> Ken
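
The first option (removing the flag from the .la file) can be sketched with sed. This is an illustration under assumptions, not a tested fix: the function name is hypothetical, it works on a copy rather than on /usr/lib/libpmi.la itself, and it assumes the flag appears on the inherited_linker_flags line as described above.

```shell
#!/bin/sh
# strip_pthread: write a copy of a libtool .la file with the -pthread
# token removed from its inherited_linker_flags line only.
# $1 = source .la file, $2 = destination file (both chosen by the caller).
# Other lines, such as dependency_libs, are passed through unchanged.
strip_pthread() {
    sed '/^inherited_linker_flags=/s/ *-pthread//g' "$1" > "$2"
}
```

For example, 'strip_pthread /usr/lib/libpmi.la ./libpmi.la.new', then compare the two files with diff before deciding whether to replace the system copy. Since the file presumably belongs to a distribution package, a later package upgrade may restore the flag.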
>> 
>> On 07/09/2014 05:56 PM, Gus Correa wrote:
>>> Hi James
>>> 
>>> Adding '-noswitcherror' to CFLAGS, FFLAGS, and FCFLAGS
>>> may be a workaround to this problem.
>>> 
>>> My two cents,
>>> Gus Correa
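
Applied to the configure command from the original message, the suggested workaround might look like the sketch below. The function only prints the command so it can be reviewed before running, and its name is hypothetical. FCFLAGS="-O2" is an assumption, since the original command did not set FCFLAGS at all.

```shell
#!/bin/sh
# print_configure: print (do not run) the configure command from this
# thread with -noswitcherror appended to CFLAGS, FFLAGS and FCFLAGS.
# Assumption: FCFLAGS="-O2" is added here; the original command left
# FCFLAGS unset.
print_configure() {
    cat <<'EOF'
env CC=pgcc FC=pgf90 CXX=pgCC CPPFLAGS="-DNDEBUG -DpgiFortran" \
    CFLAGS="-O2 -noswitcherror" FFLAGS="-O2 -w -noswitcherror" \
    FCFLAGS="-O2 -noswitcherror" \
    ./configure --prefix=/opt/mpich-slurm \
    --with-pmi=slurm --with-pm=no \
    --with-slurm-include=/usr/include/slurm \
    --with-slurm-lib=/usr/lib/slurm --enable-fortran=yes
EOF
}
```

With this flag the PGI compilers are expected to warn about the unknown -pthread switch instead of stopping, so the flag would still show up in the build output.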
>>> 
>>> On 07/09/2014 05:55 PM, James Long wrote:
>>>> I need to use srun to launch mpi jobs under slurm, so I configured
>>>> mpich-3.1.1 with
>>>> 
>>>> $ env CC=pgcc FC=pgf90 CXX=pgCC CPPFLAGS="-DNDEBUG -DpgiFortran"
>>>> CFLAGS="-O2" FFLAGS="-O2 -w" ./configure --prefix=/opt/mpich-slurm
>>>> --with-pmi=slurm --with-pm=no --with-slurm-include=/usr/include/slurm
>>>> --with-slurm-lib=/usr/lib/slurm --enable-fortran=yes
>>>> 
>>>> The following error occurs when compiling:
>>>> 
>>>>  CC       src/mpid/common/datatype/lib_libmpi_la-mpir_type_flatten.lo
>>>>  CC       src/mpid/common/sched/lib_libmpi_la-mpid_sched.lo
>>>>  CC       src/mpid/common/thread/lib_libmpi_la-mpid_thread.lo
>>>>  CC       src/mpi_t/lib_libmpi_la-mpit.lo
>>>>  CC       src/nameserv/pmi/lib_libmpi_la-pmi_nameserv.lo
>>>>  GEN      lib/libmpi.la
>>>> pgf90-Error-Unknown switch: -pthread
>>>> make[2]: *** [lib/libmpi.la] Error 1
>>>> make[2]: Leaving directory `/home/boss/mpich-3.1.1'
>>>> make[1]: *** [all-recursive] Error 1
>>>> make[1]: Leaving directory `/home/boss/mpich-3.1.1'
>>>> make: *** [all] Error 2
>>>> 
>>>> config.log is at http://pastebin.com/aB8CZsr0
>>>> 
>>>> Jim
>>>> --

--

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
James Long
International Arctic Research Center
University of Alaska Fairbanks
jlong|at|alaska.edu
(907) 474-2440
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%