[mpich-discuss] Problems running MPICH jobs under SLURM

Pavan Balaji balaji at mcs.anl.gov
Fri Jun 7 15:08:05 CDT 2013


Are you using the correct mpiexec?  Your submission script calls the
one at this path:

/home/biddisco/apps/mpich-3.0.4/bin/mpiexec
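
One rough way to check this is to pull the mpiexec path out of the job script and compare it with the install prefix of the build you expect to be running. The sketch below is only an illustration: the function names script_mpiexec and under_prefix are made up here, and the paths in the usage comments are placeholders for your own.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of MPICH or SLURM.

# Print the first word in job script $1 that is an mpiexec path,
# skipping lines that are shell comments or #SBATCH directives.
script_mpiexec() {
    grep -v '^[[:space:]]*#' "$1" \
        | tr -s '[:space:]' '\n' \
        | grep -E -m 1 '(^|/)mpiexec$'
}

# Succeed if path $1 lives inside "$2/bin", where $2 is an install prefix.
under_prefix() {
    case "$1" in
        "$2"/bin/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (job.sh and the prefix are placeholders):
#   exe=$(script_mpiexec job.sh)
#   under_prefix "$exe" "$HOME/apps/mpich-nightly" || echo "job script uses $exe"
#   "$exe" -info    # Hydra's mpiexec prints its build details
```

If the path reported is not under the prefix of the build you just installed, the job is still launching with the older mpiexec.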

  -- Pavan

On 06/07/2013 03:40 AM, Biddiscombe, John A. wrote:
> I downloaded the nightly tarball and recompiled/installed mpich (used
> mpich-master-v3.0.4-259-gf322ce79)
>
> I still get this (output below) with a simple hello world program.
>
> Now you must understand that I have no idea what I’m doing (really). I
> wanted to test some debugging features under slurm so installed slurm
> myself on a workstation with just 2 cores and have the bare minimum
> setup. I’m doing the following:
>
>     sudo munged &
>     sudo slurmd &
>     sudo slurmctld -D
>
> and then I can run jobs on the local machine and it seems to be ok,
> except that mpi jobs always give the double free error as below when run
> under slurm, but are just fine when run from the command line.
>
> My suspicion is that slurm is not actually using the hydra pm that I
> just compiled. I installed slurm from rpms. Should I recompile slurm
> myself and somehow tell it which mpi to use?
>
> My job script looks as follows:
>
> ######################
> #!/bin/bash
> #
> # Create the job script from the supplied parameters
> #
> #SBATCH --job-name=pvserver
> #SBATCH --time=00:04:00
> #SBATCH --nodes=1
> #SBATCH --partition=normal
> #SBATCH --output=/home/biddisco/slurm.out
> #SBATCH --error=/home/biddisco/slurm.err
> #SBATCH --mem=2048MB
>
> #export
> # echo "Path is $PATH"
> # echo "LD_LIBRARY_PATH is " $LD_LIBRARY_PATH
> # cd /home/biddisco/build/pv-38/bin/
> #export PMI_DEBUG=9
> #ulimit -s unlimited
> #ulimit -c 0
>
> /home/biddisco/apps/mpich-3.0.4/bin/mpiexec -rmk slurm -n 2 /home/biddisco/build/hello/hello
> ######################
>
> It gives the same result with or without the -rmk slurm and the #ulimit
> settings.
>
> Apologies for wasting your time, I’m certain I’m doing something wrong –
> I just don’t know what.
>
> JB
>
> biddisco@breno2 ~ $ more ~/slurm.err
>
> *** glibc detected *** /home/biddisco/build/hello/hello: double free or corruption (fasttop): 0x0000000001896340 ***
> ======= Backtrace: =========
> /lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7f9a1695cb96]
> /home/biddisco/build/hello/hello(MPIDI_Populate_vc_node_ids+0x3f9)[0x427c89]
> /home/biddisco/build/hello/hello(MPID_Init+0x136)[0x4253f6]
> /home/biddisco/build/hello/hello(MPIR_Init_thread+0x22f)[0x414cbf]
> /home/biddisco/build/hello/hello(MPI_Init+0xae)[0x4146ee]
> /home/biddisco/build/hello/hello(main+0x22)[0x413f2e]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f9a168ff76d]
> /home/biddisco/build/hello/hello[0x413e31]
>
> ======= Memory map: ========
> 00400000-0051a000 r-xp 00000000 08:01 8661191 /home/biddisco/build/hello/hello
> 0071a000-00727000 r--p 0011a000 08:01 8661191 /home/biddisco/build/hello/hello
> 00727000-00729000 rw-p 00127000 08:01 8661191 /home/biddisco/build/hello/hello
> 00729000-00751000 rw-p 00000000 00:00 0
> 01895000-018b6000 rw-p 00000000 00:00 0 [heap]
> 7f9a166c8000-7f9a166dd000 r-xp 00000000 08:01 9047556 /lib/x86_64-linux-gnu/libgcc_s.so.1
> 7f9a166dd000-7f9a168dc000 ---p 00015000 08:01 9047556 /lib/x86_64-linux-gnu/libgcc_s.so.1
> 7f9a168dc000-7f9a168dd000 r--p 00014000 08:01 9047556 /lib/x86_64-linux-gnu/libgcc_s.so.1
> 7f9a168dd000-7f9a168de000 rw-p 00015000 08:01 9047556 /lib/x86_64-linux-gnu/libgcc_s.so.1
> 7f9a168de000-7f9a16a93000 r-xp 00000000 08:01 9050358 /lib/x86_64-linux-gnu/libc-2.15.so
> 7f9a16a93000-7f9a16c92000 ---p 001b5000 08:01 9050358 /lib/x86_64-linux-gnu/libc-2.15.so
> 7f9a16c92000-7f9a16c96000 r--p 001b4000 08:01 9050358 /lib/x86_64-linux-gnu/libc-2.15.so
> 7f9a16c96000-7f9a16c98000 rw-p 001b8000 08:01 9050358 /lib/x86_64-linux-gnu/libc-2.15.so
> 7f9a16c98000-7f9a16c9d000 rw-p 00000000 00:00 0
> 7f9a16c9d000-7f9a16cb5000 r-xp 00000000 08:01 9050338 /lib/x86_64-linux-gnu/libpthread-2.15.so
> 7f9a16cb5000-7f9a16eb4000 ---p 00018000 08:01 9050338 /lib/x86_64-linux-gnu/libpthread-2.15.so
> 7f9a16eb4000-7f9a16eb5000 r--p 00017000 08:01 9050338 /lib/x86_64-linux-gnu/libpthread-2.15.so
> 7f9a16eb5000-7f9a16eb6000 rw-p 00018000 08:01 9050338 /lib/x86_64-linux-gnu/libpthread-2.15.so
> 7f9a16eb6000-7f9a16eba000 rw-p 00000000 00:00 0
> 7f9a16eba000-7f9a16edc000 r-xp 00000000 08:01 9050344 /lib/x86_64-linux-gnu/ld-2.15.so
> 7f9a170c1000-7f9a170c4000 rw-p 00000000 00:00 0
> 7f9a170d9000-7f9a170dc000 rw-p 00000000 00:00 0
> 7f9a170dc000-7f9a170dd000 r--p 00022000 08:01 9050344 /lib/x86_64-linux-gnu/ld-2.15.so
> 7f9a170dd000-7f9a170df000 rw-p 00023000 08:01 9050344 /lib/x86_64-linux-gnu/ld-2.15.so
> 7fff52f27000-7fff52f48000 rw-p 00000000 00:00 0 [stack]
> 7fff52fff000-7fff53000000 r-xp 00000000 00:00 0 [vdso]
> ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
>

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji


