[mpich-discuss] Problems running MPICH jobs under SLURM
Biddiscombe, John A.
biddisco at cscs.ch
Fri Jun 7 03:40:26 CDT 2013
I downloaded the nightly tarball (mpich-master-v3.0.4-259-gf322ce79), then recompiled and installed MPICH.
I still get the error shown below with a simple hello world program.
Now you must understand that I have no idea what I'm doing (really). I wanted to test some debugging features under SLURM, so I installed SLURM myself on a workstation with just 2 cores and a bare minimum setup. I start it as follows:
sudo munged &
sudo slurmd &
sudo slurmctld -D
After that I can run jobs on the local machine and everything seems to be OK, except that MPI jobs always fail with the double free error shown below when run under SLURM, yet they run fine when started directly from the command line.
My suspicion is that SLURM is not actually using the Hydra process manager that I just compiled. I installed SLURM from RPMs. Should I recompile SLURM myself and somehow tell it which MPI to use?
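One way to narrow down the difference between the batch run and the interactive run is to compare what each context exposes to the launched process. The following is only a minimal sketch: the function name `dump_launch_env` is hypothetical, and the choice of the `SLURM_`, `PMI_` and `HYDRA_` prefixes is an assumption about which variables might be relevant. Which variables actually appear depends on the SLURM and MPICH versions installed.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original job script.
# Prints which mpiexec the shell resolves, then every SLURM_*, PMI_* and
# HYDRA_* environment variable, sorted so that two runs can be diffed.
dump_launch_env() {
    # command -v prints the resolved path; fall back to a note if none is found
    echo "mpiexec: $(command -v mpiexec || echo 'not on PATH')"
    # The prefix list is an assumption; extend it if other variables matter.
    # If nothing matches, sort still exits 0 and the output is just the first line.
    env | grep -E '^(SLURM|PMI|HYDRA)_' | sort
}
```

Calling `dump_launch_env` once from an interactive shell and once from inside the job script, each time redirecting the output to a separate file, gives two listings that can be compared with `diff`. A different `mpiexec` path in the two files would support the suspicion that the batch job is not picking up the freshly built Hydra.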
My job script looks as follows:
######################
#!/bin/bash
#
# Create the job script from the supplied parameters
#
#SBATCH --job-name=pvserver
#SBATCH --time=00:04:00
#SBATCH --nodes=1
#SBATCH --partition=normal
#SBATCH --output=/home/biddisco/slurm.out
#SBATCH --error=/home/biddisco/slurm.err
#SBATCH --mem=2048MB
#export
# echo "Path is $PATH"
# echo "LD_LIBRARY_PATH is " $LD_LIBRARY_PATH
# cd /home/biddisco/build/pv-38/bin/
#export PMI_DEBUG=9
#ulimit -s unlimited
#ulimit -c 0
/home/biddisco/apps/mpich-3.0.4/bin/mpiexec -rmk slurm -n 2 /home/biddisco/build/hello/hello
######################
The result is the same with or without the -rmk slurm option, and with or without the ulimit settings.
Apologies for wasting your time. I'm certain I'm doing something wrong; I just don't know what.
JB
biddisco@breno2 ~ $ more ~/slurm.err
*** glibc detected *** /home/biddisco/build/hello/hello: double free or corruption (fasttop): 0x0000000001896340 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x7eb96)[0x7f9a1695cb96]
/home/biddisco/build/hello/hello(MPIDI_Populate_vc_node_ids+0x3f9)[0x427c89]
/home/biddisco/build/hello/hello(MPID_Init+0x136)[0x4253f6]
/home/biddisco/build/hello/hello(MPIR_Init_thread+0x22f)[0x414cbf]
/home/biddisco/build/hello/hello(MPI_Init+0xae)[0x4146ee]
/home/biddisco/build/hello/hello(main+0x22)[0x413f2e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f9a168ff76d]
/home/biddisco/build/hello/hello[0x413e31]
======= Memory map: ========
00400000-0051a000 r-xp 00000000 08:01 8661191 /home/biddisco/build/hello/hello
0071a000-00727000 r--p 0011a000 08:01 8661191 /home/biddisco/build/hello/hello
00727000-00729000 rw-p 00127000 08:01 8661191 /home/biddisco/build/hello/hello
00729000-00751000 rw-p 00000000 00:00 0
01895000-018b6000 rw-p 00000000 00:00 0 [heap]
7f9a166c8000-7f9a166dd000 r-xp 00000000 08:01 9047556 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f9a166dd000-7f9a168dc000 ---p 00015000 08:01 9047556 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f9a168dc000-7f9a168dd000 r--p 00014000 08:01 9047556 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f9a168dd000-7f9a168de000 rw-p 00015000 08:01 9047556 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f9a168de000-7f9a16a93000 r-xp 00000000 08:01 9050358 /lib/x86_64-linux-gnu/libc-2.15.so
7f9a16a93000-7f9a16c92000 ---p 001b5000 08:01 9050358 /lib/x86_64-linux-gnu/libc-2.15.so
7f9a16c92000-7f9a16c96000 r--p 001b4000 08:01 9050358 /lib/x86_64-linux-gnu/libc-2.15.so
7f9a16c96000-7f9a16c98000 rw-p 001b8000 08:01 9050358 /lib/x86_64-linux-gnu/libc-2.15.so
7f9a16c98000-7f9a16c9d000 rw-p 00000000 00:00 0
7f9a16c9d000-7f9a16cb5000 r-xp 00000000 08:01 9050338 /lib/x86_64-linux-gnu/libpthread-2.15.so
7f9a16cb5000-7f9a16eb4000 ---p 00018000 08:01 9050338 /lib/x86_64-linux-gnu/libpthread-2.15.so
7f9a16eb4000-7f9a16eb5000 r--p 00017000 08:01 9050338 /lib/x86_64-linux-gnu/libpthread-2.15.so
7f9a16eb5000-7f9a16eb6000 rw-p 00018000 08:01 9050338 /lib/x86_64-linux-gnu/libpthread-2.15.so
7f9a16eb6000-7f9a16eba000 rw-p 00000000 00:00 0
7f9a16eba000-7f9a16edc000 r-xp 00000000 08:01 9050344 /lib/x86_64-linux-gnu/ld-2.15.so
7f9a170c1000-7f9a170c4000 rw-p 00000000 00:00 0
7f9a170d9000-7f9a170dc000 rw-p 00000000 00:00 0
7f9a170dc000-7f9a170dd000 r--p 00022000 08:01 9050344 /lib/x86_64-linux-gnu/ld-2.15.so
7f9a170dd000-7f9a170df000 rw-p 00023000 08:01 9050344 /lib/x86_64-linux-gnu/ld-2.15.so
7fff52f27000-7fff52f48000 rw-p 00000000 00:00 0 [stack]
7fff52fff000-7fff53000000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]