[mpich-discuss] Error creating 272 processes on a multicore, single CPU
Gajbe, Manisha
manisha.gajbe at intel.com
Tue Jul 26 11:25:17 CDT 2016
Hi Kenneth,
The configure script is default except the "prefix". I used Intel compilers version 16.0.2.
$ ./configure --prefix=/usr/local/my-mpich-3.2/64bit
Hi Halim,
I removed all the writes to stdout, but the output did not change. Also, the code runs with Intel MPI 5.1.3 without any issues, even with lots of writes to stdout. Do you have any suggestions for setting the pipe limits, and what would be a reasonable size?
Below is output from my ulimit
[mmgajbe at fm05wcon0025 Test]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 62912
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 2048
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
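For reference, the per-user limits that Hydra and the PMI wire protocol lean on can be inspected directly. A minimal sketch, with illustrative thresholds that are assumptions rather than documented MPICH requirements:

```shell
# Illustrative check: print the limits most relevant to launching
# hundreds of ranks on one node, and flag a suspiciously low fd ceiling.
procs=$(ulimit -u)   # max user processes (each rank plus hydra helpers)
files=$(ulimit -n)   # open fds (hydra opens pipes/sockets per rank)
pipe=$(ulimit -p)    # pipe capacity, reported in 512-byte blocks

echo "max processes: $procs"
echo "open files:    $files"
echo "pipe blocks:   $pipe"

# 272 ranks with a few fds each can approach a 2048-fd ceiling.
if [ "$files" != unlimited ] && [ "$files" -lt 4096 ]; then
    echo "note: open-file limit may be low for 272 ranks"
fi
```

Note that the pipe size reported by `ulimit -p` is fixed by the kernel on Linux; raising the soft process and file-descriptor limits (up to the hard limit) is done with `ulimit -u`/`ulimit -n` in the launching shell.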
Hi Husen,
I tried on multiple systems with different OSes, such as RHEL 7.0, openSUSE, and Ubuntu 12.04. However, the error is observed only with MPICH, not with Intel MPI.
~ Manisha
-----Original Message-----
From: discuss-request at mpich.org [mailto:discuss-request at mpich.org]
Sent: Tuesday, July 26, 2016 9:00 AM
To: discuss at mpich.org
Subject: discuss Digest, Vol 45, Issue 7
Send discuss mailing list submissions to
discuss at mpich.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.mpich.org/mailman/listinfo/discuss
or, via email, send a message with subject or body 'help' to
discuss-request at mpich.org
You can reach the person managing the list at
discuss-owner at mpich.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of discuss digest..."
Today's Topics:
1. Re: Error creating 272 processes on a multicore, single CPU
(Husen R)
2. Segfault with MPICH 3.2+Clang but not GCC (Andreas Noack)
3. Re: Segfault with MPICH 3.2+Clang but not GCC (Jeff Hammond)
4. Re: Segfault with MPICH 3.2+Clang but not GCC (Rob Latham)
----------------------------------------------------------------------
Message: 1
Date: Tue, 26 Jul 2016 10:58:01 +0700
From: Husen R <hus3nr at gmail.com>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] Error creating 272 processes on a
multicore, single CPU
Message-ID:
<CACPfdUsN7N5TwU++tsoPiUGA1kiS3Cm-VGBaXotDfkKcquhiag at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
It seems the number of processes that can be created on one machine is limited by the operating system.
On Sat, Jul 23, 2016 at 11:19 AM, Gajbe, Manisha <manisha.gajbe at intel.com>
wrote:
> Hi,
>
>
>
> I have installed mpich-3.2 on a multicore platform. When I spawn 272
> processes, I get the error message mentioned below. I am able to
> create up to 271 processes successfully.
>
>
>
> /usr/local/my-mpich-3.2/64bit/bin/mpirun -n 272 ./hello_c
>
>
>
> [cli_0]: write_line: message string doesn't end in newline: :cmd=put
> kvsname=kvs_8846_0 key=r2h1
> value=r0#0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271$:
>
>
>
>
>
>
>
> *Manisha Gajbe*
>
> *MVE PQV Content*
>
SCE - System Content Engineering
>
>
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
--
Post Graduate Student
Faculty of Computer Science
University of Indonesia
Depok
------------------------------
Message: 2
Date: Tue, 26 Jul 2016 11:17:05 -0400
From: Andreas Noack <andreasnoackjensen at gmail.com>
To: discuss at mpich.org
Subject: [mpich-discuss] Segfault with MPICH 3.2+Clang but not GCC
Message-ID:
<CAFKYB6mgy8bO7Mw05it4w78XDRoh70KxUu1epkZfv34=KqL=yQ at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
On my El Capitan MacBook I get a segfault when running the program below with more than a single process, but only when MPICH has been compiled with Clang.
I don't get very good debug info, but here is some of what I got:
(lldb) c
Process 61129 resuming
Process 61129 stopped
* thread #1: tid = 0x32c438, 0x00000003119d0432 libpmpi.12.dylib`MPID_Request_create + 244, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x00000003119d0432 libpmpi.12.dylib`MPID_Request_create + 244
libpmpi.12.dylib`MPID_Request_create:
-> 0x3119d0432 <+244>: movaps %xmm0, 0x230(%rax)
0x3119d0439 <+251>: movq $0x0, 0x240(%rax)
0x3119d0444 <+262>: movl %ecx, 0x210(%rax)
0x3119d044a <+268>: popq %rbp
My version of Clang is
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
and the bug has been confirmed by my colleague who is running Linux and compiling with Clang 3.8. The program runs fine with OpenMPI+Clang.
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
MPI_Comm comm = MPI_COMM_WORLD;
uint64_t *A, *C;
int rnk;
MPI_Comm_rank(comm, &rnk);
A = calloc(1, sizeof(uint64_t));
C = calloc(2, sizeof(uint64_t));
A[0] = rnk + 1;
MPI_Allgather(A, 1, MPI_UINT64_T, C, 1, MPI_UINT64_T, comm);
MPI_Finalize();
return 0;
}
Best regards
Andreas Noack
Postdoctoral Associate
Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
------------------------------
Message: 3
Date: Tue, 26 Jul 2016 08:56:03 -0700
From: Jeff Hammond <jeff.science at gmail.com>
To: MPICH <discuss at mpich.org>
Subject: Re: [mpich-discuss] Segfault with MPICH 3.2+Clang but not GCC
Message-ID:
<CAGKz=uL+3P=d7W1TP9R9Teafip44aMnnk9c6nzuL_yDF_u+bsw at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
I cannot reproduce this. I am using Darwin 15.5.0 instead of 15.6.0, but the compiler is identical. I am using MPICH Git master from June 29.
At this point, it is unclear to me if the bug is in MPICH or Clang.
Jeff
vsanthan-mobl1:BUGS jrhammon$ /opt/mpich/dev/clang/default/bin/mpichversion
MPICH Version: 3.2
MPICH Release date: unreleased development copy
MPICH Device: ch3:nemesis
MPICH configure: CC=clang CXX=clang++ FC=false F77=false --enable-cxx --disable-fortran --with-pm=hydra --prefix=/opt/mpich/dev/clang/default
--enable-cxx --enable-wrapper-rpath --disable-static --enable-shared
MPICH CC: clang -O2
MPICH CXX: clang++ -O2
MPICH F77: false
MPICH FC: false
vsanthan-mobl1:BUGS jrhammon$ /opt/mpich/dev/clang/default/bin/mpicc -v
mpicc for MPICH version 3.2
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.5.0
Thread model: posix
InstalledDir:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
clang: warning: argument unused during compilation: '-I /opt/mpich/dev/clang/default/include'
On Tue, Jul 26, 2016 at 8:17 AM, Andreas Noack <andreasnoackjensen at gmail.com
> wrote:
> On my El Capitan macbook I get a segfault when running the program
> below with more than a single process but only when MPICH has been
> compiled with Clang.
>
> I don't get that good debug info but here is some of what I got
>
> (lldb) c
> Process 61129 resuming
> Process 61129 stopped
> * thread #1: tid = 0x32c438, 0x00000003119d0432
> libpmpi.12.dylib`MPID_Request_create + 244, queue =
> 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
> frame #0: 0x00000003119d0432 libpmpi.12.dylib`MPID_Request_create
> + 244
> libpmpi.12.dylib`MPID_Request_create:
> -> 0x3119d0432 <+244>: movaps %xmm0, 0x230(%rax)
> 0x3119d0439 <+251>: movq $0x0, 0x240(%rax)
> 0x3119d0444 <+262>: movl %ecx, 0x210(%rax)
> 0x3119d044a <+268>: popq %rbp
>
> My version of Clang is
>
> Apple LLVM version 7.3.0 (clang-703.0.31)
> Target: x86_64-apple-darwin15.6.0
> Thread model: posix
> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
>
> and the bug has been confirmed by my colleague who is running Linux
> and compiling with Clang 3.8. The program runs fine with OpenMPI+Clang.
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
> MPI_Init(&argc, &argv);
>
> MPI_Comm comm = MPI_COMM_WORLD;
> uint64_t *A, *C;
> int rnk;
>
> MPI_Comm_rank(comm, &rnk);
> A = calloc(1, sizeof(uint64_t));
> C = calloc(2, sizeof(uint64_t));
> A[0] = rnk + 1;
>
> MPI_Allgather(A, 1, MPI_UINT64_T, C, 1, MPI_UINT64_T, comm);
>
> MPI_Finalize();
> return 0;
> }
>
>
> Best regards
>
> Andreas Noack
> Postdoctoral Associate
> Computer Science and Artificial Intelligence Laboratory Massachusetts
> Institute of Technology
>
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
------------------------------
Message: 4
Date: Tue, 26 Jul 2016 11:00:00 -0500
From: Rob Latham <robl at mcs.anl.gov>
To: <discuss at mpich.org>
Subject: Re: [mpich-discuss] Segfault with MPICH 3.2+Clang but not GCC
Message-ID: <57978900.8030207 at mcs.anl.gov>
Content-Type: text/plain; charset="windows-1252"; format=flowed
On 07/26/2016 10:17 AM, Andreas Noack wrote:
> On my El Capitan macbook I get a segfault when running the program
> below with more than a single process but only when MPICH has been
> compiled with Clang.
>
> I don't get that good debug info but here is some of what I got
valgrind is pretty good at sussing out these sorts of things:
==18132== Unaddressable byte(s) found during client check request
==18132== at 0x504D1D7: MPIR_Localcopy (helper_fns.c:84)
==18132== by 0x4EC8EA1: MPIR_Allgather_intra (allgather.c:169)
==18132== by 0x4ECA5EC: MPIR_Allgather (allgather.c:791)
==18132== by 0x4ECA7A4: MPIR_Allgather_impl (allgather.c:832)
==18132== by 0x4EC8B5C: MPID_Allgather (mpid_coll.h:61)
==18132== by 0x4ECB9F7: PMPI_Allgather (allgather.c:978)
==18132== by 0x4008F5: main (noack_segv.c:18)
==18132== Address 0x6f2f138 is 8 bytes after a block of size 16 alloc'd
==18132== at 0x4C2FB55: calloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18132== by 0x4008B0: main (noack_segv.c:15)
==18132==
==18132== Invalid write of size 8
==18132== at 0x4C326CB: memcpy@@GLIBC_2.14 (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18132== by 0x504D31B: MPIR_Localcopy (helper_fns.c:84)
==18132== by 0x4EC8EA1: MPIR_Allgather_intra (allgather.c:169)
==18132== by 0x4ECA5EC: MPIR_Allgather (allgather.c:791)
==18132== by 0x4ECA7A4: MPIR_Allgather_impl (allgather.c:832)
==18132== by 0x4EC8B5C: MPID_Allgather (mpid_coll.h:61)
==18132== by 0x4ECB9F7: PMPI_Allgather (allgather.c:978)
==18132== by 0x4008F5: main (noack_segv.c:18)
==18132== Address 0x6f2f138 is 8 bytes after a block of size 16 alloc'd
==18132== at 0x4C2FB55: calloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18132== by 0x4008B0: main (noack_segv.c:15)
>
> MPI_Comm_rank(comm, &rnk);
> A = calloc(1, sizeof(uint64_t));
> C = calloc(2, sizeof(uint64_t));
> A[0] = rnk + 1;
>
> MPI_Allgather(A, 1, MPI_UINT64_T, C, 1, MPI_UINT64_T, comm);
Your 'buf count tuple' is OK for A: every process sends one uint64.
Your 'buf count tuple' is too small for C if there are any more than 2 processes.
When you say "more than one"... do you mean 2?
==rob
------------------------------
_______________________________________________
discuss mailing list
discuss at mpich.org
https://lists.mpich.org/mailman/listinfo/discuss
End of discuss Digest, Vol 45, Issue 7
**************************************