[mpich-discuss] Error creating 272 processes on a multicore, single CPU
Kenneth Raffenetti
raffenet at mcs.anl.gov
Thu Jul 28 10:51:51 CDT 2016
Can you also send the output of:
/usr/local/my-mpich-3.2/64bit/bin/mpichversion
/usr/local/my-mpich-3.2/64bit/bin/mpiexec -info
Something in your error output doesn't look right to me. I'm having a
hard time finding the code that would execute a PMI command that causes
the string buffer to overflow and omit a newline.
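
For reference, a minimal sketch of the kind of code that could produce it: a fixed-size line buffer where snprintf truncation silently drops the trailing '\n'. The buffer size and function name below are illustrative assumptions, not the actual MPICH source. Note that a 272-entry rank list in the value field comes out close to 1 KiB, which would be consistent with 271 ranks working and 272 failing if the line buffer is about that size.

#include <stdio.h>
#include <string.h>

#define MAXLINE 1024    /* illustrative; not necessarily MPICH's actual limit */

/* Hypothetical writer: if the formatted cmd=put line exceeds MAXLINE,
   snprintf truncates it, the trailing '\n' is lost, and a sanity check
   like this one reports the error seen in the log. */
int write_line_checked(const char *kvsname, const char *key, const char *value)
{
    char buf[MAXLINE];

    snprintf(buf, sizeof(buf), "cmd=put kvsname=%s key=%s value=%s\n",
             kvsname, key, value);
    if (buf[0] == '\0' || buf[strlen(buf) - 1] != '\n') {
        fprintf(stderr, "write_line: message string doesn't end in newline: :%s:\n", buf);
        return -1;
    }
    /* ... otherwise write buf to the PMI file descriptor ... */
    return 0;
}
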
Ken
On 07/26/2016 11:25 AM, Gajbe, Manisha wrote:
> Hi Kenneth,
>
> The configure script is default except the "prefix". I used Intel compilers version 16.0.2.
>
> $ ./configure --prefix=/usr/local/my-mpich-3.2/64bit
>
> Hi Halim,
>
> I removed all the writes to stdout; the output did not change. Also, the code runs without any issues under IntelMPI 5.1.3, even with lots of writes to stdout. Do you have any suggestion on what a reasonable pipe size limit would be?
>
> Below is output from my ulimit
>
> [mmgajbe at fm05wcon0025 Test]$ ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size (blocks, -f) unlimited
> pending signals (-i) 62912
> max locked memory (kbytes, -l) 64
> max memory size (kbytes, -m) unlimited
> open files (-n) 2048
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority (-r) 0
> stack size (kbytes, -s) 8192
> cpu time (seconds, -t) unlimited
> max user processes (-u) 4096
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
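>
> For what it's worth, the "pipe size" line above (8 blocks of 512 bytes = 4096 bytes) is PIPE_BUF, the atomic-write unit, not the total pipe capacity; on Linux the capacity defaults to 64 KiB and can be queried or raised per pipe. A minimal sketch, using Linux-specific fcntl flags (this is separate from anything MPICH does internally):
>
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
>
> int main(void)
> {
>     int fds[2];
>     if (pipe(fds) != 0) { perror("pipe"); return 1; }
>
>     /* Total capacity of this pipe (defaults to 65536 on Linux) */
>     printf("pipe capacity: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));
>
>     /* Try to grow it to 1 MiB; capped by /proc/sys/fs/pipe-max-size */
>     if (fcntl(fds[1], F_SETPIPE_SZ, 1 << 20) < 0)
>         perror("F_SETPIPE_SZ");
>
>     close(fds[0]);
>     close(fds[1]);
>     return 0;
> }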
>
>
> Hi Husen,
>
> I tried on multiple systems with different OSes such as RHEL 7.0, OpenSUSE, Ubuntu 12.04, etc. However, the error is observed only with MPICH, not with IntelMPI.
>
>
> ~ Manisha
> -----Original Message-----
> From: discuss-request at mpich.org [mailto:discuss-request at mpich.org]
> Sent: Tuesday, July 26, 2016 9:00 AM
> To: discuss at mpich.org
> Subject: discuss Digest, Vol 45, Issue 7
>
> Send discuss mailing list submissions to
> discuss at mpich.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.mpich.org/mailman/listinfo/discuss
> or, via email, send a message with subject or body 'help' to
> discuss-request at mpich.org
>
> You can reach the person managing the list at
> discuss-owner at mpich.org
>
> When replying, please edit your Subject line so it is more specific than "Re: Contents of discuss digest..."
>
>
> Today's Topics:
>
> 1. Re: Error creating 272 processes on a multicore, single CPU
> (Husen R)
> 2. Segfault with MPICH 3.2+Clang but not GCC (Andreas Noack)
> 3. Re: Segfault with MPICH 3.2+Clang but not GCC (Jeff Hammond)
> 4. Re: Segfault with MPICH 3.2+Clang but not GCC (Rob Latham)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 26 Jul 2016 10:58:01 +0700
> From: Husen R <hus3nr at gmail.com>
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Error creating 272 processes on a
> multicore, single CPU
> Message-ID:
> <CACPfdUsN7N5TwU++tsoPiUGA1kiS3Cm-VGBaXotDfkKcquhiag at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> It seems the number of processes that can be created on one machine is limited by the operating system.
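>
> For example, those limits can be checked from C with getrlimit; a minimal sketch (RLIMIT_NPROC corresponds to the "max user processes" line in ulimit -a, RLIMIT_NOFILE to "open files"):
>
> #include <stdio.h>
> #include <sys/resource.h>
>
> int main(void)
> {
>     struct rlimit rl;
>
>     /* Max processes per user ("ulimit -u") */
>     getrlimit(RLIMIT_NPROC, &rl);
>     printf("max user processes: soft=%ld hard=%ld\n",
>            (long)rl.rlim_cur, (long)rl.rlim_max);
>
>     /* Max open descriptors ("ulimit -n"); a launcher also holds
>        several descriptors for each local process it spawns */
>     getrlimit(RLIMIT_NOFILE, &rl);
>     printf("open files: soft=%ld hard=%ld\n",
>            (long)rl.rlim_cur, (long)rl.rlim_max);
>     return 0;
> }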
>
> On Sat, Jul 23, 2016 at 11:19 AM, Gajbe, Manisha <manisha.gajbe at intel.com> wrote:
>
>> Hi,
>>
>>
>>
>> I have installed mpich-3.2 on a multicore platform. When I spawn 272
>> processes, I get the error message below. I am able to create up to
>> 271 processes successfully.
>>
>>
>>
>> /usr/local/my-mpich-3.2/64bit/bin/mpirun -n 272 ./hello_c
>>
>>
>>
>> [cli_0]: write_line: message string doesn't end in newline: :cmd=put
>> kvsname=kvs_8846_0 key=r2h1
>> value=r0#0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271$
>> :
>>
>>
>>
>>
>>
>>
>>
>> *Manisha Gajbe*
>>
>> *MVE PQV Content*
>>
>> SCE – System Content Engineering
>>
>>
>>
>> _______________________________________________
>> discuss mailing list discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>>
>
>
>
> --
> Post Graduate Student
> Faculty of Computer Science
> University of Indonesia
> Depok
>
> ------------------------------
>
> Message: 2
> Date: Tue, 26 Jul 2016 11:17:05 -0400
> From: Andreas Noack <andreasnoackjensen at gmail.com>
> To: discuss at mpich.org
> Subject: [mpich-discuss] Segfault with MPICH 3.2+Clang but not GCC
> Message-ID:
> <CAFKYB6mgy8bO7Mw05it4w78XDRoh70KxUu1epkZfv34=KqL=yQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On my El Capitan MacBook I get a segfault when running the program below with more than a single process, but only when MPICH has been compiled with Clang.
>
> I don't get very good debug info, but here is some of what I got:
>
> (lldb) c
> Process 61129 resuming
> Process 61129 stopped
> * thread #1: tid = 0x32c438, 0x00000003119d0432 libpmpi.12.dylib`MPID_Request_create + 244, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
> frame #0: 0x00000003119d0432 libpmpi.12.dylib`MPID_Request_create + 244
> libpmpi.12.dylib`MPID_Request_create:
> -> 0x3119d0432 <+244>: movaps %xmm0, 0x230(%rax)
> 0x3119d0439 <+251>: movq $0x0, 0x240(%rax)
> 0x3119d0444 <+262>: movl %ecx, 0x210(%rax)
> 0x3119d044a <+268>: popq %rbp
>
> My version of Clang is
>
> Apple LLVM version 7.3.0 (clang-703.0.31)
> Target: x86_64-apple-darwin15.6.0
> Thread model: posix
> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
>
> and the bug has been confirmed by my colleague who is running Linux and compiling with Clang 3.8. The program runs fine with OpenMPI+Clang.
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
>     MPI_Init(&argc, &argv);
>
>     MPI_Comm comm = MPI_COMM_WORLD;
>     uint64_t *A, *C;
>     int rnk;
>
>     MPI_Comm_rank(comm, &rnk);
>     A = calloc(1, sizeof(uint64_t));
>     C = calloc(2, sizeof(uint64_t));
>     A[0] = rnk + 1;
>
>     MPI_Allgather(A, 1, MPI_UINT64_T, C, 1, MPI_UINT64_T, comm);
>
>     MPI_Finalize();
>     return 0;
> }
>
>
> Best regards
>
> Andreas Noack
> Postdoctoral Associate
> Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
>
> ------------------------------
>
> Message: 3
> Date: Tue, 26 Jul 2016 08:56:03 -0700
> From: Jeff Hammond <jeff.science at gmail.com>
> To: MPICH <discuss at mpich.org>
> Subject: Re: [mpich-discuss] Segfault with MPICH 3.2+Clang but not GCC
> Message-ID:
> <CAGKz=uL+3P=d7W1TP9R9Teafip44aMnnk9c6nzuL_yDF_u+bsw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> I cannot reproduce this. I am using Darwin 15.5.0 instead of 15.6.0, but the compiler is identical. I am using MPICH Git master from June 29.
>
> At this point, it is unclear to me if the bug is in MPICH or Clang.
>
> Jeff
>
> vsanthan-mobl1:BUGS jrhammon$ /opt/mpich/dev/clang/default/bin/mpichversion
>
> MPICH Version: 3.2
>
> MPICH Release date: unreleased development copy
>
> MPICH Device: ch3:nemesis
>
> MPICH configure: CC=clang CXX=clang++ FC=false F77=false --enable-cxx --disable-fortran --with-pm=hydra --prefix=/opt/mpich/dev/clang/default
> --enable-cxx --enable-wrapper-rpath --disable-static --enable-shared
>
> MPICH CC: clang -O2
>
> MPICH CXX: clang++ -O2
>
> MPICH F77: false
>
> MPICH FC: false
>
> vsanthan-mobl1:BUGS jrhammon$ /opt/mpich/dev/clang/default/bin/mpicc -v
>
> mpicc for MPICH version 3.2
>
> Apple LLVM version 7.3.0 (clang-703.0.31)
>
> Target: x86_64-apple-darwin15.5.0
>
> Thread model: posix
>
> InstalledDir:
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
>
> clang: warning: argument unused during compilation: '-I /opt/mpich/dev/clang/default/include'
>
> On Tue, Jul 26, 2016 at 8:17 AM, Andreas Noack <andreasnoackjensen at gmail.com> wrote:
>
>> On my El Capitan MacBook I get a segfault when running the program
>> below with more than a single process, but only when MPICH has been
>> compiled with Clang.
>>
>> I don't get very good debug info, but here is some of what I got:
>>
>> (lldb) c
>> Process 61129 resuming
>> Process 61129 stopped
>> * thread #1: tid = 0x32c438, 0x00000003119d0432
>> libpmpi.12.dylib`MPID_Request_create + 244, queue =
>> 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
>> frame #0: 0x00000003119d0432 libpmpi.12.dylib`MPID_Request_create
>> + 244
>> libpmpi.12.dylib`MPID_Request_create:
>> -> 0x3119d0432 <+244>: movaps %xmm0, 0x230(%rax)
>> 0x3119d0439 <+251>: movq $0x0, 0x240(%rax)
>> 0x3119d0444 <+262>: movl %ecx, 0x210(%rax)
>> 0x3119d044a <+268>: popq %rbp
>>
>> My version of Clang is
>>
>> Apple LLVM version 7.3.0 (clang-703.0.31)
>> Target: x86_64-apple-darwin15.6.0
>> Thread model: posix
>> InstalledDir: /Library/Developer/CommandLineTools/usr/bin
>>
>> and the bug has been confirmed by my colleague who is running Linux
>> and compiling with Clang 3.8. The program runs fine with OpenMPI+Clang.
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char *argv[])
>> {
>>     MPI_Init(&argc, &argv);
>>
>>     MPI_Comm comm = MPI_COMM_WORLD;
>>     uint64_t *A, *C;
>>     int rnk;
>>
>>     MPI_Comm_rank(comm, &rnk);
>>     A = calloc(1, sizeof(uint64_t));
>>     C = calloc(2, sizeof(uint64_t));
>>     A[0] = rnk + 1;
>>
>>     MPI_Allgather(A, 1, MPI_UINT64_T, C, 1, MPI_UINT64_T, comm);
>>
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>>
>> Best regards
>>
>> Andreas Noack
>> Postdoctoral Associate
>> Computer Science and Artificial Intelligence Laboratory Massachusetts
>> Institute of Technology
>>
>> _______________________________________________
>> discuss mailing list discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
>>
>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/
>
> ------------------------------
>
> Message: 4
> Date: Tue, 26 Jul 2016 11:00:00 -0500
> From: Rob Latham <robl at mcs.anl.gov>
> To: <discuss at mpich.org>
> Subject: Re: [mpich-discuss] Segfault with MPICH 3.2+Clang but not GCC
> Message-ID: <57978900.8030207 at mcs.anl.gov>
> Content-Type: text/plain; charset="windows-1252"; format=flowed
>
>
>
> On 07/26/2016 10:17 AM, Andreas Noack wrote:
>> On my El Capitan MacBook I get a segfault when running the program
>> below with more than a single process, but only when MPICH has been
>> compiled with Clang.
>>
>> I don't get very good debug info, but here is some of what I got:
>
>
> valgrind is pretty good at sussing out these sorts of things:
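>
> (Invocation for reference, assuming the source file is named noack_segv.c as in the trace below: mpicc -g noack_segv.c -o noack_segv && mpiexec -n 4 valgrind ./noack_segv)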
>
> ==18132== Unaddressable byte(s) found during client check request
> ==18132== at 0x504D1D7: MPIR_Localcopy (helper_fns.c:84)
> ==18132== by 0x4EC8EA1: MPIR_Allgather_intra (allgather.c:169)
> ==18132== by 0x4ECA5EC: MPIR_Allgather (allgather.c:791)
> ==18132== by 0x4ECA7A4: MPIR_Allgather_impl (allgather.c:832)
> ==18132== by 0x4EC8B5C: MPID_Allgather (mpid_coll.h:61)
> ==18132== by 0x4ECB9F7: PMPI_Allgather (allgather.c:978)
> ==18132== by 0x4008F5: main (noack_segv.c:18)
> ==18132== Address 0x6f2f138 is 8 bytes after a block of size 16 alloc'd
> ==18132== at 0x4C2FB55: calloc (in
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==18132== by 0x4008B0: main (noack_segv.c:15)
> ==18132==
> ==18132== Invalid write of size 8
> ==18132== at 0x4C326CB: memcpy@@GLIBC_2.14 (in
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==18132== by 0x504D31B: MPIR_Localcopy (helper_fns.c:84)
> ==18132== by 0x4EC8EA1: MPIR_Allgather_intra (allgather.c:169)
> ==18132== by 0x4ECA5EC: MPIR_Allgather (allgather.c:791)
> ==18132== by 0x4ECA7A4: MPIR_Allgather_impl (allgather.c:832)
> ==18132== by 0x4EC8B5C: MPID_Allgather (mpid_coll.h:61)
> ==18132== by 0x4ECB9F7: PMPI_Allgather (allgather.c:978)
> ==18132== by 0x4008F5: main (noack_segv.c:18)
> ==18132== Address 0x6f2f138 is 8 bytes after a block of size 16 alloc'd
> ==18132== at 0x4C2FB55: calloc (in
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==18132== by 0x4008B0: main (noack_segv.c:15)
>
>
>>
>>     MPI_Comm_rank(comm, &rnk);
>>     A = calloc(1, sizeof(uint64_t));
>>     C = calloc(2, sizeof(uint64_t));
>>     A[0] = rnk + 1;
>>
>>     MPI_Allgather(A, 1, MPI_UINT64_T, C, 1, MPI_UINT64_T, comm);
>
> Your 'buf count tuple' is OK for A: every process sends one uint64.
>
> Your 'buf count tuple' is too small for C if there are more than 2 processes: the receive buffer must hold one element per rank.
>
> When you say "more than one"... do you mean 2?
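>
> A minimal fix sketch, sizing the receive buffer by the communicator size instead of a constant:
>
>     int nprocs;
>     MPI_Comm_size(comm, &nprocs);
>     /* one uint64_t slot per rank in the gathered result */
>     C = calloc(nprocs, sizeof(uint64_t));
>
>     MPI_Allgather(A, 1, MPI_UINT64_T, C, 1, MPI_UINT64_T, comm);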
>
> ==rob
>
>
> ------------------------------
>
> _______________________________________________
> discuss mailing list
> discuss at mpich.org
> https://lists.mpich.org/mailman/listinfo/discuss
>
> End of discuss Digest, Vol 45, Issue 7
> **************************************
> _______________________________________________
> discuss mailing list discuss at mpich.org
> To manage subscription options or unsubscribe:
> https://lists.mpich.org/mailman/listinfo/discuss
>
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss