[mpich-discuss] Error creating 272 processes on a multicore, single CPU

Gajbe, Manisha manisha.gajbe at intel.com
Tue Jul 26 11:25:17 CDT 2016


Hi Kenneth,

The configure options are the defaults except for the prefix. I used Intel compilers version 16.0.2.

$ ./configure --prefix=/usr/local/my-mpich-3.2/64bit

Hi Halim,

I removed all the writes to stdout; the output did not change. Also, the code runs with Intel MPI 5.1.3 without any issues, even with lots of writes to stdout. Do you have any suggestions on what a reasonable pipe size limit would be?

Below is output from my ulimit

[mmgajbe at fm05wcon0025 Test]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62912
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 2048
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
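
For reference, the "pipe size (512 bytes, -p) 8" line above reports the PIPE_BUF atomic-write unit (8 blocks of 512 bytes = 4096 bytes), which is fixed and not tunable through ulimit. On Linux, the kernel's per-pipe capacity (64 KiB by default, capped for unprivileged users by /proc/sys/fs/pipe-max-size) is a separate quantity that can be inspected or raised per pipe. A minimal sketch, assuming Linux 2.6.35 or newer:

/* pipe_cap.c: query and grow a pipe's kernel buffer (Linux >= 2.6.35).
   This capacity is distinct from the fixed PIPE_BUF atomic-write limit
   that "ulimit -p" reports. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0) { perror("pipe"); return 1; }

    printf("default capacity: %d bytes\n", fcntl(fds[0], F_GETPIPE_SZ));

    /* Request 1 MiB; the kernel rounds up to a power of two and caps
       unprivileged requests at /proc/sys/fs/pipe-max-size. */
    if (fcntl(fds[0], F_SETPIPE_SZ, 1 << 20) < 0)
        perror("F_SETPIPE_SZ");
    printf("new capacity:     %d bytes\n", fcntl(fds[0], F_GETPIPE_SZ));

    close(fds[0]);
    close(fds[1]);
    return 0;
}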


Hi Husen,

I tried multiple systems with different OSes such as RHEL 7.0, openSUSE, and Ubuntu 12.04. However, the error is observed only with MPICH, not with Intel MPI.


~ Manisha
-----Original Message-----
From: discuss-request at mpich.org [mailto:discuss-request at mpich.org] 
Sent: Tuesday, July 26, 2016 9:00 AM
To: discuss at mpich.org
Subject: discuss Digest, Vol 45, Issue 7



Today's Topics:

   1. Re:  Error creating 272 processes on a multicore, single CPU
      (Husen R)
   2.  Segfault with MPICH 3.2+Clang but not GCC (Andreas Noack)
   3. Re:  Segfault with MPICH 3.2+Clang but not GCC (Jeff Hammond)
   4. Re:  Segfault with MPICH 3.2+Clang but not GCC (Rob Latham)


----------------------------------------------------------------------

Message: 1
Date: Tue, 26 Jul 2016 10:58:01 +0700
From: Husen R <hus3nr at gmail.com>
To: discuss at mpich.org
Subject: Re: [mpich-discuss] Error creating 272 processes on a
	multicore, single CPU
Message-ID:
	<CACPfdUsN7N5TwU++tsoPiUGA1kiS3Cm-VGBaXotDfkKcquhiag at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

It seems the number of processes that can be created on one machine is limited by the operating system.
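
A minimal sketch of how a program can check the relevant limits itself (getrlimit is POSIX; RLIMIT_NPROC, the per-user process cap behind "ulimit -u", is a BSD/Linux extension):

#include <stdio.h>
#include <sys/resource.h>

/* Print a limit's soft and hard values; RLIM_INFINITY shows up as a
   very large number. */
static void show(const char *name, int resource)
{
    struct rlimit rl;
    if (getrlimit(resource, &rl) == 0)
        printf("%-13s soft=%llu hard=%llu\n", name,
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
}

int main(void)
{
    show("RLIMIT_NPROC", RLIMIT_NPROC);   /* max user processes (ulimit -u) */
    show("RLIMIT_NOFILE", RLIMIT_NOFILE); /* open files (ulimit -n) */
    return 0;
}

If the soft RLIMIT_NPROC value is close to the rank count plus the user's other processes, raising it with "ulimit -u" before launching may be worth trying.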

On Sat, Jul 23, 2016 at 11:19 AM, Gajbe, Manisha <manisha.gajbe at intel.com> wrote:

> Hi,
>
> I have installed mpich-3.2 on a multicore platform. When I spawn 272
> processes, I get the error message below. I am able to create up to
> 271 processes successfully.
>
> /usr/local/my-mpich-3.2/64bit/bin/mpirun -n 272 ./hello_c
>
> [cli_0]: write_line: message string doesn't end in newline: :cmd=put
> kvsname=kvs_8846_0 key=r2h1
> value=r0#0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271$:
>
> Manisha Gajbe
> MVE PQV Content
> SCE - System Content Engineering



--
Post Graduate Student
Faculty of Computer Science
University of Indonesia
Depok

------------------------------

Message: 2
Date: Tue, 26 Jul 2016 11:17:05 -0400
From: Andreas Noack <andreasnoackjensen at gmail.com>
To: discuss at mpich.org
Subject: [mpich-discuss] Segfault with MPICH 3.2+Clang but not GCC
Message-ID:
	<CAFKYB6mgy8bO7Mw05it4w78XDRoh70KxUu1epkZfv34=KqL=yQ at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On my El Capitan MacBook I get a segfault when running the program below with more than a single process, but only when MPICH has been compiled with Clang.

I don't get very good debug info, but here is some of what I got:

(lldb) c
Process 61129 resuming
Process 61129 stopped
* thread #1: tid = 0x32c438, 0x00000003119d0432 libpmpi.12.dylib`MPID_Request_create + 244, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00000003119d0432 libpmpi.12.dylib`MPID_Request_create + 244
libpmpi.12.dylib`MPID_Request_create:
->  0x3119d0432 <+244>: movaps %xmm0, 0x230(%rax)
    0x3119d0439 <+251>: movq   $0x0, 0x240(%rax)
    0x3119d0444 <+262>: movl   %ecx, 0x210(%rax)
    0x3119d044a <+268>: popq   %rbp

My version of Clang is

Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

and the bug has been confirmed by my colleague, who is running Linux and compiling with Clang 3.8. The program runs fine with Open MPI and Clang.

#include <mpi.h>
#include <stdint.h>  /* uint64_t */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm = MPI_COMM_WORLD;
    uint64_t *A, *C;
    int rnk;

    MPI_Comm_rank(comm, &rnk);
    A = calloc(1, sizeof(uint64_t));
    C = calloc(2, sizeof(uint64_t));
    A[0] = rnk + 1;

    MPI_Allgather(A, 1, MPI_UINT64_T, C, 1, MPI_UINT64_T, comm);

    MPI_Finalize();
    return 0;
}


Best regards

Andreas Noack
Postdoctoral Associate
Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

------------------------------

Message: 3
Date: Tue, 26 Jul 2016 08:56:03 -0700
From: Jeff Hammond <jeff.science at gmail.com>
To: MPICH <discuss at mpich.org>
Subject: Re: [mpich-discuss] Segfault with MPICH 3.2+Clang but not GCC
Message-ID:
	<CAGKz=uL+3P=d7W1TP9R9Teafip44aMnnk9c6nzuL_yDF_u+bsw at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

I cannot reproduce this.  I am using Darwin 15.5.0 instead of 15.6.0, but the compiler is identical.  I am using MPICH Git master from June 29.

At this point, it is unclear to me if the bug is in MPICH or Clang.

Jeff

vsanthan-mobl1:BUGS jrhammon$ /opt/mpich/dev/clang/default/bin/mpichversion
MPICH Version:    3.2
MPICH Release date: unreleased development copy
MPICH Device:    ch3:nemesis
MPICH configure: CC=clang CXX=clang++ FC=false F77=false --enable-cxx --disable-fortran --with-pm=hydra --prefix=/opt/mpich/dev/clang/default --enable-cxx --enable-wrapper-rpath --disable-static --enable-shared
MPICH CC: clang    -O2
MPICH CXX: clang++   -O2
MPICH F77: false
MPICH FC: false

vsanthan-mobl1:BUGS jrhammon$ /opt/mpich/dev/clang/default/bin/mpicc -v
mpicc for MPICH version 3.2
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
clang: warning: argument unused during compilation: '-I /opt/mpich/dev/clang/default/include'

On Tue, Jul 26, 2016 at 8:17 AM, Andreas Noack <andreasnoackjensen at gmail.com> wrote:

> On my El Capitan MacBook I get a segfault when running the program
> below with more than a single process, but only when MPICH has been
> compiled with Clang.
>
> [...]



--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/

------------------------------

Message: 4
Date: Tue, 26 Jul 2016 11:00:00 -0500
From: Rob Latham <robl at mcs.anl.gov>
To: <discuss at mpich.org>
Subject: Re: [mpich-discuss] Segfault with MPICH 3.2+Clang but not GCC
Message-ID: <57978900.8030207 at mcs.anl.gov>
Content-Type: text/plain; charset="windows-1252"; format=flowed



On 07/26/2016 10:17 AM, Andreas Noack wrote:
> On my El Capitan MacBook I get a segfault when running the program
> below with more than a single process, but only when MPICH has been
> compiled with Clang.
>
> I don't get very good debug info, but here is some of what I got


valgrind is pretty good at sussing out these sorts of things:

==18132== Unaddressable byte(s) found during client check request
==18132==    at 0x504D1D7: MPIR_Localcopy (helper_fns.c:84)
==18132==    by 0x4EC8EA1: MPIR_Allgather_intra (allgather.c:169)
==18132==    by 0x4ECA5EC: MPIR_Allgather (allgather.c:791)
==18132==    by 0x4ECA7A4: MPIR_Allgather_impl (allgather.c:832)
==18132==    by 0x4EC8B5C: MPID_Allgather (mpid_coll.h:61)
==18132==    by 0x4ECB9F7: PMPI_Allgather (allgather.c:978)
==18132==    by 0x4008F5: main (noack_segv.c:18)
==18132==  Address 0x6f2f138 is 8 bytes after a block of size 16 alloc'd
==18132==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18132==    by 0x4008B0: main (noack_segv.c:15)
==18132==
==18132== Invalid write of size 8
==18132==    at 0x4C326CB: memcpy@@GLIBC_2.14 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18132==    by 0x504D31B: MPIR_Localcopy (helper_fns.c:84)
==18132==    by 0x4EC8EA1: MPIR_Allgather_intra (allgather.c:169)
==18132==    by 0x4ECA5EC: MPIR_Allgather (allgather.c:791)
==18132==    by 0x4ECA7A4: MPIR_Allgather_impl (allgather.c:832)
==18132==    by 0x4EC8B5C: MPID_Allgather (mpid_coll.h:61)
==18132==    by 0x4ECB9F7: PMPI_Allgather (allgather.c:978)
==18132==    by 0x4008F5: main (noack_segv.c:18)
==18132==  Address 0x6f2f138 is 8 bytes after a block of size 16 alloc'd
==18132==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18132==    by 0x4008B0: main (noack_segv.c:15)
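
For reference, a typical way to run the reproducer under valgrind with the MPICH launcher, one valgrind instance per rank:

$ mpiexec -n 2 valgrind ./noack_segv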


>
>      MPI_Comm_rank(comm, &rnk);
>      A = calloc(1, sizeof(uint64_t));
>      C = calloc(2, sizeof(uint64_t));
>      A[0] = rnk + 1;
>
>      MPI_Allgather(A, 1, MPI_UINT64_T, C, 1, MPI_UINT64_T, comm);

Your 'buf count tuple' is OK for A: every process sends one uint64_t.

Your 'buf count tuple' is too small for C if there are more than 2 processes.

When you say "more than one"... do you mean 2?
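
A minimal sketch of the fix, assuming the intent is to gather one element from every rank: replace the calloc(2, ...) line with a buffer sized by the communicator.

    int nprocs;
    MPI_Comm_size(comm, &nprocs);
    /* MPI_Allgather with recvcount 1 delivers one uint64_t per rank,
       so the receive buffer needs nprocs elements. */
    C = calloc(nprocs, sizeof(uint64_t));

With the buffer sized this way, the invalid writes valgrind reports above no longer occur.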

==rob


------------------------------

_______________________________________________
discuss mailing list
discuss at mpich.org
https://lists.mpich.org/mailman/listinfo/discuss

End of discuss Digest, Vol 45, Issue 7
**************************************