[mpich-discuss] Cannot catch std::bad_alloc?

Zhen Wang toddwz at gmail.com
Wed Apr 3 14:16:51 CDT 2019


OK. After help from several forums, I think I understand the cause of the
problem. As Jeff said, it has nothing to do with MPI. I do need 2 processes
to trigger this issue though. Hui's comment also helps.

Linux allows overcommitting. See here
<https://www.kernel.org/doc/Documentation/vm/overcommit-accounting> and here
<https://www.etalabs.net/overcommit.html>. In my case, say the machine has
32GB of RAM, 2GB is already used, and each MPI process tries to allocate
8GB at a time. Then this happens:

Initial memory usage: 2GB
After the first memory allocation (8GB per process): 18GB
During the second memory allocation, each MPI process thinks it has enough
space for another 8GB because of overcommitting, and writes to it. But that
would require 34GB, exceeding the RAM size. So the out-of-memory (OOM) killer
on Linux kills one of the MPI processes (sends it SIGKILL), and the other MPI
process receives SIGTERM.

This also explains my questions above.
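
Here is a minimal sketch that reproduces the mechanism without MPI (the
8GB size is an assumption; pick something larger than your free physical
RAM). Under the default Linux overcommit policy the allocation itself
succeeds, and the failure only appears once the pages are written:

#include <cstddef>
#include <cstring>
#include <iostream>
#include <new>

int main() {
    const std::size_t n = 8ULL * 1024 * 1024 * 1024; // 8GB (assumed size)
    try {
        char* p = new char[n];  // reserves virtual memory; usually succeeds
        std::cout << "allocation succeeded\n";
        std::memset(p, 1, n);   // faults in physical pages; with overcommit
                                // the OOM killer may SIGKILL the process
                                // here instead of any exception being thrown
        delete[] p;
    } catch (const std::bad_alloc&) {
        // Only reachable if the kernel refuses the reservation up front,
        // e.g. with strict accounting (vm.overcommit_memory = 2).
        std::cout << "caught std::bad_alloc\n";
    }
    return 0;
}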

The reason this problem doesn't happen on Windows is that Windows doesn't
allow overcommit
<https://superuser.com/questions/1194263/will-microsoft-windows-10-overcommit-memory>.
I don't know about Mac.
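
For reference, you can check which policy a Linux machine uses by reading
/proc/sys/vm/overcommit_memory (the values are documented in the kernel
doc linked above). A small sketch:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Per Documentation/vm/overcommit-accounting:
    // 0 = heuristic overcommit (the default), 1 = always overcommit,
    // 2 = strict accounting (large allocations can fail up front).
    std::ifstream f("/proc/sys/vm/overcommit_memory");
    std::string mode;
    if (f >> mode) {
        std::cout << "vm.overcommit_memory = " << mode << "\n";
    } else {
        std::cout << "could not read /proc/sys/vm/overcommit_memory\n";
    }
    return 0;
}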

Thanks again everyone.

Best regards,
Zhen


On Wed, Apr 3, 2019 at 1:23 PM Joachim Protze via discuss <discuss at mpich.org>
wrote:

> Jeff,
>
> as I understand the original question, this is the intention of the
> statement: reproducibly trigger an allocation exception, which should be
> caught by the try-catch block.
>
>
> Hui,
>
> As you will probably find, replacing the vector resize with an
> allocation via new will not give you the intended behavior:
>
> std::vector<double*> a(100);
> ...
> a[i] = new double[1000000000];
>
> The reason is that new allocates virtual memory. You will see a
> segfault when you try to initialize the whole array and thereby request
> physical memory for it.
> The vector resize operation allocates the memory and immediately
> initializes it. Therefore it is not the allocation itself that fails,
> but ultimately the access to the memory during initialization.
> I have no explanation for why you can see the bad_alloc when executing
> on only one rank.
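>
> A minimal self-contained sketch of the new-vs-resize difference (the
> size is an assumption, and the exact behavior depends on the kernel's
> overcommit policy):
>
> #include <cstddef>
> #include <vector>
>
> int main() {
>     const std::size_t n = 1000000000; // ~8GB of doubles
>
>     // new[] leaves the array uninitialized, so only virtual address
>     // space is reserved; the allocation usually "succeeds" and the
>     // crash is deferred until the memory is actually written:
>     double* p = new double[n];
>     p[0] = 1.0; // physical pages are only faulted in on access
>
>     // resize() value-initializes every element, so all pages are
>     // touched immediately; the process may be killed right here
>     // rather than throw std::bad_alloc:
>     std::vector<double> v;
>     v.resize(n);
>
>     delete[] p;
>     return 0;
> }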
>
> Best
> Joachim
>
>
> On 4/3/19 6:51 PM, Jeff Hammond via discuss wrote:
> > This error has nothing to do with MPI.
> >
> > These two lines are the problem:
> >
> > std::vector<std::vector<double> > a(100);
> > a[i].resize(1000000000);
> >
> > You are trying to resize each inner vector to 1e9 doubles, which is
> > about 8 gigabytes of memory per vector.
> >
> > Jeff
> >
> >
> > On Wed, Apr 3, 2019 at 8:58 AM Zhou, Hui via discuss
> > <discuss at mpich.org> wrote:
> >  >
> >  > I just tested on my Mac; the system kept popping up messages asking
> > me to force-kill some applications to free up memory, but the code never
> > crashed, it just stalled. So I guess the out-of-memory behavior is
> > operating-system dependent. My guess is that when you have multiple
> > processes competing for memory, it may cause a race in the OS: the OS
> > returns a valid address but discovers it is out of memory later because
> > the memory has been taken by another process.
> >  >
> >  > —
> >  > Hui Zhou
> >  > T: 630-252-3430
> >  >
> >  > On Apr 3, 2019, at 8:42 AM, Zhen Wang via discuss
> > <discuss at mpich.org> wrote:
> >  >
> >  > Hi,
> >  >
> >  > I have difficulty catching std::bad_alloc in an MPI environment. The
> > code is attached. I'm using gcc 6.3 on SUSE Linux Enterprise Server 11
> > (x86_64). mpich is built from source. The commands are as follows:
> >  >
> >  > Build
> >  > g++ memory.cpp -I<mpich-3.3-opt/include> -L<mpich-3.3-opt/lib> -lmpi
> >  >
> >  > Run
> >  > <mpich-3.3-opt/bin/mpiexec> -n 2 a.out
> >  >
> >  > Output
> >  > 0
> >  > 0
> >  > 1
> >  > 1
> >  >
> >  >
> > ===================================================================================
> >  > =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> >  > =   PID 16067 RUNNING AT <machine name>
> >  > =   EXIT CODE: 9
> >  > =   CLEANING UP REMAINING PROCESSES
> >  > =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> >  >
> > ===================================================================================
> >  > YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
> >  > This typically refers to a problem with your application.
> >  > Please see the FAQ page for debugging suggestions
> >  >
> >  > If I uncomment the line //if (rank == 0), i.e., only rank 0 allocates
> > memory, I'm able to catch bad_alloc as I expected. It seems that I am
> > misunderstanding something. Could you please help? Thanks a lot.
> >  >
> >  >
> >  > Best regards,
> >  > Zhen
> >  > <memory.cpp>
> >
> > --
> > Jeff Hammond
> > jeff.science at gmail.com
> > http://jeffhammond.github.io/
> >
> >
>
>
> --
> Dipl.-Inf. Joachim Protze
>
> IT Center
> Group: High Performance Computing
> Division: Computational Science and Engineering
> RWTH Aachen University
> Seffenter Weg 23
> D 52074  Aachen (Germany)
> Tel: +49 241 80-24765
> Fax: +49 241 80-624765
> protze at itc.rwth-aachen.de
> www.itc.rwth-aachen.de