<div dir="ltr">OK. After help from several forums, I think I understand the cause of the problem. As Jeff said, it has nothing to do with MPI. I do need 2 processes to trigger this issue though. Hui's comment also helps.<div><br></div><div>Linux allows over committing. See <a href="https://www.kernel.org/doc/Documentation/vm/overcommit-accounting">here</a> and <a href="https://www.etalabs.net/overcommit.html">here</a>. In my case, say the machine has 32GB RAM, 2GB is already used and each MPI process is trying to allocate 8GB at a time. Then this happens:</div><div><br></div><div>Initial memory usage: 2GB</div><div>After the first memory allocation: 18GB</div><div>In the second memory allocation, each MPI process thinks it has enough space for 8GB because of over committing, and writes to it. But that requires 34GB, exceeding RAM size. So the out of memory killer on Linux kills one of the MPI process (sends a SIGKILL signal), and the other MPI process receives a SIGTERM signal.</div><div><br></div><div>This also explains my questions above.</div><div><div><br></div><div>The reason this problem doesn't happen on Windows is <a href="https://superuser.com/questions/1194263/will-microsoft-windows-10-overcommit-memory">Windows doesn't allow over commit</a>. I don't know about Mac.</div><div><br></div><div>Thanks again everyone.</div></div><div><br></div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Best regards,<div>Zhen</div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Apr 3, 2019 at 1:23 PM Joachim Protze via discuss <<a href="mailto:discuss@mpich.org">discuss@mpich.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Jeff,<br>

as I understand the original question, that is exactly the intention of the
statement: to reproducibly trigger an allocation exception, which should be
caught by the try-catch block.

Hui,

As you will probably experience, you will not get your intended behavior by
replacing the vector resize with an allocation via new:

std::vector<double*> a(100);
...
a[i] = new double[1000000000];

The reason is that new only allocates virtual memory. You will only see a
failure (a segfault, or the process being killed) when you try to initialize
the whole array and thereby request physical memory for it.
The vector resize operation allocates the memory and immediately initializes
it. So it is not the allocation itself that fails, but the access to the
memory during initialization.
I have no explanation why you can see the bad_alloc when executing it only
on one rank.
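
To make the difference visible, here is a minimal sketch (no MPI needed; the
array size is only illustrative and may need adjusting for your machine):

#include <cstddef>
#include <cstring>
#include <iostream>
#include <new>

int main()
{
    const std::size_t n = 1000000000; // ~8 GB worth of doubles

    double* p = nullptr;
    try
    {
        // Under the default Linux overcommit policy this usually succeeds,
        // because only virtual address space is reserved here.
        p = new double[n];
    }
    catch (const std::bad_alloc&)
    {
        // You typically only get here if the kernel refuses the reservation
        // up front, e.g. with vm.overcommit_memory=2 (no overcommit).
        std::cout << "allocation failed" << std::endl;
        return 1;
    }

    // Touching the memory is what actually requests physical pages. If the
    // system cannot back them, the process is killed (SIGKILL from the OOM
    // killer) instead of an exception being thrown here.
    std::memset(p, 0, n * sizeof(double));

    delete[] p;
    return 0;
}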

Best
Joachim

On 4/3/19 6:51 PM, Jeff Hammond via discuss wrote:
> This error has nothing to do with MPI.
>
> These two lines are the problem:
>
> std::vector<std::vector<double> > a(100);
> a[i].resize(1000000000);
>
> You are trying to resize each of those vectors to 1e9 doubles, which is
> roughly 8 GB of memory per resize.
>
> Jeff
>
>
> On Wed, Apr 3, 2019 at 8:58 AM Zhou, Hui via discuss <discuss@mpich.org> wrote:
> >
> > I just tested on my Mac: the system kept popping up messages asking me to
> > force-kill some application to free up memory, but the code never crashed,
> > it just stalled. So I guess the out-of-memory behavior is operating-system
> > dependent. My guess is that when multiple processes compete for memory, it
> > may cause a race in the OS: the OS returns a valid address but discovers
> > it is out of memory later, because the memory has been taken by another
> > process.
> >
> > —
> > Hui Zhou
> > T: 630-252-3430
> >
> > On Apr 3, 2019, at 8:42 AM, Zhen Wang via discuss <discuss@mpich.org> wrote:
> >
> > Hi,
> >
> > I have difficulty catching std::bad_alloc in an MPI environment. The code
> > is attached. I'm using gcc 6.3 on SUSE Linux Enterprise Server 11 (x86_64).
> > mpich is built from source. The commands are as follows:
> >
> > Build
> > g++ -I<mpich-3.3-opt/include> -L<mpich-3.3-opt/lib> -lmpi memory.cpp
> >
> > Run
> > <mpich-3.3-opt/bin/mpiexec> -n 2 a.out
> >
> > Output
> > 0
> > 0
> > 1
> > 1
> >
> > ===================================================================================
> > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > = PID 16067 RUNNING AT <machine name>
> > = EXIT CODE: 9
> > = CLEANING UP REMAINING PROCESSES
> > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > ===================================================================================
> > YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
> > This typically refers to a problem with your application.
> > Please see the FAQ page for debugging suggestions
> >
> > If I uncomment the line //if (rank == 0), i.e., only rank 0 allocates
> > memory, I'm able to catch bad_alloc as I expected. It seems that I am
> > misunderstanding something. Could you please help? Thanks a lot.
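> >
> > In case the attachment doesn't come through, the code is essentially the
> > following (a rough sketch; memory.cpp is the authoritative version):
> >
> > #include <mpi.h>
> > #include <iostream>
> > #include <new>
> > #include <vector>
> >
> > int main(int argc, char** argv)
> > {
> >     MPI_Init(&argc, &argv);
> >     int rank;
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >     std::vector<std::vector<double> > a(100);
> >     //if (rank == 0)
> >     {
> >         try
> >         {
> >             for (int i = 0; i < 100; ++i)
> >             {
> >                 std::cout << i << std::endl;
> >                 a[i].resize(1000000000); // each resize asks for ~8 GB
> >             }
> >         }
> >         catch (const std::bad_alloc&)
> >         {
> >             std::cout << "caught bad_alloc" << std::endl;
> >         }
> >     }
> >
> >     MPI_Finalize();
> >     return 0;
> > }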
> >
> > Best regards,
> > Zhen
> > <memory.cpp>
>
> --
> Jeff Hammond
> jeff.science@gmail.com
> http://jeffhammond.github.io/
>

--
Dipl.-Inf. Joachim Protze

IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
protze@itc.rwth-aachen.de
www.itc.rwth-aachen.de

_______________________________________________
discuss mailing list discuss@mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss