[mpich-discuss] Error Running MPICH for Photochemical Modeling

Lu, Huiwei huiweilu at mcs.anl.gov
Sun Sep 21 08:48:12 CDT 2014


Abhishek,

Could you reduce the example to a single file of fewer than 300 lines of code that reproduces the problem?
We would like to help, but going through a tarball of 147,000 lines of Fortran code is too much effort for us.

Another thing you could try is running the application under valgrind to detect memory errors. Instructions for building MPICH with valgrind support are here:
http://wiki.mpich.org/mpich/index.php/Support_for_Debugging_Memory_Allocation
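
As a rough sketch of the valgrind suggestion, the script below builds an mpiexec command line that starts every rank under valgrind. It assumes valgrind is installed on each node and reuses the `nodes` machinefile, `$NUMPROCS`, and `$EXEC` from the command quoted later in this thread; the default values, the `valgrind_cmd` helper, and the log-file name are placeholders chosen for illustration, not part of the original report.

```shell
#!/bin/sh
# Placeholder defaults; override them in the environment for a real job.
NUMPROCS="${NUMPROCS:-8}"
EXEC="${EXEC:-./app}"

# Print the command that launches each MPI rank under valgrind.
# valgrind expands %p to the process ID, so every rank writes its own log.
valgrind_cmd() {
    echo "mpiexec -machinefile nodes -np $NUMPROCS valgrind --track-origins=yes --log-file=valgrind.%p.log $EXEC"
}

# Dry run by default: only print the command. Set RUN=1 to execute it.
if [ "${RUN:-0}" = "1" ]; then
    eval "$(valgrind_cmd)"
else
    valgrind_cmd
fi
```

Running under valgrind slows the application down considerably, so a short test case is preferable if one is available.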

(please don’t drop the discuss list)

Thanks,
—
Huiwei

On Sep 19, 2014, at 6:45 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:

> Huiwei And Sangmin,
> 
> Please find attached -
> 1. The compiled application
> 2. Source Code Tarball
> 3. The failed job
> 
> Any help is really appreciated 
> 
> Thank you
> Abhishek
> ................................................................................................................
> Abhishek Bhat, PhD, EPI,
> Senior Consultant
> 
> 
> -----Original Message-----
> From: Lu, Huiwei [mailto:huiweilu at mcs.anl.gov] 
> Sent: Friday, September 19, 2014 5:30 PM
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] Error Running MPICH for Photochemical Modeling
> 
> Hi, Abhishek,
> 
> As mentioned in the previous email, it looks like the problem lies in the application. Could you provide us with a minimal example that fails, so that we can look at the code and reproduce the problem on our machines?
> 
> Thanks,
> -
> Huiwei
> 
> On Sep 17, 2014, at 3:29 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:
> 
>> Sangmin,
>> 
>> Fatal error in MPI_Recv: A process has failed, error stack:
>> MPI_Recv(187).............: MPI_Recv(buf=0x7fff21bc04b0, count=644490, MPI_REAL, src=1, tag=14131, MPI_COMM_WORLD, status=0x7fff227c47a0) failed
>> dequeue_and_set_error(865): Communication error with rank 1
>> 
>> ===================================================================================
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> =   PID 4183 RUNNING AT dfw-camx
>> =   EXIT CODE: 1
>> =   CLEANING UP REMAINING PROCESSES
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>> [proxy:0:4 at dfw-camx-n4] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
>> [proxy:0:4 at dfw-camx-n4] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:4 at dfw-camx-n4] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>> [proxy:0:2 at dfw-camx-n2] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
>> [proxy:0:2 at dfw-camx-n2] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:2 at dfw-camx-n2] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>> [proxy:0:3 at dfw-camx-n3] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
>> [proxy:0:3 at dfw-camx-n3] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:3 at dfw-camx-n3] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>> [proxy:0:6 at dfw-camx-n6] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
>> [proxy:0:6 at dfw-camx-n6] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:6 at dfw-camx-n6] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>> [proxy:0:5 at dfw-camx-n5] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
>> [proxy:0:5 at dfw-camx-n5] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:5 at dfw-camx-n5] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>> [proxy:0:7 at dfw-camx-n7] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
>> [proxy:0:7 at dfw-camx-n7] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:7 at dfw-camx-n7] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>> [mpiexec at dfw-camx] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
>> [mpiexec at dfw-camx] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
>> [mpiexec at dfw-camx] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
>> [mpiexec at dfw-camx] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
>> 
>> This is what I got when running with -print-all-exitcodes.
>> 
>> 
>> Anything helpful??
>> 
>> Thank you very much for all help
>> Abhishek
>> 
>> ................................................................................................................
>> Abhishek Bhat, PhD, EPI,
>> Senior Consultant
>> 
>> 
>> From: Seo, Sangmin [mailto:sseo at anl.gov]
>> Sent: Wednesday, September 17, 2014 1:17 PM
>> To: Abhishek Bhat
>> Subject: Re: [mpich-discuss] Error Running MPICH for Photochemical Modeling
>> 
>> 
>> On Sep 17, 2014, at 1:08 PM, Abhishek Bhat <abhat at trinityconsultants.com> wrote:
>> 
>>> Sangmin,
>>> 
>>> What is the correct syntax for printing all exit codes? If I use
>>> 
>>> if( ! { mpiexec -machinefile nodes -np $NUMPROCS -print-all-exitcodes $EXEC -mpich-dbg=file -mpich-dbg-class=all -mpich-dbg-level=verbose } )
>> 
>> This is correct. The output will be shown on your terminal, not in a file, like this:
>> [mpiexec at host] Exit codes: [host] 0,0
>> 
>>> I am getting an error saying "-print-all-exitcodes" is not a valid local parameter.
>> 
>> Which version of MPICH are you using? Can you let me know the result of the following?
>> $ mpiexec -info
>> 
>> - Sangmin
>> 
>> 
>> _________________________________________________________________________
>> 
>> The information transmitted is intended only for the person or entity
>> to which it is addressed and may contain confidential and/or
>> privileged material. Any review, retransmission, dissemination or
>> other use of, or taking of any action in reliance upon, this
>> information by persons or entities other than the intended recipient
>> is prohibited. If you received this in error, please contact the
>> sender and delete the material from any computer.
>> _________________________________________________________________________
>> _______________________________________________
>> discuss mailing list     discuss at mpich.org
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/discuss
> 
> <Application.zip>
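
When reading the error output quoted above, the useful line is the one naming the peer that died (rank 1); the proxy and mpiexec messages after the banner are cleanup noise, as the banner itself says. A small sketch for pulling the reported ranks out of a saved job log follows. The `failed_ranks` helper is a hypothetical name, and it assumes the log contains the same "Communication error with rank N" wording shown in this thread.

```shell
#!/bin/sh
# Print the distinct ranks named in "Communication error with rank N" lines.
# Reads the log files given as arguments, or standard input if none are given.
failed_ranks() {
    sed -n 's/.*Communication error with rank \([0-9][0-9]*\).*/\1/p' "$@" | sort -un
}

# Example using the line quoted in this thread.
printf '%s\n' 'dequeue_and_set_error(865): Communication error with rank 1' | failed_ranks
```

The reported rank is the place to start looking, for example by checking that rank's own output or core file on its node.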

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

