[mpich-discuss] Fatal error in PMPI_Barrier: A process has failed, error stack:

Balaji, Pavan balaji at anl.gov
Thu Mar 27 11:05:36 CDT 2014


Tony,

You didn't say so explicitly, but I assume the problem exists even with two identical nodes?

What about the /etc/hosts files?  Are they consistent?

Did you try the options on this FAQ post:

http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes
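
In particular, a quick sanity check might look something like the following (svr and pc5 are the hostnames from your earlier mails; eth0 is just a placeholder for whatever interface actually connects the two boxes):

  # neither hostname should resolve to 127.0.0.1 on either machine
  getent hosts svr pc5
  ssh pc5 getent hosts svr pc5

  # the two /etc/hosts files should agree
  ssh pc5 cat /etc/hosts | diff /etc/hosts -

  # if the nodes have more than one interface, try pinning one with hydra
  mpiexec -iface eth0 -n 2 -f hosts ./examples/cpi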

  — Pavan

On Mar 27, 2014, at 10:14 AM, Tony Ladd <tladd at che.ufl.edu> wrote:

> Pavan
> 
> Same OS - but different hardware. So I tried the cpi example on two identical nodes (Dell Optiplex 745) this morning. The OS is CentOS 6.5 and the installation on these client nodes is entirely automated, so I am sure the configurations on the two boxes are identical (the install is new and the boxes have not been used so far). I used the version of cpi compiled during the installation of MPICH. Here is the log file.
> 
> I am also including the installation logs in case that helps - I have separate logs of the configure, make, and install stages.
> 
> Tony
> 
> 
> On 03/27/2014 02:02 AM, Balaji, Pavan wrote:
>> Are both the nodes similar in architecture and OS configuration?
>> 
>> Are the /etc/hosts files on both machines consistent?
>> 
>>   — Pavan
>> 
>> On Mar 26, 2014, at 9:01 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>> 
>>> Rajeev
>>> 
>>> There is a firewall on svr, but it's configured to accept all packets on the interface connected to the internal domain (where pc5 lives). I had already checked that turning iptables off made no difference, but I just tried it again with the cpi example. The result was the same.
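>>> 
>>> For reference, the firewall check on CentOS 6 was along these lines (run on both svr and pc5):
>>> 
>>>   service iptables status
>>>   service iptables stop     # temporarily, for the test
>>>   mpirun -n 2 -f hosts /global/usr/src/mpich-3.0.4/examples/cpi
>>>   service iptables start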
>>> 
>>> Tony
>>> 
>>> 
>>> On 03/26/2014 09:45 PM, Rajeev Thakur wrote:
>>>> Is there a firewall on either machine that is in the way of communication?
>>>> 
>>>> Rajeev
>>>> 
>>>> On Mar 26, 2014, at 8:28 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>>>> 
>>>>> No - you get the same error - it looks as if process 1 (on the remote node) is not starting
>>>>> 
>>>>> svr:tladd(netbench)> mpirun -n 2 -f hosts /global/usr/src/mpich-3.0.4/examples/cpi
>>>>> Process 0 of 2 is on svr.che.ufl.edu
>>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff30ecced8, rbuf=0x7fff30ecced0, count=1, MPI_DOUBLE,
>>>>> 
>>>>> But if I reverse the order in the hosts file (pc5 first and then svr), apparently both processes start:
>>>>> 
>>>>> svr:tladd(netbench)> mpirun -n 2 -f hosts /global/usr/src/mpich-3.0.4/examples/cpi
>>>>> Process 1 of 2 is on svr.che.ufl.edu
>>>>> Process 0 of 2 is on pc5
>>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4d776348, rbuf=0x7fff4d776340, count=1, MPI_DOUBLE,
>>>>> 
>>>>> But with the same result in the end.
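>>>>> 
>>>>> For reference, the hosts file is just the two machine names, one per line - in the reversed order it reads:
>>>>> 
>>>>>   pc5
>>>>>   svr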
>>>>> 
>>>>> Tony
>>>>> 
>>>>> 
>>>>> 
>>>>> On 03/26/2014 08:18 PM, Rajeev Thakur wrote:
>>>>>> Does the cpi example run across two machines?
>>>>>> 
>>>>>> Rajeev
>>>>>> 
>>>>>> On Mar 26, 2014, at 7:13 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>>>>>> 
>>>>>>> Rajeev
>>>>>>> 
>>>>>>> Sorry about that. I was switching back and forth between Open MPI and MPICH, but it does not make a difference. Here is a clean log from a fresh terminal - no mention of Open MPI.
>>>>>>> 
>>>>>>> Tony
>>>>>>> 
>>>>>>> PS - it's a CentOS 6.5 install - I should have mentioned that before.
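>>>>>>> 
>>>>>>> A quick way to confirm which MPI the shell is actually picking up:
>>>>>>> 
>>>>>>>   which mpirun mpicc
>>>>>>>   mpirun --version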
>>>>>>> 
>>>>>>> <mpich.log>
> 
> -- 
> Tony Ladd
> 
> Chemical Engineering Department
> University of Florida
> Gainesville, Florida 32611-6005
> USA
> 
> Email: tladd-"(AT)"-che.ufl.edu
> Web    http://ladd.che.ufl.edu
> 
> Tel:   (352)-392-6509
> FAX:   (352)-392-9514
> 
> <mpich.log><config.log><configure.log><install.log><make.log>



