[mpich-discuss] Fatal error in PMPI_Barrier: A process has failed, error stack:

Tony Ladd tladd at che.ufl.edu
Thu Mar 27 12:12:35 CDT 2014


Pavan

Unfortunately yes.

The remote host information comes from the server via NIS, so each node's
/etc/hosts contains only its local entries (see below). I have had this
working for years, including with mpich. Just to be sure, I made a new
/etc/hosts for each node with both nodes in it, but the error still
occurred. Besides, I can ssh back and forth (without a password) using the
hostnames, so I don't think this is the problem.
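
Passwordless ssh by hostname shows that each node can resolve the other's name. It does not show what each node's own hostname resolves to. The following is a minimal Python sketch for checking that on each node. The helper names (`resolved_addresses`, `resolves_to_loopback`) are illustrative and not part of mpich, and whether loopback resolution plays any role in this failure is an assumption, not something established in this thread.

```python
import ipaddress
import socket


def resolved_addresses(name, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses that `name` resolves to."""
    infos = resolver(name, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    # Strip any IPv6 scope suffix such as "%eth0" so the address parses cleanly.
    return sorted({info[4][0].split("%")[0] for info in infos})


def resolves_to_loopback(name, resolver=socket.getaddrinfo):
    """True if any address that `name` resolves to is a loopback address."""
    return any(
        ipaddress.ip_address(addr).is_loopback
        for addr in resolved_addresses(name, resolver)
    )


if __name__ == "__main__":
    host = socket.gethostname()
    try:
        print(host, resolved_addresses(host), resolves_to_loopback(host))
    except socket.gaierror as exc:
        print(f"could not resolve {host}: {exc}")
```

Run on each node, this prints the local hostname, the addresses the system resolver returns for it, and whether any of them is a loopback address. The `resolver` parameter exists only so the check can be exercised without touching the real resolver.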

I tried everything in the FAQ before I posted.

Tony


# Host table for ladd domain

127.0.0.1       pc5 localhost

192.168.1.105   pc5.ladd pc5            # Localhost

192.168.1.1     svr.ladd svr            # Server

Output from ypcat -k hosts. Note: nsswitch has the line
hosts:      files nis dns

svr.che.ufl.edu 10.227.108.60   svr.che.ufl.edu
pc4 192.168.1.104   pc4.ladd pc4
pc6.ladd 192.168.1.106   pc6.ladd pc6
svr.ladd 192.168.1.1     svr.ladd svr
pc9 192.168.1.109   pc9.ladd pc9
pc7 192.168.1.107   pc7.ladd pc7
pc3 192.168.1.103   pc3.ladd pc3
pc1 192.168.1.101   pc1.ladd pc1
pc5.ladd 192.168.1.105   pc5.ladd pc5
localhost 127.0.0.1       svr localhost
pc3.ladd 192.168.1.103   pc3.ladd pc3
pc8 192.168.1.108   pc8.ladd pc8
pc6 192.168.1.106   pc6.ladd pc6
checs 10.227.121.221  checs.che.ufl.edu checs
pc2 192.168.1.102   pc2.ladd pc2
prn.ladd 192.168.1.11    prn.ladd prn
pc8.ladd 192.168.1.108   pc8.ladd pc8
svr 192.168.1.1     svr.ladd svr
pc4.ladd 192.168.1.104   pc4.ladd pc4
pc2.ladd 192.168.1.102   pc2.ladd pc2
pc5 192.168.1.105   pc5.ladd pc5
pc9.ladd 192.168.1.109   pc9.ladd pc9
pc7.ladd 192.168.1.107   pc7.ladd pc7
checs.che.ufl.edu 10.227.121.221  checs.che.ufl.edu checs
pc1.ladd 192.168.1.101   pc1.ladd pc1
prn 192.168.1.11    prn.ladd prn
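
The tables above can also be scanned mechanically. The following is a minimal Python sketch that lists every name other than `localhost` that a hosts-format table maps to a loopback address. The function name `loopback_aliases` and its `skip_key` parameter are hypothetical, introduced only for this illustration. That such entries matter for this particular failure is an assumption; the sketch only reports what the table contains.

```python
import ipaddress


def loopback_aliases(hosts_text, skip_key=False):
    """Return (address, name) pairs where a non-localhost name maps to loopback.

    hosts_text uses the /etc/hosts layout: "IP name [alias ...]".
    Pass skip_key=True for `ypcat -k hosts` output, where each line
    starts with the NIS key before the address.
    """
    flagged = []
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if skip_key:
            fields = fields[1:]
        if len(fields) < 2:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not an address line
        if not addr.is_loopback:
            continue
        for name in fields[1:]:
            if not name.startswith("localhost"):
                flagged.append((str(addr), name))
    return flagged


# The /etc/hosts table for pc5 as posted above.
HOSTS = """\
127.0.0.1       pc5 localhost
192.168.1.105   pc5.ladd pc5            # Localhost
192.168.1.1     svr.ladd svr            # Server
"""

print(loopback_aliases(HOSTS))
```

For the posted pc5 table this reports the `pc5` entry on the 127.0.0.1 line. With `skip_key=True`, the same function can be applied to the `ypcat -k hosts` listing.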




On 03/27/2014 12:05 PM, Balaji, Pavan wrote:
> Tony,
>
> You didn’t quite mention it, but I assume you meant the problem exists even with two identical nodes?
>
> What about the /etc/hosts files?  Are they consistent?
>
> Did you try the options on this FAQ post:
>
> http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes
>
>    — Pavan
>
> On Mar 27, 2014, at 10:14 AM, Tony Ladd <tladd at che.ufl.edu> wrote:
>
>> Pavan
>>
>> Same OS - but it was different hardware. So I tried the cpi example on two identical nodes (Dell Optiplex 745) this morning. The OS is CentOS 6.5, and the installation on these client nodes is entirely automated, so I am sure the configurations on the two boxes are identical (the install is new and the boxes have not been used so far). I used the version of cpi compiled during the installation of mpich. Here is the log file.
>>
>> I am also including the installation logs in case that helps - I have separate logs of the configure, make, and install stages.
>>
>> Tony
>>
>>
>> On 03/27/2014 02:02 AM, Balaji, Pavan wrote:
>>> Are both the nodes similar in architecture and OS configuration?
>>>
>>> Are the /etc/hosts files on both machines consistent?
>>>
>>>    — Pavan
>>>
>>> On Mar 26, 2014, at 9:01 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>>>
>>>> Rajeev
>>>>
>>>> There is a firewall on svr, but it's configured to accept all packets on the interface connected to the internal domain (where pc5 lives). I had already checked that turning iptables off made no difference, but I just tried it again on the cpi example. The result was the same.
>>>>
>>>> Tony
>>>>
>>>>
>>>> On 03/26/2014 09:45 PM, Rajeev Thakur wrote:
>>>>> Is there a firewall on either machine that is in the way of communication?
>>>>>
>>>>> Rajeev
>>>>>
>>>>> On Mar 26, 2014, at 8:28 PM, Tony Ladd <tladd at che.ufl.edu>
>>>>>   wrote:
>>>>>
>>>>>> No - you get the same error - it looks as if process 1 (on the remote node) is not starting
>>>>>>
>>>>>> svr:tladd(netbench)> mpirun -n 2 -f hosts /global/usr/src/mpich-3.0.4/examples/cpi
>>>>>> Process 0 of 2 is on svr.che.ufl.edu
>>>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff30ecced8, rbuf=0x7fff30ecced0, count=1, MPI_DOUBLE,
>>>>>>
>>>>>> But if I reverse the order in the host file (pc5 first and then svr) apparently both processes start
>>>>>>
>>>>>> svr:tladd(netbench)> mpirun -n 2 -f hosts /global/usr/src/mpich-3.0.4/examples/cpi
>>>>>> Process 1 of 2 is on svr.che.ufl.edu
>>>>>> Process 0 of 2 is on pc5
>>>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4d776348, rbuf=0x7fff4d776340, count=1, MPI_DOUBLE,
>>>>>>
>>>>>> But with the same result in the end.
>>>>>>
>>>>>> Tony
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 03/26/2014 08:18 PM, Rajeev Thakur wrote:
>>>>>>> Does the cpi example run across two machines?
>>>>>>>
>>>>>>> Rajeev
>>>>>>>
>>>>>>> On Mar 26, 2014, at 7:13 PM, Tony Ladd <tladd at che.ufl.edu>
>>>>>>>   wrote:
>>>>>>>
>>>>>>>> Rajeev
>>>>>>>>
>>>>>>>> Sorry about that. I was switching back and forth from openmpi to mpich. But it does not make a difference. Here is a clean log from a fresh terminal - no mention of openmpi
>>>>>>>>
>>>>>>>> Tony
>>>>>>>>
>>>>>>>> PS - it's a CentOS 6.5 install - I should have mentioned it before.
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Tony Ladd
>>>>>>>>
>>>>>>>> Chemical Engineering Department
>>>>>>>> University of Florida
>>>>>>>> Gainesville, Florida 32611-6005
>>>>>>>> USA
>>>>>>>>
>>>>>>>> Email: tladd-"(AT)"-che.ufl.edu
>>>>>>>> Web    http://ladd.che.ufl.edu
>>>>>>>>
>>>>>>>> Tel:   (352)-392-6509
>>>>>>>> FAX:   (352)-392-9514
>>>>>>>>
>>>>>>>> <mpich.log>_______________________________________________
>>>>>>>> discuss mailing list     discuss at mpich.org
>>>>>>>> To manage subscription options or unsubscribe:
>>>>>>>> https://lists.mpich.org/mailman/listinfo/discuss
>> <mpich.log><config.log><configure.log><install.log><make.log>

-- 
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web    http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514



