[mpich-discuss] Fatal error in PMPI_Barrier: A process has failed, error stack:

Tony Ladd tladd at che.ufl.edu
Thu Mar 27 12:45:55 CDT 2014


Ken

Very clever - that worked. Thanks to both of you. I found by trial and
error that it is the remote machine (pc6 in this case) that needs to
have its FQDN specified, not the local one, which I guess makes sense.
The logs (with HYDRA_DEBUG=1) show that with the FQDN it finds the IP of
the remote machine, but with the short name it does not.
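
For anyone who wants to compare the two names outside of MPI, a small
check against the system resolver may help. This is only a sketch: it
is not what Hydra runs internally, and the function names
resolved_addresses and resolves_to_loopback are made up for this
example.

```python
import ipaddress
import socket


def resolved_addresses(name):
    """Return the sorted set of IP addresses the system resolver gives for name."""
    infos = socket.getaddrinfo(name, None)
    # Each entry ends with a sockaddr tuple whose first field is the address.
    return sorted({info[4][0] for info in infos})


def resolves_to_loopback(name):
    """Return True if any address returned for name is a loopback address."""
    return any(
        # Strip an IPv6 scope suffix such as "%eth0" before parsing.
        ipaddress.ip_address(addr.split("%")[0]).is_loopback
        for addr in resolved_addresses(name)
    )
```

Running resolves_to_loopback("pc6") and resolves_to_loopback("pc6.ladd")
on pc6 itself would show whether the short name and the FQDN really
resolve differently there. A name that does not resolve at all raises
socket.gaierror.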

For my education, can you explain what "Communication with that machine
is being bound to the localhost interface from the look of it" means?
What mechanism does mpich2 use to get the IP of the remote host? Nothing
else I try (ping, ssh) cares whether I use the short name or the FQDN.
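
One possible reading of Ken's remark, offered as a sketch under
assumptions rather than a description of MPICH internals: with
"hosts: files nis dns" in nsswitch.conf, /etc/hosts is consulted before
NIS. In the host table quoted further down, the short name pc5 is listed
on the 127.0.0.1 line, while pc5.ladd appears only on the 192.168.1.105
line. The hypothetical helper lookup below models a simple
first-matching-line rule on that table; real resolvers may behave
differently, and the thread does not show the corresponding file on pc6.

```python
# Host table copied from the quoted /etc/hosts on pc5.
HOSTS_TEXT = """\
# Host table for ladd domain
127.0.0.1       pc5 localhost
192.168.1.105   pc5.ladd pc5            # Localhost
192.168.1.1     svr.ladd svr            # Server
"""


def lookup(name, hosts_text):
    """Return the address on the first hosts-file line that lists name, else None."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None


print(lookup("pc5", HOSTS_TEXT))       # short name matches the loopback line first
print(lookup("pc5.ladd", HOSTS_TEXT))  # FQDN is listed only on the LAN address line
```

Under that assumption, a process that looks up its own short name would
get a loopback address that other machines cannot reach, which would fit
the symptom of the FQDN working where the short name fails.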

Thanks again.

Tony


On 03/27/2014 01:17 PM, Kenneth Raffenetti wrote:
> Try using pc5.ladd (or another fully qualified name) instead of pc5 in 
> your MPI hosts file. Communication with that machine is being bound to 
> the localhost interface from the look of it.
>
> Ken
>
> On 03/27/2014 12:12 PM, Tony Ladd wrote:
>> Pavan
>>
>> Unfortunately yes.
>>
>> The remote hosts info comes from the server via NIS so each node's
>> /etc/hosts just has its local info in it (see below). I have had this
>> working for years including with mpich. Just to be sure I made a new
>> /etc/hosts for each node with both nodes in it. But the error still
>> occurred. Besides I can ssh back and forth (without password) using the
>> hostnames so I don't think this is the problem.
>>
>> I tried everything in the FAQ before I posted.
>>
>> Tony
>>
>>
>> # Host table for ladd domain
>> 127.0.0.1       pc5 localhost
>> 192.168.1.105   pc5.ladd pc5            # Localhost
>> 192.168.1.1     svr.ladd svr            # Server
>>
>> Output from ypcat -k hosts. Note: nsswitch.conf has the line
>> hosts:      files nis dns
>>
>> svr.che.ufl.edu 10.227.108.60   svr.che.ufl.edu
>> pc4 192.168.1.104   pc4.ladd pc4
>> pc6.ladd 192.168.1.106   pc6.ladd pc6
>> svr.ladd 192.168.1.1     svr.ladd svr
>> pc9 192.168.1.109   pc9.ladd pc9
>> pc7 192.168.1.107   pc7.ladd pc7
>> pc3 192.168.1.103   pc3.ladd pc3
>> pc1 192.168.1.101   pc1.ladd pc1
>> pc5.ladd 192.168.1.105   pc5.ladd pc5
>> localhost 127.0.0.1       svr localhost
>> pc3.ladd 192.168.1.103   pc3.ladd pc3
>> pc8 192.168.1.108   pc8.ladd pc8
>> pc6 192.168.1.106   pc6.ladd pc6
>> checs 10.227.121.221  checs.che.ufl.edu checs
>> pc2 192.168.1.102   pc2.ladd pc2
>> prn.ladd 192.168.1.11    prn.ladd prn
>> pc8.ladd 192.168.1.108   pc8.ladd pc8
>> svr 192.168.1.1     svr.ladd svr
>> pc4.ladd 192.168.1.104   pc4.ladd pc4
>> pc2.ladd 192.168.1.102   pc2.ladd pc2
>> pc5 192.168.1.105   pc5.ladd pc5
>> pc9.ladd 192.168.1.109   pc9.ladd pc9
>> pc7.ladd 192.168.1.107   pc7.ladd pc7
>> checs.che.ufl.edu 10.227.121.221  checs.che.ufl.edu checs
>> pc1.ladd 192.168.1.101   pc1.ladd pc1
>> prn 192.168.1.11    prn.ladd prn
>>
>>
>>
>>
>> On 03/27/2014 12:05 PM, Balaji, Pavan wrote:
>>> Tony,
>>>
>>> You didn’t quite mention it, but I assume you meant the problem exists
>>> even with two identical nodes?
>>>
>>> What about the /etc/hosts files?  Are they consistent?
>>>
>>> Did you try the options on this FAQ post:
>>>
>>> http://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:_My_MPI_program_aborts_with_an_error_saying_it_cannot_communicate_with_other_processes 
>>>
>>>
>>>
>>>    — Pavan
>>>
>>> On Mar 27, 2014, at 10:14 AM, Tony Ladd <tladd at che.ufl.edu> wrote:
>>>
>>>> Pavan
>>>>
>>>> Same OS - but it was different hardware. So I tried the cpi example
>>>> on two identical nodes (Dell Optiplex 745) this morning. The OS is
>>>> CentOS 6.5 and the installation on these client nodes is entirely
>>>> automated so I am sure the configurations on the two boxes are
>>>> identical (the install is new and the boxes have not been used so
>>>> far). I used the version of cpi compiled during the installation of
>>>> mpich. Here is the log file.
>>>>
>>>> I am also including the installation logs in case that helps - I have
>>>> separate logs of the configure, make, and install stages.
>>>>
>>>> Tony
>>>>
>>>>
>>>> On 03/27/2014 02:02 AM, Balaji, Pavan wrote:
>>>>> Are both the nodes similar in architecture and OS configuration?
>>>>>
>>>>> Are the /etc/hosts files on both machines consistent?
>>>>>
>>>>>    — Pavan
>>>>>
>>>>> On Mar 26, 2014, at 9:01 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>>>>>
>>>>>> Rajeev
>>>>>>
>>>>>> There is a firewall on svr, but it's configured to accept all packets
>>>>>> on the interface connected to the internal domain (where pc5
>>>>>> lives). I had already checked that turning iptables off made no
>>>>>> difference, but I just tried it again on the cpi example. The
>>>>>> result was the same.
>>>>>>
>>>>>> Tony
>>>>>>
>>>>>>
>>>>>> On 03/26/2014 09:45 PM, Rajeev Thakur wrote:
>>>>>>> Is there a firewall on either machine that is in the way of
>>>>>>> communication?
>>>>>>>
>>>>>>> Rajeev
>>>>>>>
>>>>>>> On Mar 26, 2014, at 8:28 PM, Tony Ladd <tladd at che.ufl.edu>
>>>>>>>   wrote:
>>>>>>>
>>>>>>>> No - you get the same error - it looks as if process 1 (on the
>>>>>>>> remote node) is not starting
>>>>>>>>
>>>>>>>> svr:tladd(netbench)> mpirun -n 2 -f hosts /global/usr/src/mpich-3.0.4/examples/cpi
>>>>>>>> Process 0 of 2 is on svr.che.ufl.edu
>>>>>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>>>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff30ecced8,
>>>>>>>> rbuf=0x7fff30ecced0, count=1, MPI_DOUBLE,
>>>>>>>>
>>>>>>>> But if I reverse the order in the host file (pc5 first and then
>>>>>>>> svr) apparently both processes start
>>>>>>>>
>>>>>>>> svr:tladd(netbench)> mpirun -n 2 -f hosts /global/usr/src/mpich-3.0.4/examples/cpi
>>>>>>>> Process 1 of 2 is on svr.che.ufl.edu
>>>>>>>> Process 0 of 2 is on pc5
>>>>>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>>>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4d776348,
>>>>>>>> rbuf=0x7fff4d776340, count=1, MPI_DOUBLE,
>>>>>>>>
>>>>>>>> But with the same result in the end.
>>>>>>>>
>>>>>>>> Tony
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 03/26/2014 08:18 PM, Rajeev Thakur wrote:
>>>>>>>>> Does the cpi example run across two machines?
>>>>>>>>>
>>>>>>>>> Rajeev
>>>>>>>>>
>>>>>>>>> On Mar 26, 2014, at 7:13 PM, Tony Ladd <tladd at che.ufl.edu>
>>>>>>>>>   wrote:
>>>>>>>>>
>>>>>>>>>> Rajeev
>>>>>>>>>>
>>>>>>>>>> Sorry about that. I was switching back and forth from openmpi
>>>>>>>>>> to mpich. But it does not make a difference. Here is a clean
>>>>>>>>>> log from a fresh terminal - no mention of openmpi
>>>>>>>>>>
>>>>>>>>>> Tony
>>>>>>>>>>
>>>>>>>>>> PS - it's a CentOS 6.5 install - should have mentioned it before.
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> Tony Ladd
>>>>>>>>>>
>>>>>>>>>> Chemical Engineering Department
>>>>>>>>>> University of Florida
>>>>>>>>>> Gainesville, Florida 32611-6005
>>>>>>>>>> USA
>>>>>>>>>>
>>>>>>>>>> Email: tladd-"(AT)"-che.ufl.edu
>>>>>>>>>> Web    http://ladd.che.ufl.edu
>>>>>>>>>>
>>>>>>>>>> Tel:   (352)-392-6509
>>>>>>>>>> FAX:   (352)-392-9514
>>>>>>>>>>
>>>>>>>>>> <mpich.log>_______________________________________________
>>>>>>>>>> discuss mailing list     discuss at mpich.org
>>>>>>>>>> To manage subscription options or unsubscribe:
>>>>>>>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>> <mpich.log><config.log><configure.log><install.log><make.log>

-- 
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web    http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpich.log
Type: text/x-log
Size: 14801 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140327/50f05edf/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpichOK.log
Type: text/x-log
Size: 16202 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140327/50f05edf/attachment-0001.bin>

