[mpich-discuss] Fatal error in PMPI_Barrier: A process has failed, error stack:

Tony Ladd tladd at che.ufl.edu
Thu Mar 27 10:14:29 CDT 2014


Pavan

Same OS - but it was different hardware. So I tried the cpi example on 
two identical nodes (Dell Optiplex 745) this morning. The OS is CentOS 
6.5 and the installation on these client nodes is entirely automated so 
I am sure the configurations on the two boxes are identical (the install 
is new and the boxes have not been used so far). I used the version of 
cpi compiled during the installation of mpich. Here is the log file.
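
One way to double-check that two nodes agree on name resolution (the /etc/hosts question quoted further down) is sketched here. This is only an illustration under assumptions: the actual /etc/hosts files were not posted, so the file contents and the 192.168.0.x addresses in the demo are made up, and the helper names parse_hosts and compare_hosts are hypothetical. It also flags a cluster hostname mapped to a loopback address, which is a commonly reported source of trouble for multi-node MPI runs; nothing in the thread confirms that this is the cause here.

```python
import ipaddress


def parse_hosts(text):
    """Map each hostname/alias in /etc/hosts-style text to its address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first entry seen for a name, as resolvers usually do.
            mapping.setdefault(name, address)
    return mapping


def is_loopback(address):
    """True if address parses as a loopback IP (127.0.0.0/8 or ::1)."""
    try:
        return ipaddress.ip_address(address).is_loopback
    except ValueError:
        return False  # not a plain IP literal; ignore it


def compare_hosts(text_a, text_b, cluster_names):
    """Return a list of human-readable problems for the given hostnames."""
    a, b = parse_hosts(text_a), parse_hosts(text_b)
    problems = []
    for name in cluster_names:
        addr_a, addr_b = a.get(name), b.get(name)
        if addr_a != addr_b:
            problems.append(f"{name}: {addr_a} on node A but {addr_b} on node B")
        for label, addr in (("A", addr_a), ("B", addr_b)):
            if addr and is_loopback(addr):
                problems.append(f"{name}: maps to loopback {addr} on node {label}")
    return problems


if __name__ == "__main__":
    # Made-up file contents for illustration; the real files are unknown.
    hosts_a = "127.0.0.1 localhost svr.che.ufl.edu\n192.168.0.5 pc5\n"
    hosts_b = "127.0.0.1 localhost\n192.168.0.1 svr.che.ufl.edu\n192.168.0.5 pc5\n"
    for problem in compare_hosts(hosts_a, hosts_b, ["svr.che.ufl.edu", "pc5"]):
        print(problem)
```

In practice the two text arguments would be the contents of /etc/hosts read on each node, and an empty result only says the listed names agree, not that the network path between the nodes works.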

I am also including the installation logs in case that helps - I have 
separate logs of the configure, make, and install stages.

Tony
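
For the firewall question quoted below, a quick TCP reachability probe can complement stopping iptables. The following is a minimal sketch, not something that was run in this thread; the function name can_connect is hypothetical, and the demo only talks to a throwaway listener on the local machine so that it is self-contained.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Self-contained demo against a temporary listener on this machine.
    with socket.socket() as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        print("listener reachable:", can_connect("127.0.0.1", port))
```

On the cluster this would be called from svr with the remote node's name, for example can_connect("pc5", 22), assuming the processes are launched over ssh on its usual port. A successful connect on one port does not show that the ports MPICH picks at run time are open in both directions.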


On 03/27/2014 02:02 AM, Balaji, Pavan wrote:
> Are both the nodes similar in architecture and OS configuration?
>
> Are the /etc/hosts files on both machines consistent?
>
>    — Pavan
>
> On Mar 26, 2014, at 9:01 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>
>> Rajeev
>>
>> There is a firewall on svr, but it's configured to accept all packets on the interface connected to the internal domain (where pc5 lives). I had already checked that stopping iptables made no difference, but I just tried it again on the cpi example. The result was the same.
>>
>> Tony
>>
>>
>> On 03/26/2014 09:45 PM, Rajeev Thakur wrote:
>>> Is there a firewall on either machine that is in the way of communication?
>>>
>>> Rajeev
>>>
>>> On Mar 26, 2014, at 8:28 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>>>
>>>> No - you get the same error - it looks as if process 1 (on the remote node) is not starting
>>>>
>>>> svr:tladd(netbench)> mpirun -n 2 -f hosts /global/usr/src/mpich-3.0.4/examples/cpi
>>>> Process 0 of 2 is on svr.che.ufl.edu
>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff30ecced8, rbuf=0x7fff30ecced0, count=1, MPI_DOUBLE,
>>>>
>>>> But if I reverse the order in the host file (pc5 first and then svr) apparently both processes start
>>>>
>>>> svr:tladd(netbench)> mpirun -n 2 -f hosts /global/usr/src/mpich-3.0.4/examples/cpi
>>>> Process 1 of 2 is on svr.che.ufl.edu
>>>> Process 0 of 2 is on pc5
>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4d776348, rbuf=0x7fff4d776340, count=1, MPI_DOUBLE,
>>>>
>>>> But with the same result in the end.
>>>>
>>>> Tony
>>>>
>>>>
>>>>
>>>> On 03/26/2014 08:18 PM, Rajeev Thakur wrote:
>>>>> Does the cpi example run across two machines?
>>>>>
>>>>> Rajeev
>>>>>
>>>>> On Mar 26, 2014, at 7:13 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>>>>>
>>>>>> Rajeev
>>>>>>
>>>>>> Sorry about that. I was switching back and forth between Open MPI and MPICH. But it does not make a difference. Here is a clean log from a fresh terminal, with no mention of Open MPI.
>>>>>>
>>>>>> Tony
>>>>>>
>>>>>> PS - it's a CentOS 6.5 install - I should have mentioned it before.
>>>>>>
>>>>>> -- 
>>>>>> Tony Ladd
>>>>>>
>>>>>> Chemical Engineering Department
>>>>>> University of Florida
>>>>>> Gainesville, Florida 32611-6005
>>>>>> USA
>>>>>>
>>>>>> Email: tladd-"(AT)"-che.ufl.edu
>>>>>> Web    http://ladd.che.ufl.edu
>>>>>>
>>>>>> Tel:   (352)-392-6509
>>>>>> FAX:   (352)-392-9514
>>>>>>
>>>>>> <mpich.log>
>>>>>> _______________________________________________
>>>>>> discuss mailing list     discuss at mpich.org
>>>>>> To manage subscription options or unsubscribe:
>>>>>> https://lists.mpich.org/mailman/listinfo/discuss

-- 
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web    http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514

-------------- next part --------------
Non-text attachments scrubbed by the list archive (all of type text/x-log):

Name: mpich.log      Size: 17284 bytes
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140327/626e605e/attachment.bin>

Name: config.log     Size: 537816 bytes
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140327/626e605e/attachment-0001.bin>

Name: configure.log  Size: 79329 bytes
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140327/626e605e/attachment-0002.bin>

Name: install.log    Size: 53109 bytes
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140327/626e605e/attachment-0003.bin>

Name: make.log       Size: 74238 bytes
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140327/626e605e/attachment-0004.bin>

