[mpich-discuss] Fatal error in PMPI_Barrier: A process has failed, error stack:
Tony Ladd
tladd at che.ufl.edu
Thu Mar 27 10:26:12 CDT 2014
Pavan
I did wonder if my environment (the Python stuff) might be causing a
problem, so I tried again with a minimal .bashrc. Same problem, but the
log file is a bit cleaner (and there is no firewall on either of these
nodes).
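For what it's worth, the environment check I mean is roughly this (pc5
is the client node; the prompt just mirrors the transcripts below):

svr:tladd(~)> ssh pc5 'echo $PATH; which mpiexec'

If the remote non-interactive shell still resolves the same mpich-3.0.4
install with the stripped-down .bashrc, the environment should be out
of the picture.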
Tony
On 03/27/2014 11:14 AM, Tony Ladd wrote:
> Pavan
>
> Same OS, but different hardware. So I tried the cpi example on two
> identical nodes (Dell OptiPlex 745s) this morning. The OS is CentOS
> 6.5, and the installation on these client nodes is entirely automated,
> so I am sure the configurations on the two boxes are identical (the
> install is new and the boxes have not been used so far). I used the
> version of cpi compiled during the installation of MPICH. Here is the
> log file.
>
> I am also including the installation logs in case that helps - I have
> separate logs of the configure, make, and install stages.
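> In case it helps to match them up, the three logs were captured with
> roughly the usual MPICH recipe (the prefix below is illustrative, not
> the exact path):
>
> ./configure --prefix=/opt/mpich-3.0.4 2>&1 | tee c.txt
> make 2>&1 | tee m.txt
> make install 2>&1 | tee mi.txt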
>
> Tony
>
>
> On 03/27/2014 02:02 AM, Balaji, Pavan wrote:
>> Are both the nodes similar in architecture and OS configuration?
>>
>> Are the /etc/hosts files on both machines consistent?
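>> (Consistent meaning something like this on both nodes, with
>> illustrative addresses, and with neither hostname bound to 127.0.0.1:
>>
>> 192.168.1.1   svr.che.ufl.edu   svr
>> 192.168.1.5   pc5.che.ufl.edu   pc5
>>
>> A hostname that resolves to the loopback address is a common cause of
>> exactly this kind of multi-node failure.)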
>>
>> — Pavan
>>
>> On Mar 26, 2014, at 9:01 PM, Tony Ladd <tladd at che.ufl.edu> wrote:
>>
>>> Rajeev
>>>
>>> There is a firewall on svr, but it is configured to accept all packets
>>> on the interface connected to the internal domain (where pc5 lives).
>>> I had already checked that stopping iptables made no difference,
>>> but I just tried it again with the cpi example. The result was the same.
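>>> (Concretely, the check was along these lines on each node - commands
>>> approximate:
>>>
>>> service iptables stop     # CentOS 6 init script
>>> iptables -L -n            # all chains should show policy ACCEPT, no rules
>>>
>>> and cpi failed the same way with the firewall down.)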
>>>
>>> Tony
>>>
>>>
>>> On 03/26/2014 09:45 PM, Rajeev Thakur wrote:
>>>> Is there a firewall on either machine that is in the way of
>>>> communication?
>>>>
>>>> Rajeev
>>>>
>>>> On Mar 26, 2014, at 8:28 PM, Tony Ladd <tladd at che.ufl.edu>
>>>> wrote:
>>>>
>>>>> No - you get the same error. It looks as if process 1 (on the
>>>>> remote node) is not starting:
>>>>>
>>>>> svr:tladd(netbench)> mpirun -n 2 -f hosts
>>>>> /global/usr/src/mpich-3.0.4/examples/cpi
>>>>> Process 0 of 2 is on svr.che.ufl.edu
>>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff30ecced8,
>>>>> rbuf=0x7fff30ecced0, count=1, MPI_DOUBLE,
>>>>>
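>>>>> (One thing worth trying at this point: hydra can launch a non-MPI
>>>>> program, so
>>>>>
>>>>> mpirun -n 2 -f hosts hostname
>>>>>
>>>>> would show whether the launcher can even start a process on the
>>>>> remote node, separately from any MPI communication.)
>>>>>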
>>>>> But if I reverse the order in the host file (pc5 first and then
>>>>> svr), apparently both processes start:
>>>>>
>>>>> svr:tladd(netbench)> mpirun -n 2 -f hosts
>>>>> /global/usr/src/mpich-3.0.4/examples/cpi
>>>>> Process 1 of 2 is on svr.che.ufl.edu
>>>>> Process 0 of 2 is on pc5
>>>>> Fatal error in PMPI_Reduce: A process has failed, error stack:
>>>>> PMPI_Reduce(1217)...............: MPI_Reduce(sbuf=0x7fff4d776348,
>>>>> rbuf=0x7fff4d776340, count=1, MPI_DOUBLE,
>>>>>
>>>>> But with the same result in the end.
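>>>>> (For completeness, the hosts file is nothing exotic - just the two
>>>>> names, one per line, in the order described:
>>>>>
>>>>> pc5
>>>>> svr
>>>>>
>>>>> And since process 0 does start on pc5 in this run, ssh from svr to
>>>>> pc5 is evidently working.)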
>>>>>
>>>>> Tony
>>>>>
>>>>>
>>>>>
>>>>> On 03/26/2014 08:18 PM, Rajeev Thakur wrote:
>>>>>> Does the cpi example run across two machines?
>>>>>>
>>>>>> Rajeev
>>>>>>
>>>>>> On Mar 26, 2014, at 7:13 PM, Tony Ladd <tladd at che.ufl.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> Rajeev
>>>>>>>
>>>>>>> Sorry about that. I was switching back and forth between Open MPI
>>>>>>> and MPICH, but it does not make a difference. Here is a clean log
>>>>>>> from a fresh terminal - no mention of Open MPI.
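>>>>>>> (To confirm which stack the fresh terminal picks up, something
>>>>>>> like
>>>>>>>
>>>>>>> which mpirun
>>>>>>> mpirun --version
>>>>>>>
>>>>>>> should point at the mpich-3.0.4 install rather than Open MPI.)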
>>>>>>>
>>>>>>> Tony
>>>>>>>
>>>>>>> PS - it's a CentOS 6.5 install - should have mentioned it before.
>>>>>>>
--
Tony Ladd
Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA
Email: tladd-"(AT)"-che.ufl.edu
Web http://ladd.che.ufl.edu
Tel: (352)-392-6509
FAX: (352)-392-9514
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpich.log
Type: text/x-log
Size: 14801 bytes
Desc: not available
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20140327/25b1b827/attachment.bin>