[mpich-discuss] MPI_init very slow with more than 3 nodes

Bixente BODO GOMEZ bixente.bodo at ehu.es
Mon Dec 2 11:38:27 CST 2013


Hello.

The network team reviewed the switch logs and found no problems.
I removed the NFS server from the machine file, and the delays still
appeared with two working nodes.
Today's delays ranged from 4 minutes to over an hour.
The attachment contains a syslog excerpt that appeared several times
on one node when the delay was long.  I don't know whether it helps.

Bixente.

Pavan Balaji <balaji at mcs.anl.gov> wrote:

> Try to ssh between the nodes and see how long it takes.  It might  
> give some hint on what’s going on.
>
>   — Pavan
>
> On Nov 29, 2013, at 5:21 AM, Bixente BODO GOMEZ <bixente.bodo at ehu.es> wrote:
>
>> Hi.
>>
>> I have attached the tests I ran.  Yesterday I had to wait 20
>> minutes; now only 4 to 6.
>> I will ask about the network.
>>
>>
>> Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>
>>> It is possible there’s something really slow on your network.   
>>> Just to eliminate MPI_INIT as a possible cause, can you try a  
>>> non-MPI program:  maybe /bin/true or /bin/hostname?
>>>
>>> % mpiexec -f fila5 -np 8 /bin/true
>>>
>>>  — Pavan
>>>
>>> On Nov 28, 2013, at 8:10 AM, Bixente BODO GOMEZ  
>>> <bixente.bodo at ehu.es> wrote:
>>>
>>>> Hello.
>>>>
>>>> I'm testing an MPICH (3.0.4) cluster with 7 quad-core nodes running
>>>> Ubuntu 12.04.  The master holds the home directory and the nodes
>>>> mount it over NFS.  I have changed RPCNFSDCOUNT from 8 to 64.
>>>>
>>>> Programs run fine on the master plus ONE other node, but when I
>>>> start them on more nodes, MPI_Init (I think) takes a long time
>>>> (~20 minutes).  During that time there is heavy network traffic
>>>> (reads and writes) on all nodes and heavy disk activity on the
>>>> master.  For example:
>>>>
>>>> mpiu@u105251:~$ date; mpirun -f fila5 -np 8 test/hello; date
>>>> mié nov 27 15:17:29 CET 2013
>>>> Hola desde el procesador u105251. 0 de 8
>>>> Hola desde el procesador u105251. 1 de 8
>>>> Hola desde el procesador u105251. 2 de 8
>>>> Hola desde el procesador u105251. 3 de 8
>>>> Hola desde el procesador u103972. 4 de 8
>>>> Hola desde el procesador u103972. 5 de 8
>>>> Hola desde el procesador u103972. 6 de 8
>>>> Hola desde el procesador u103972. 7 de 8
>>>> mié nov 27 15:17:30 CET 2013
>>>> mpiu@u105251:~$ date; mpirun -f fila5 -np 16 test/hello; date
>>>> mié nov 27 15:17:39 CET 2013
>>>> Hola desde el procesador u105251. 2 de 16
>>>> Hola desde el procesador u105251. 0 de 16
>>>> Hola desde el procesador u105251. 1 de 16
>>>> Hola desde el procesador u105251. 3 de 16
>>>> Hola desde el procesador u103972. 4 de 16
>>>> Hola desde el procesador u103950. 8 de 16
>>>> Hola desde el procesador u103976.12 de 16
>>>> Hola desde el procesador u103972. 5 de 16
>>>> Hola desde el procesador u103950. 9 de 16
>>>> Hola desde el procesador u103976.13 de 16
>>>> Hola desde el procesador u103972. 7 de 16
>>>> Hola desde el procesador u103950.10 de 16
>>>> Hola desde el procesador u103976.14 de 16
>>>> Hola desde el procesador u103972. 6 de 16
>>>> Hola desde el procesador u103950.11 de 16
>>>> Hola desde el procesador u103976.15 de 16
>>>> mié nov 27 15:36:18 CET 2013
>>>> mpiu@u105251:~$
>>>>
>>>> Once the programs start, i.e. from the first C instruction, they
>>>> run fine.  That is why I think the problem is MPI_Init.
>>>>
>>>> Does anybody know why?
>>>> Thanks.
>>>>
>>>> _______________________________________________
>>>> discuss mailing list     discuss at mpich.org
>>>> To manage subscription options or unsubscribe:
>>>> https://lists.mpich.org/mailman/listinfo/discuss
>>>
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>
>>
>> <test.txt>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
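
The two checks suggested in this thread (timing ssh between the nodes
and launching a non-MPI program) can be scripted so that the numbers
are comparable between runs.  What follows is only a sketch in Python,
not something posted in the thread.  The helper names time_command,
read_hosts and time_ssh are made up for illustration.  The contents of
fila5 were never shown, so the machine file format is an assumption:
one host per line, optionally followed by ":N", with "#" starting a
comment.

```python
import subprocess
import time


def time_command(cmd, timeout=None):
    """Run cmd (a list of arguments); return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    return time.perf_counter() - start, result.returncode


def read_hosts(machine_file):
    """Return the host names listed in a machine file.

    Assumed format: one host per line, optional ':N' process count,
    '#' starts a comment, blank lines are ignored.
    """
    hosts = []
    with open(machine_file) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()
            if line:
                # Keep only the host name: first token, without ':N'.
                hosts.append(line.split()[0].split(":", 1)[0])
    return hosts


def time_ssh(hosts, remote_cmd="/bin/true"):
    """Time a trivial remote command over ssh on each host.

    BatchMode=yes makes ssh fail instead of waiting at a password prompt,
    so a prompt is not mistaken for network delay.
    """
    return {
        host: time_command(["ssh", "-o", "BatchMode=yes", host, remote_cmd])
        for host in hosts
    }
```

For example, time_ssh(read_hosts("fila5")) would time /bin/true over
ssh on every listed host, and
time_command(["mpiexec", "-f", "fila5", "-np", "8", "/bin/true"])
would time the non-MPI launch suggested earlier in the thread.  If ssh
alone is already slow, the delay is unlikely to come from MPI_Init.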





