[mpich-discuss] osu_latency test: why 8KB takes less time than 4KB and 2KB takes less time than 1KB?

Min Si msi at anl.gov
Mon Jul 2 13:10:23 CDT 2018


Could you please try mpich-3.3b3?
http://www.mpich.org/static/downloads/3.3b3/mpich-3.3b3.tar.gz

Min
On 2018/07/02 13:01, Abu Naser wrote:
>
> Hello Min,
>
>
> I downloaded it from 
> http://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1.tar.gz, but it 
> did not work. I received almost the same error, except this time there 
> was no process information from my remote machine.
>
> Previously I received this:
>
> Process 3 of 4 is on dhcp16194
> Process 1 of 4 is on dhcp16194
> Process 0 of 4 is on dhcp16198
> Process 2 of 4 is on dhcp16198
>
> With the new source code:
>
> Process 0 of 4 is on dhcp16198
> Process 2 of 4 is on dhcp16198
>
>
> The entire error message is:
>
> Process 0 of 4 is on dhcp16198
> Process 2 of 4 is on dhcp16198
> Fatal error in PMPI_Bcast: Unknown error class, error stack:
> PMPI_Bcast(1600)............................: MPI_Bcast(buf=0x7ffd1ee145f0, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1452).......................:
> MPIR_Bcast(1476)............................:
> MPIR_Bcast_intra(1249)......................:
> MPIR_SMP_Bcast(1081)........................:
> MPIR_Bcast_binomial(285)....................:
> MPIC_Send(303)..............................:
> MPIC_Wait(226)..............................:
> MPIDI_CH3i_Progress_wait(242)...............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(698)..:
> MPIDI_CH3_Sockconn_handle_connect_event(597): [ch3:sock] failed to connnect to remote process
> MPIDU_Socki_handle_connect(808).............: connection failure (set=0,sock=1,errno=111:Connection refused)
> MPIR_SMP_Bcast(1088)........................:
> MPIR_Bcast_binomial(310)....................: Failure during collective
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(1600)........: MPI_Bcast(buf=0x7ffe2eeb90f0, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
> MPIR_Bcast_impl(1452)...:
> MPIR_Bcast(1476)........:
> MPIR_Bcast_intra(1249)..:
> MPIR_SMP_Bcast(1088)....:
> MPIR_Bcast_binomial(310): Failure during collective
>
> Again, if I configure the new source with tcp, it works fine.
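>
> (For what it is worth, a minimal standalone program that exercises the
> same MPI_Bcast pattern, roughly what examples/cpi does before its
> computation, would look like the sketch below. It is only an
> illustrative reproducer, not the actual cpi source.)
>
> #include <mpi.h>
> #include <stdio.h>
>
> /* Broadcast one int from rank 0, matching the failing trace above. */
> int main(int argc, char *argv[])
> {
>     int rank, size, n = 0, namelen;
>     char name[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Get_processor_name(name, &namelen);
>     printf("Process %d of %d is on %s\n", rank, size, name);
>
>     if (rank == 0)
>         n = 100;                                  /* value to broadcast */
>     MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* the call that fails */
>     printf("rank %d received n = %d\n", rank, n);
>
>     MPI_Finalize();
>     return 0;
> }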
>
>
> Thank You.
>
>
> Best Regards,
>
> Abu Naser
>
> ------------------------------------------------------------------------
> *From:* Min Si <msi at anl.gov>
> *Sent:* Monday, July 2, 2018 11:56:51 AM
> *To:* discuss at mpich.org
> *Subject:* Re: [mpich-discuss] osu_latency test: why 8KB takes less 
> time than 4KB and 2KB takes less time than 1KB?
> Hi Abu,
>
> Thanks for reporting this. Can you please try the latest release with 
> ch3/sock and see if you still have this error?
>
> Min
> On 2018/07/01 21:47, Abu Naser wrote:
>>
>> Hello Min,
>>
>>
>> After compiling my mpich-3.2.1 with sock, when I try to run any 
>> program, including the osu benchmark or examples/cpi, on two machines, 
>> I receive the following error -
>>
>>
>> Process 3 of 4 is on dhcp16194
>> Process 1 of 4 is on dhcp16194
>> Process 0 of 4 is on dhcp16198
>> Process 2 of 4 is on dhcp16198
>> Fatal error in PMPI_Bcast: Unknown error class, error stack:
>> PMPI_Bcast(1600)............................: MPI_Bcast(buf=0x7ffc1808542c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
>> MPIR_Bcast_impl(1452).......................:
>> MPIR_Bcast(1476)............................:
>> MPIR_Bcast_intra(1249)......................:
>> MPIR_SMP_Bcast(1081)........................:
>> MPIR_Bcast_binomial(285)....................:
>> MPIC_Send(303)..............................:
>> MPIC_Wait(226)..............................:
>> MPIDI_CH3i_Progress_wait(242)...............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(698)..:
>> MPIDI_CH3_Sockconn_handle_connect_event(597): [ch3:sock] failed to connnect to remote process
>> MPIDU_Socki_handle_connect(808).............: connection failure (set=0,sock=1,errno=111:Connection refused)
>> MPIR_SMP_Bcast(1088)........................:
>> MPIR_Bcast_binomial(310)....................: Failure during collective
>> Fatal error in PMPI_Bcast: Other MPI error, error stack:
>> PMPI_Bcast(1600)........: MPI_Bcast(buf=0x7ffd9eeebdac, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
>> MPIR_Bcast_impl(1452)...:
>> MPIR_Bcast(1476)........:
>> MPIR_Bcast_intra(1249)..:
>> MPIR_SMP_Bcast(1088)....:
>> MPIR_Bcast_binomial(310): Failure during collective
>>
>> I checked the MPICH FAQ and the mpich discussion list. Based on that, 
>> I checked the following and found that they are all fine on my machines -
>>
>> - the firewall is disabled on both machines
>>
>> - I can do passwordless ssh between both machines
>>
>> - /etc/hosts on both machines is configured with the proper IP 
>> addresses and hostnames
>>
>> - I have updated the library path and used the absolute path for mpiexec
>>
>> - Most importantly, when I configure and build MPICH with tcp, it 
>> works fine.
>>
>>
>> I think I am missing something but have not been able to figure it 
>> out yet. Any help would be appreciated.
>>
>>
>> Thank you.
>>
>>
>>
>>
>>
>>
>> Best Regards,
>>
>> Abu Naser
>>
>> ------------------------------------------------------------------------
>> *From:* Min Si <msi at anl.gov> <mailto:msi at anl.gov>
>> *Sent:* Tuesday, June 26, 2018 12:54:29 PM
>> *To:* discuss at mpich.org <mailto:discuss at mpich.org>
>> *Subject:* Re: [mpich-discuss] osu_latency test: why 8KB takes less 
>> time than 4KB and 2KB takes less time than 1KB?
>> Hi Abu,
>>
>> I think the results are stable enough. Perhaps you could also try the 
>> following tests and see if a similar trend exists:
>> - MPICH/socket (set `--with-device=ch3:sock` at configure time)
>> - A socket-based ping-pong test without MPI (a rough sketch follows below)
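>>
>> For the second test, something along the lines of the sketch below
>> would do. This is only an illustrative TCP ping-pong with minimal
>> error handling; the port number, iteration count, and command-line
>> interface are arbitrary choices, not part of any existing tool.
>> Comparing its 1K-16K numbers against osu_latency should show whether
>> the 2K/8K dips come from the TCP/network side or from the MPI layer.
>>
>> /* pingpong.c: run "./pingpong server <msgsize>" on one node, then
>>  * "./pingpong client <server-ip> <msgsize>" on the other. */
>> #include <arpa/inet.h>
>> #include <netinet/in.h>
>> #include <netinet/tcp.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <string.h>
>> #include <sys/socket.h>
>> #include <sys/time.h>
>> #include <unistd.h>
>>
>> #define PORT  9999
>> #define ITERS 10000
>>
>> /* Send or receive exactly len bytes. */
>> static void xfer(int sock, char *buf, int len, int sending)
>> {
>>     int n = 0, rc;
>>     while (n < len) {
>>         rc = sending ? (int) send(sock, buf + n, len - n, 0)
>>                      : (int) recv(sock, buf + n, len - n, 0);
>>         if (rc <= 0) { perror("xfer"); exit(1); }
>>         n += rc;
>>     }
>> }
>>
>> int main(int argc, char *argv[])
>> {
>>     int server = (argc > 1 && strcmp(argv[1], "server") == 0);
>>     int msgsize = atoi(argv[server ? 2 : 3]);
>>     int one = 1, i, sock;
>>     char *buf = calloc(1, msgsize);
>>     struct sockaddr_in addr;
>>     struct timeval t0, t1;
>>
>>     memset(&addr, 0, sizeof(addr));
>>     addr.sin_family = AF_INET;
>>     addr.sin_port = htons(PORT);
>>
>>     if (server) {
>>         int lfd = socket(AF_INET, SOCK_STREAM, 0);
>>         addr.sin_addr.s_addr = INADDR_ANY;
>>         setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
>>         bind(lfd, (struct sockaddr *) &addr, sizeof(addr));
>>         listen(lfd, 1);
>>         sock = accept(lfd, NULL, NULL);
>>     } else {
>>         sock = socket(AF_INET, SOCK_STREAM, 0);
>>         inet_pton(AF_INET, argv[2], &addr.sin_addr);
>>         connect(sock, (struct sockaddr *) &addr, sizeof(addr));
>>     }
>>     setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
>>
>>     gettimeofday(&t0, NULL);
>>     for (i = 0; i < ITERS; i++) {
>>         /* client sends first, server echoes the message back */
>>         if (server) { xfer(sock, buf, msgsize, 0); xfer(sock, buf, msgsize, 1); }
>>         else        { xfer(sock, buf, msgsize, 1); xfer(sock, buf, msgsize, 0); }
>>     }
>>     gettimeofday(&t1, NULL);
>>
>>     if (!server) {
>>         double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
>>         printf("%d bytes: %.2f us one-way\n", msgsize, (us / ITERS) / 2.0);
>>     }
>>     close(sock);
>>     free(buf);
>>     return 0;
>> }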
>>
>> At this point, I could not think of any MPI-specific design for 2k/8k 
>> messages. My guess is that it is related to your network connection.
>>
>> Min
>>
>> On 2018/06/24 11:09, Abu Naser wrote:
>>>
>>> Hello Min and Jeff,
>>>
>>>
>>> Here are my experiment results. The default number of iterations in 
>>> osu_latency for 0B – 8KB is 10,000. With that setting I ran 
>>> osu_latency 100 times and found a standard deviation of 33 for the 
>>> 8KB message size.
>>>
>>>
>>> So later I set the iteration count to 50,000 and 100,000 for the 1KB – 
>>> 16KB message sizes, ran osu_latency 100 times for each setting, and 
>>> took the average and standard deviation.
>>>
>>>
>>> Msg size   Avg time (us)     Avg time (us)     Std dev          Std dev
>>> (bytes)    50K iterations    100K iterations   50K iterations   100K iterations
>>>
>>> 1K         85.10             84.9              0.55             0.45
>>> 2K         75.79             74.63             5.09             4.44
>>> 4K         273.80            274.71            4.18             2.45
>>> 8K         258.56            249.83            21.14            28
>>> 16K        281.31            281.02            3.22             4.10
>>>
>>>
>>>
>>> The standard deviation for the 8K message is very high, which implies 
>>> that it is not producing a consistent latency. That looks like the 
>>> reason 8K appears to take less time than 4K.
>>>
>>>
>>> Meanwhile, 2K has a standard deviation below 5, but the 1K latency 
>>> timings are more densely populated than the 2K ones. That is probably 
>>> the explanation for the lower latency of the 2K message.
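>>>
>>> For reference, a minimal sketch of how such averages and standard 
>>> deviations can be computed over the 100 runs is below (made-up sample 
>>> values; this assumes the sample standard deviation; compile with -lm):
>>>
>>> #include <math.h>
>>> #include <stdio.h>
>>>
>>> int main(void)
>>> {
>>>     /* stand-in values; the real input is the 100 osu_latency results */
>>>     double lat[] = { 258.1, 249.7, 301.2, 240.5 };
>>>     int i, n = sizeof(lat) / sizeof(lat[0]);
>>>     double sum = 0.0, mean, var = 0.0;
>>>
>>>     for (i = 0; i < n; i++)
>>>         sum += lat[i];
>>>     mean = sum / n;
>>>     for (i = 0; i < n; i++)
>>>         var += (lat[i] - mean) * (lat[i] - mean);
>>>     var /= (n - 1);                            /* sample variance */
>>>
>>>     printf("mean = %.2f us, stddev = %.2f us, relative stddev = %.1f%%\n",
>>>            mean, sqrt(var), 100.0 * sqrt(var) / mean);
>>>     return 0;
>>> }
>>>
>>> By that measure the relative standard deviation for 8K is roughly 
>>> 8-11%, well above the <=5% threshold suggested earlier, while 1K is 
>>> below 1%.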
>>>
>>>
>>> Thank you for your suggestions.
>>>
>>>
>>>
>>>
>>> Best Regards,
>>>
>>> Abu Naser
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Abu Naser
>>> *Sent:* Wednesday, June 20, 2018 1:48:53 PM
>>> *To:* discuss at mpich.org <mailto:discuss at mpich.org>
>>> *Subject:* Re: [mpich-discuss] osu_latency test: why 8KB takes less 
>>> time than 4KB and 2KB takes less time than 1KB?
>>>
>>> Hello Min,
>>>
>>>
>>> Thanks for the clarification.  I will do the experiment.
>>>
>>>
>>> Thanks.
>>>
>>> Best Regards,
>>>
>>> Abu Naser
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Min Si <msi at anl.gov> <mailto:msi at anl.gov>
>>> *Sent:* Wednesday, June 20, 2018 1:39:30 PM
>>> *To:* discuss at mpich.org <mailto:discuss at mpich.org>
>>> *Subject:* Re: [mpich-discuss] osu_latency test: why 8KB takes less 
>>> time than 4KB and 2KB takes less time than 1KB?
>>> Hi Abu,
>>>
>>> I think Jeff means that you should run your experiment with more 
>>> iterations in order to get stable results:
>>> - Increase the number of iterations of the loop in each execution (I 
>>> think the osu benchmark allows you to set it)
>>> - Run the experiments 10 or 100 times, and take the average and 
>>> standard deviation.
>>>
>>> If you see a very small standard deviation (e.g., <=5%), then the 
>>> trend is stable and you might not see such gaps.
>>>
>>> Best regards,
>>> Min
>>> On 2018/06/20 12:14, Abu Naser wrote:
>>>>
>>>> Hello Jeff,
>>>>
>>>>
>>>> Yes, I am using a switch, and other machines are also connected to 
>>>> that switch.
>>>>
>>>> If I remove the other machines and just use my two nodes with the 
>>>> switch, will it improve the performance by 200 ~ 400 iterations?
>>>>
>>>> Meanwhile I will give it a try with a single dedicated cable.
>>>>
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> Best Regards,
>>>>
>>>> Abu Naser
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Jeff Hammond <jeff.science at gmail.com> 
>>>> <mailto:jeff.science at gmail.com>
>>>> *Sent:* Wednesday, June 20, 2018 12:52:06 PM
>>>> *To:* MPICH
>>>> *Subject:* Re: [mpich-discuss] osu_latency test: why 8KB takes less 
>>>> time than 4KB and 2KB takes less time than 1KB?
>>>> Is the ethernet connection a single dedicated cable between the two 
>>>> machines or are you running through a switch that handles other 
>>>> traffic?
>>>>
>>>> My best guess is that this is noise and that you may be able to 
>>>> avoid it by running a very long time, e.g. 10000 iterations.
>>>>
>>>> Jeff
>>>>
>>>> On Wed, Jun 20, 2018 at 6:53 AM, Abu Naser <an16e at my.fsu.edu 
>>>> <mailto:an16e at my.fsu.edu>> wrote:
>>>>
>>>>
>>>>     Good day to all,
>>>>
>>>>
>>>>     I ran the point-to-point osu_latency test on two nodes 200
>>>>     times. The following are the average times in microseconds for
>>>>     various message sizes -
>>>>
>>>>     1KB    84.8514 us
>>>>     2KB    73.52535 us
>>>>     4KB    272.55275 us
>>>>     8KB    234.86385 us
>>>>     16KB    288.88 us
>>>>     32KB    523.3725 us
>>>>     64KB    910.4025 us
>>>>
>>>>
>>>>     From the above it looks like the 2KB message has lower latency
>>>>     than 1KB, and 8KB has lower latency than 4KB.
>>>>
>>>>     I was looking for an explanation of this behavior but did not
>>>>     find any.
>>>>
>>>>
>>>>      1. MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE is set to 128KB, so none of
>>>>         the above message sizes is using the rendezvous protocol. Is
>>>>         there any partitioning inside the eager protocol (e.g. 0 - 512
>>>>         bytes, 1KB - 8KB, 16KB - 64KB)? If yes, what are the
>>>>         boundaries for them? Can I log them with debug-event-logging?
>>>>
>>>>
>>>>     The setup I am using:
>>>>
>>>>     - the two nodes have Intel Core i7 CPUs, one with 16 GB of
>>>>       memory and the other with 8 GB
>>>>
>>>>     - mpich 3.2.1, configured and built to use nemesis tcp
>>>>
>>>>     - 1 Gb Ethernet connection
>>>>
>>>>     - NFS is used for sharing
>>>>
>>>>     - osu_latency: uses MPI_Send and MPI_Recv (a simplified sketch
>>>>       of this loop follows the list)
>>>>
>>>>     - MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE = 131072 (128KB)
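>>>>
>>>>     For reference, the timing loop in osu_latency is essentially the
>>>>     ping-pong pattern sketched below. This is a simplified
>>>>     illustration, not the benchmark's actual source; the message
>>>>     size, iteration count, and warm-up count are arbitrary values.
>>>>
>>>>     #include <mpi.h>
>>>>     #include <stdio.h>
>>>>     #include <string.h>
>>>>
>>>>     #define ITERS   10000
>>>>     #define SKIP    100    /* warm-up iterations excluded from timing */
>>>>     #define MSGSIZE 8192
>>>>
>>>>     int main(int argc, char *argv[])
>>>>     {
>>>>         char buf[MSGSIZE];
>>>>         int rank, i;
>>>>         double t0 = 0.0, t1;
>>>>         MPI_Status status;
>>>>
>>>>         MPI_Init(&argc, &argv);
>>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>         memset(buf, 'a', MSGSIZE);
>>>>
>>>>         for (i = 0; i < ITERS + SKIP; i++) {
>>>>             if (i == SKIP)
>>>>                 t0 = MPI_Wtime();     /* start timing after warm-up */
>>>>             if (rank == 0) {
>>>>                 MPI_Send(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
>>>>                 MPI_Recv(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
>>>>             } else if (rank == 1) {
>>>>                 MPI_Recv(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
>>>>                 MPI_Send(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
>>>>             }
>>>>         }
>>>>         t1 = MPI_Wtime();
>>>>
>>>>         if (rank == 0)  /* one-way latency = half the avg round trip */
>>>>             printf("%d bytes: %.2f us\n", MSGSIZE,
>>>>                    (t1 - t0) * 1e6 / ITERS / 2.0);
>>>>
>>>>         MPI_Finalize();
>>>>         return 0;
>>>>     }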
>>>>
>>>>
>>>>     Can anyone help me on that? Thanks in advance.
>>>>
>>>>
>>>>
>>>>
>>>>     Best Regards,
>>>>
>>>>     Abu Naser
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> Jeff Hammond
>>>> jeff.science at gmail.com <mailto:jeff.science at gmail.com>
>>>> http://jeffhammond.github.io/
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

