[mpich-discuss] CPU usage versus Nodes, Threads

Ruben Faelens faelens at kth.se
Thu Oct 23 10:57:44 CDT 2014


Hi Qiguo,

You should try to collect performance statistics, especially on what each of
your nodes is doing at every moment in time.
If I understand correctly, your algorithm does the following (rough sketch
below):
- Master thread: read in the data, split it up into pieces, transfer the
pieces to the slaves
- Slave threads: do the calculation, transfer the results back to the master
- Master: recombine the data, do its own calculation, split the data back up,
transfer the pieces to the slaves
- etc...
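
If that is roughly the shape of it, the loop can be pictured as in the sketch
below. This is only a rough illustration of the pattern, not your code:
compute_chunk(), NCHUNK and the iteration count are made-up placeholders, and
I am assuming one MPI process per slave with an MPI_Scatter/MPI_Gather style
exchange, which your program may well not use.

/* Rough sketch of the split / compute / recombine loop described above.
 * compute_chunk() and NCHUNK are hypothetical placeholders. */
#include <mpi.h>
#include <stdlib.h>

#define NCHUNK 1024                     /* elements per process per iteration (assumed) */

static void compute_chunk(double *buf, int n)   /* stand-in for the real per-piece work */
{
    for (int i = 0; i < n; ++i)
        buf[i] *= buf[i];
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *piece = malloc(NCHUNK * sizeof(double));
    double *full  = NULL;
    if (rank == 0)                      /* master holds the full dataset */
        full = calloc((size_t)NCHUNK * size, sizeof(double));

    for (int iter = 0; iter < 10; ++iter) {
        /* master splits the data; every rank receives its piece */
        MPI_Scatter(full, NCHUNK, MPI_DOUBLE, piece, NCHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        compute_chunk(piece, NCHUNK);   /* slave-side work */
        /* master recombines the results before the next split */
        MPI_Gather(piece, NCHUNK, MPI_DOUBLE, full, NCHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            /* the master's own calculation on the recombined data would go here */
        }
    }

    free(piece);
    free(full);
    MPI_Finalize();
    return 0;
}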

The reason you do not see linear performance scaling could be one of the
following:
- The master thread recombining and splitting the data set may be
responsible for a large part of the work (and is therefore the bottleneck).
- The work is not divided equally: a significant part of the time is spent
waiting on one slave node that has a more difficult piece (takes longer)
than the rest.
- There is a common dataset, and I/O on it takes a larger part of the time
when more slave nodes are used.

The only way to know for sure is to generate a log file that records when
every process starts and ends each step. Output a timestamp when
- the slave starts receiving data
- it starts the calculation
- it starts sending results back
- it starts waiting for its next piece of data
This will clearly show you what each node is doing at each moment in time,
and should identify the bottleneck (see the minimal example below).
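
A minimal way to do that is with MPI_Wtime(), writing one log file per rank.
The file name and phase labels here are just suggestions, and the actual
communication and computation are only indicated as comments where your own
calls would sit:

/* Minimal per-rank phase logging with MPI_Wtime().
 * Drop the fprintf lines into the existing slave loop at the points above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char fname[64];
    snprintf(fname, sizeof fname, "phase_log.%d.txt", rank);  /* one log per rank */
    FILE *logf = fopen(fname, "w");

    double t0 = MPI_Wtime();                                  /* per-rank reference point */

    fprintf(logf, "%.6f recv_start\n", MPI_Wtime() - t0);
    /* ... MPI_Recv / MPI_Scatter of the next piece of data goes here ... */

    fprintf(logf, "%.6f calc_start\n", MPI_Wtime() - t0);
    /* ... the actual calculation goes here ... */

    fprintf(logf, "%.6f send_start\n", MPI_Wtime() - t0);
    /* ... MPI_Send / MPI_Gather of the results goes here ... */

    fprintf(logf, "%.6f wait_start\n", MPI_Wtime() - t0);
    /* ... waiting for the next piece of data ... */

    fclose(logf);
    MPI_Finalize();
    return 0;
}

Note that MPI_Wtime() is not necessarily synchronized across nodes, but the
relative duration of each phase per rank is already enough to see which ranks
are computing and which are mostly waiting.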

/ Ruben


On Thu, Oct 23, 2014 at 5:27 PM, Qiguo Jing <qjing at trinityconsultants.com>
wrote:

>  Hi Bob,
>
>
>
> Thanks for your suggestions.   Here are more tests.  We actually have
> three clusters.
>
>
>
> Clusters 1 and 2:  8 nodes, (2 processors, 4 cores/processor, no HT – total
> 8 threads)/node
>
> Cluster 3:  8 nodes, (1 processor, 4 cores/processor, HT – total
> 8 threads)/node
>
>
>
> We also have a standalone machine:  2 processors, 6 cores/processor, HT –
> total 24 threads.
>
>
>
>
>
> For one particular case:
>
>
>
> Clusters 1 and 2 take 48 min to finish with 8 nodes, 8 threads/node, 60%
> CPU usage; 53 min to finish with 3 nodes, 8 threads/node, 90% CPU usage.
>
>
>
> Cluster 3 takes 227 min to finish with 8 nodes, 8 threads/node, 20% CPU
> usage; 207 min to finish with 3 nodes, 8 threads/node, 50% CPU usage.
>
>
>
> Standalone machine takes 82 min to finish with 24 threads, 100% CPU usage.
>
>
>
> It looks like with 24 threads they should be pretty busy.  Could the
> above phenomenon be a hardware issue rather than a software one?
>
>
>
> Qiguo
>
>
>
> From: Bob Ilgner [mailto:bobilgner at gmail.com]
> Sent: Thursday, October 23, 2014 1:11 AM
> To: discuss at mpich.org
> Subject: Re: [mpich-discuss] CPU usage versus Nodes, Threads
>
>
>
> Hi Qiguo,
>
>
>
> From the results table it looks as if you are using a computationally
> sparse algorithm, i.e. the problem may not lie with inefficient
> communication between threads, but simply with your algorithm not
> keeping the processors busy enough at a large number of threads. The only
> way you will know for sure whether this is a comms issue or an
> algorithmic one is to use a profiling tool, such as Vampir or Paraver.
>
>
>
> With the profiling results you will be able to determine whether you need
> to make algorithmic changes to your bulk processing and/or enhance the
> comms as per Huiwei's notes.
>
>
>
> I would be interested to know what your profiling shows.
>
>
>
> Regards, bob
>
>
>
> On Thu, Oct 23, 2014 at 12:02 AM, Qiguo Jing <qjing at trinityconsultants.com>
> wrote:
>
>  Hi All,
>
>
>
> We have a parallel program running on a cluster.  We recently found a
> case in which the CPU usage decreases and the run-time increases as the
> number of nodes increases.  Below is the results table.
>
>
>
> The particular run requires a lot of data communication between nodes.
>
>
>
> Any thoughts about this phenomenon?  Or is there any way we can improve the
> CPU usage when using a higher number of nodes?
>
>
>
> Average CPU Usage (%)    Number of Nodes    Number of Threads/Node
>         100                     1                     8
>          92                     2                     8
>          50                     3                     8
>          40                     4                     8
>          35                     5                     8
>          30                     6                     8
>          25                     7                     8
>          20                     8                     8
>          20                     8                     4
>
>
>
>
>
> Thanks!



-- 
/ Ruben FAELENS
+32 494 06 72 59