[mpich-discuss] Need help to run hybrid code.

Halim Amer aamer at anl.gov
Mon Aug 28 09:28:11 CDT 2017


First of all, you are using Open MPI and MVAPICH below; I don't see 
MPICH. If you are using MPICH, then please tell us the version (show the 
output of mpichversion).
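
As a rough sketch (assuming an MPI-3 library and that mpif90/mpirun come 
from the same installation), you can also ask the library itself what it 
is from inside a program:

  ! which_mpi.f90 -- print the vendor string of the MPI library this binary uses
  ! build: mpif90 which_mpi.f90 -o which_mpi     run: mpirun -np 1 ./which_mpi
  program which_mpi
    use mpi
    implicit none
    character(len=MPI_MAX_LIBRARY_VERSION_STRING) :: version
    integer :: ierr, reslen
    call MPI_Init(ierr)
    ! MPI_Get_library_version (MPI-3) returns something like "MPICH Version: ..."
    ! or "Open MPI v...", which tells you what the binary is really linked against.
    call MPI_Get_library_version(version, reslen, ierr)
    print *, version(1:reslen)
    call MPI_Finalize(ierr)
  end program which_mpi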

 >   Starting omp_dotprod_hybrid. Using           4  Cores...
 >   Core           3  using          16  threads
 >   Core           0  using          16  threads
 >   Core           2  using          16  threads
 >   Core           1  using          16  threads

Second, you are using 16 threads per core; naturally you won't scale.

Third, even if you were using 1 OpenMP thread per hardware thread, there 
is no guarantee that your program will scale (as in, the time decreases 
as you increase the number of threads). Your program might have 
scalability bottlenecks that have nothing to do with MPI. My first 
advice is to forget about MPI and make sure your program scales with 
only OpenMP (see the sketch below). If your machine truly has 64 
hardware threads, then use 1 MPI process per node and investigate the 
issue.
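
For example, here is a minimal OpenMP-only sketch (a plain dot product, 
not your pjet code) that you can build with "gfortran -fopenmp" and run 
with OMP_NUM_THREADS=1, 8, 16 to see whether plain OpenMP scales at all 
on this machine:

  ! omp_scaling.f90 -- time a parallel dot product at different thread counts
  program omp_scaling
    use omp_lib
    implicit none
    integer, parameter :: n = 50000000
    real(8), allocatable :: a(:), b(:)
    real(8) :: s, t0, t1
    integer :: i
    allocate(a(n), b(n))
    a = 1.0d0
    b = 2.0d0
    s = 0.0d0
    t0 = omp_get_wtime()
    ! timed region: each thread accumulates a private partial sum
    !$omp parallel do reduction(+:s)
    do i = 1, n
       s = s + a(i) * b(i)
    end do
    !$omp end parallel do
    t1 = omp_get_wtime()
    print *, 'threads =', omp_get_max_threads(), ' time =', t1 - t0, ' sum =', s
  end program omp_scaling

If the time does not drop here as you add threads, the problem is on the 
OpenMP/hardware side and has nothing to do with any MPI library.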

Finally, if you still suspect bad thread bindings, then check where the 
threads actually land on the hardware instead of measuring execution 
time. Use sched_getcpu() to query thread placement on Linux and tools 
like hwloc to understand the machine topology.
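
A minimal sketch of that kind of check (assuming Linux/glibc, where 
sched_getcpu() is available, and an MPI-3 mpif90 wrapper): it prints, 
for every OpenMP thread of every rank, the hardware CPU it is currently 
running on, so you can see directly whether your -bind-to options do 
what you expect. On the topology side, hwloc's lstopo shows the 
socket/core/hardware-thread layout to compare against.

  ! check_binding.f90 -- show where each rank's OpenMP threads actually run
  ! build: mpif90 -fopenmp check_binding.f90 -o check_binding
  ! run:   mpirun -np 4 ./check_binding      (with your usual binding options)
  program check_binding
    use mpi
    use omp_lib
    use iso_c_binding
    implicit none
    interface
       ! glibc routine: hardware CPU the calling thread is executing on
       function sched_getcpu() bind(C, name="sched_getcpu")
         import :: c_int
         integer(c_int) :: sched_getcpu
       end function sched_getcpu
    end interface
    integer :: ierr, rank, provided
    call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    ! one line per thread; lines from different threads/ranks may interleave
    !$omp parallel
    print '(3(a,i4))', ' rank', rank, '  omp thread', omp_get_thread_num(), &
                       '  on hw cpu', sched_getcpu()
    !$omp end parallel
    call MPI_Finalize(ierr)
  end program check_binding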

I doubt your problems are related to issues in MPICH, which is the 
purpose of this list. Thus, I suggest moving this discussion off the 
list and replying to me personally instead of replying to the list.

Halim
www.mcs.anl.gov/~aamer

On 8/28/17 8:39 AM, Pasha Pashaei wrote:
> Dear friends
> I am going to run a hybrid MPI+OPENMP code.
> As you can see below, I varied the number of threads in several cases while running my main code with OpenMPI, MPICH, and MPICH2.
> As you can see, with OpenMPI and MPICH it seems that OpenMP did not work at all, since the total time did not change considerably. But with MPICH2 the total computational time increased as I increased the number of threads. This could be because virtual threads were used instead of physical threads (or, as you said, over-subscribing).
> 
> 
> Hybrid code results (MPI + OpenMP):
> With your suggested commands:
> mvapich
> mpirun -np 4 -genv OMP_NUM_THREADS 1 --bind-to hwthread:1 ./pjet.gfortran > output.txt
> Total time = 7.290E+02
> mpirun -np 4 -genv OMP_NUM_THREADS 8 --bind-to hwthread:8 ./pjet.gfortran > output.txt
> Total time =  4.940E+02
> mpirun -np 4 -genv OMP_NUM_THREADS 8 --bind-to hwthread:8  -map-by hwthread:8 ./pjet.gfortran > output.txt
> Total time =  4.960E+02
> mpirun -np 4 -genv OMP_NUM_THREADS 16 --bind-to hwthread:16 ./pjet.gfortran > output.txt
> Total time = 4.502E+02
> mpirun -np 4 -genv OMP_NUM_THREADS 16 -bind-to core:16 -map-by core:16 ./pjet.gfortran > output.txt
> Total time =  4.628E+02
> 
> Previous commands
> OpenMPI 1.8.1
> mpirun -np 4 -x OMP_NUM_THREADS=1 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time =  4.475E+02
> mpirun -np 4 -x OMP_NUM_THREADS=8 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time =  4.525E+02
> mpirun -np 4 -x OMP_NUM_THREADS=16 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time =  4.611E+02
> 
> mvapich
> mpirun -np 4 -genv OMP_NUM_THREADS 1 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time = 4.441E+02
> mpirun -np 4 -genv OMP_NUM_THREADS 4 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time = 4.535E+02
> mpirun -np 4 -genv OMP_NUM_THREADS 8 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time =  4.552E+02
> mpirun -np 4 -genv OMP_NUM_THREADS 16 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time =  4.591E+02
> 
> mvapich2
> mpirun -np 4 -genv OMP_NUM_THREADS 1 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time = 4.935E+02
> mpirun -np 4 -genv OMP_NUM_THREADS 4 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time = 5.562E+02
> mpirun -np 4 -genv OMP_NUM_THREADS 8 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time = 6.392E+02
> mpirun -np 4 -genv OMP_NUM_THREADS 16 -bind-to socket -map-by socket ./pjet.gfortran > output.txt
> Total time = 8.170E+02
> 
> Then I used a simple "hybrid.f90" code to check whether the computer recognizes the correct number of cores and threads. It showed the correct values for all three implementations in the different cases.
> Here is its result:
> 
>   Starting omp_dotprod_hybrid. Using           4  Cores...
>   Core           3  using          16  threads
>   Core           0  using          16  threads
>   Core           2  using          16  threads
>   Core           1  using          16  threads
>   Core  1 thread  0  partial sum =   0.0000000000000000
>   Core  3 thread  0  partial sum =   0.0000000000000000
>   Core  1 thread  4  partial sum =   0.0000000000000000
>   Core  3 thread  7  partial sum =   200.00000000000000
>   Core  1 thread  8  partial sum =   200.00000000000000
>   Core  3 thread  9  partial sum =   200.00000000000000
>   Core  1 thread  11 partial sum =   200.00000000000000
>   Core  3 thread  3  partial sum =   200.00000000000000
>   Core  1 thread  2  partial sum =   0.0000000000000000
>   Core  3 thread  5  partial sum =   0.0000000000000000
>   Core  1 thread  3  partial sum =   200.00000000000000
>   Core  3 thread  2  partial sum =   0.0000000000000000
>   Core  1 thread  13 partial sum =   200.00000000000000
>   Core  3 thread  12 partial sum =   200.00000000000000
>   Core  1 thread  1  partial sum =   200.00000000000000
>   Core  3 thread  1  partial sum =   200.00000000000000
>   Core  3 thread  8  partial sum =   0.0000000000000000
>   Core  1 thread  7  partial sum =   0.0000000000000000
>   Core  3 thread  11 partial sum =   200.00000000000000
>   Core  1 thread  15 partial sum =   0.0000000000000000
>   Core  3 thread  15 partial sum =   0.0000000000000000
>   Core  1 thread  10 partial sum =   200.00000000000000
>   Core  1 thread  9  partial sum =   200.00000000000000
>   Core  3 thread  13 partial sum =   0.0000000000000000
>   Core  1 thread  5  partial sum =   200.00000000000000
>   Core  3 thread  6  partial sum =   0.0000000000000000
>   Core  3 thread  4  partial sum =   0.0000000000000000
>   Core  1 thread  6  partial sum =   0.0000000000000000
>   Core  3 thread  14 partial sum =   200.00000000000000
>   Core  1 thread  12 partial sum =   0.0000000000000000
>   Core  3 thread  10 partial sum =   200.00000000000000
>   Core  0 thread  0  partial sum =   0.0000000000000000
>   Core  0 thread  14 partial sum =   200.00000000000000
>   Core  0 thread  8  partial sum =   200.00000000000000
>   Core  0 thread  7  partial sum =   0.0000000000000000
>   Core  0 thread  15 partial sum =   200.00000000000000
>   Core  0 thread  5  partial sum =   200.00000000000000
>   Core  0 thread  9  partial sum =   200.00000000000000
>   Core  0 thread  11 partial sum =   0.0000000000000000
>   Core  0 thread  10 partial sum =   200.00000000000000
>   Core  0 thread  6  partial sum =   200.00000000000000
>   Core  0 thread  3  partial sum =   200.00000000000000
>   Core  0 thread  4  partial sum =   0.0000000000000000
>   Core  0 thread  2  partial sum =   0.0000000000000000
>   Core  0 thread  13 partial sum =   0.0000000000000000
>   Core  0 thread  12 partial sum =   0.0000000000000000
>   Core  0 thread  1  partial sum =   0.0000000000000000
>   Core  0 partial sum =   1600.0000000000000
>   Core  2 thread  3  partial sum =   0.0000000000000000
>   Core  2 thread  15 partial sum =   0.0000000000000000
>   Core  2 thread  0  partial sum =   0.0000000000000000
>   Core  2 thread  2  partial sum =   200.00000000000000
>   Core  2 thread  4  partial sum =   0.0000000000000000
>   Core  2 thread  5  partial sum =   0.0000000000000000
>   Core  2 thread  9  partial sum =   200.00000000000000
>   Core  2 thread  7  partial sum =   0.0000000000000000
>   Core  2 thread  14 partial sum =   200.00000000000000
>   Core  2 thread  8  partial sum =   200.00000000000000
>   Core  2 thread  12 partial sum =   200.00000000000000
>   Core  2 thread  10 partial sum =   200.00000000000000
>   Core  2 thread  6  partial sum =   200.00000000000000
>   Core  2 thread  1  partial sum =   0.0000000000000000
>   Core  2 thread  13 partial sum =   0.0000000000000000
>   Core  2 thread  11 partial sum =   200.00000000000000
>   Core  2 partial sum =   1600.0000000000000
>   Core  3 partial sum =   1600.0000000000000
>   Core  1 thread  14  partial sum =   0.0000000000000000
>   Core  1 partial sum =   1600.0000000000000
>   Done. Hybrid version: global sum  = 6400.0000000000000
> 
> 
> 
> Please tell me if there is something I should check. I am still getting nowhere.
> Best regards
> 
> Pasha Pashaei
> 
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

