Could you elaborate on why the dynamic process model is not sufficient for your needs?

Halim
www.mcs.anl.gov/~aamer

On 5/26/17 9:11 AM, sanjeev s wrote:


Hi mpich,

I have a requirement where we need to add, start, and stop application
instances on the fly, without restarting the job. Is there any MPICH service
available for this? I looked through the dynamic process model, but it does
not meet our needs.

More precisely: suppose I have started 4 instances of my application and now
want to add one more instance dynamically to this set. Is there any tool in
MPICH that supports this kind of fault-tolerance behavior?

Thanks
Sanjeev
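For reference, the dynamic process model Halim mentions is exposed mainly through MPI_Comm_spawn: a running job can launch additional instances of an executable and talk to them over an intercommunicator. Below is a minimal sketch, not code from this thread; the executable name "./worker" and the count of 1 are placeholder assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Collectively spawn one more instance of a (hypothetical)
     * worker executable while this job is running. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    if (rank == 0)
        printf("spawned one additional instance\n");

    /* The parent job and the spawned instance communicate over
     * 'intercomm', e.g. with MPI_Send/MPI_Recv, or merge it into a
     * single intracommunicator with MPI_Intercomm_merge. */
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

Note that MPI_Comm_spawn grows the job by creating new processes with their own MPI_COMM_WORLD; the existing MPI_COMM_WORLD is not enlarged, which is one common reason the dynamic process model is felt to be insufficient.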





_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
I have a small network of machines all running the same OS (Linaro Ubuntu
Linux); they were all cloned from the same disk image and differ only in
their machine names (UNIT1 through UNIT4).

I can ssh between them at will, trust has been established and I no longer
get asked for a password upon connecting.  MPICH is installed from the Ubuntu
repository (not quite the latest version: mpiexec reports version OpenRTE
1.8.1, the mpich package is 3.0.4-6ubuntu1), and I can run a demo like cpi
with no issues, using a little mpi-run.sh bash script (the default shell is
tcsh, however):

(begin script)
#!/bin/bash

set -e

ESDK=${EPIPHANY_HOME}
ELIBS=${ESDK}/tools/host/lib:${LD_LIBRARY_PATH}
EHDF=${EPIPHANY_HDF}

echo "Running cpi on machines.u2.mpi"
LD_LIBRARY_PATH=${ELIBS} mpiexec --allow-run-as-root -machinefile /home/linaro/.machines.u2.mpi -n 1 /home/linaro/myMPI/cpi
echo "Done!"
(end script)

.machines.u2.mpi consists of the one line:

linaro at UNIT2

From UNIT1, if I do:

$ ./mpi-run.sh
Running cpi on machines.u2.mpi
Process 0 of 1 is on UNIT2
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.001327
Done!

If I edit the script to change the mpiexec line like this:

sudo -E LD_LIBRARY_PATH=${ELIBS} mpiexec --allow-run-as-root -machinefile /home/linaro/.machines.u2.mpi -n 1 /home/linaro/myMPI/cpi

Now I get (edited for brevity):

$ ./mpi-run.sh
Running cpi on machines.u2.mpi
linaro at UNIT2's password:
PATH=/usr/local/bin[...]: Command not found.
export: Command not found.
LD_LIBRARY_PATH=/usr/local/lib[...]: Command not found.
export: Command not found.
DYLD_LIBRARY_PATH: Undefined variable.

And it just stops there.  Note that the LD_LIBRARY_PATH being reported is
*not* the one passed in by the script.  I don't think it's managing to reach
the mpi execution stage itself.

If the machinefile lists more than one host, the password prompts appear two
at a time and interfere with each other such that no login succeeds (although
all machines have the same password).

Googling around, I've seen this series of error outputs in a wide variety of
other contexts, including Open MPI but also some completely unrelated
application suites and SDKs.

My problem is that the mpi binaries I need to run on the hosts absolutely
require sudo elevation.  Is sudo mpiexec the way to go?  What is going on in
my example case?

Daniel U. Thibault
RDDC - Centre de recherches de Valcartier | DRDC - Valcartier Research Centre
NAC : 918V QSDJ
Gouvernement du Canada | Government of Canada
http://www.valcartier.drdc-rddc.gc.ca/
 
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

MPI_Barrier before MPI_Allreduce can improve performance significantly.

This is hardly the full story.  It would be useful to know more about what
you are trying to accomplish.

Best,

Jeff

On Wed, Feb 25, 2015 at 3:07 PM, Junchao Zhang wrote:
> Yes. Many collectives have optimizations for power-of-two processes. In
> MPICH's source code allreduce.c, you can find the following comments.
>
> /* This is the default implementation of allreduce. The algorithm is:
>
>    Algorithm: MPI_Allreduce
>
>    For the heterogeneous case, we call MPI_Reduce followed by MPI_Bcast
>    in order to meet the requirement that all processes must have the
>    same result. For the homogeneous case, we use the following algorithms.
>
>    For long messages and for builtin ops and if count >= pof2 (where
>    pof2 is the nearest power-of-two less than or equal to the number
>    of processes), we use Rabenseifner's algorithm (see
>    http://www.hlrs.de/mpi/myreduce.html).
>    This algorithm implements the allreduce in two steps: first a
>    reduce-scatter, followed by an allgather. A recursive-halving
>    algorithm (beginning with processes that are distance 1 apart) is
>    used for the reduce-scatter, and a recursive doubling
>    algorithm is used for the allgather. The non-power-of-two case is
>    handled by dropping to the nearest lower power-of-two: the first
>    few even-numbered processes send their data to their right neighbors
>    (rank+1), and the reduce-scatter and allgather happen among the
>    remaining power-of-two processes. At the end, the first few
>    even-numbered processes get the result from their right neighbors.
>
>    For the power-of-two case, the cost for the reduce-scatter is
>    lgp.alpha + n.((p-1)/p).beta + n.((p-1)/p).gamma. The cost for the
>    allgather is lgp.alpha + n.((p-1)/p).beta. Therefore, the
>    total cost is:
>    Cost = 2.lgp.alpha + 2.n.((p-1)/p).beta + n.((p-1)/p).gamma
>
>    For the non-power-of-two case,
>    Cost = (2.floor(lgp)+2).alpha + (2.((p-1)/p) + 2).n.beta +
>           n.(1+(p-1)/p).gamma
>
>    For short messages, for user-defined ops, and for count < pof2
>    we use a recursive doubling algorithm (similar to the one in
>    MPI_Allgather). We use this algorithm in the case of user-defined ops
>    because in this case derived datatypes are allowed, and the user
>    could pass basic datatypes on one process and derived on another as
>    long as the type maps are the same. Breaking up derived datatypes
>    to do the reduce-scatter is tricky.
>
>    Cost = lgp.alpha + n.lgp.beta + n.lgp.gamma
>
>    Possible improvements:
>
>    End Algorithm: MPI_Allreduce
> */
>
> --Junchao Zhang
>
> On Wed, Feb 25, 2015 at 2:59 PM, Aiman Fang wrote:
>>
>> Hi,
>>
>> I came across a problem in experiments that makes me wonder if there is
>> any optimization of collective calls, such as MPI_Allreduce, for 2^n
>> number of ranks?
>>
>> We did some experiments on the Argonne Vesta system to measure the time
>> of MPI_Allreduce calls using 511, 512 and 513 processes (one process per
>> node). In each run, the synthetic benchmark first did some computation
>> and then called MPI_Allreduce 30 times, for 100 loops in total. We
>> measured the total time spent on communication.
>>
>> We found that the 512-process run gives the best performance. The times
>> for 511, 512 and 513 processes are 0.1492, 0.1449 and 0.1547 seconds
>> respectively. 512-proc outperforms 511-proc by 3.7%, and 513-proc by 6.7%.
>>
>> The mpich version we used is as follows.
>>
>> $ mpichversion
>> MPICH Version:        3.1.2
>> MPICH Release date:   Mon Jul 21 16:00:21 CDT 2014
>> MPICH Device:         pamid
>> MPICH configure:      --prefix=/home/fujita/soft/mpich-3.1.2
>>   --host=powerpc64-bgq-linux --with-device=pamid
>>   --with-file-system=gpfs:BGQ --disable-wrapper-rpath
>> MPICH CC:     powerpc64-bgq-linux-gcc -O2
>> MPICH CXX:    powerpc64-bgq-linux-g++ -O2
>> MPICH F77:    powerpc64-bgq-linux-gfortran -O2
>> MPICH FC:     powerpc64-bgq-linux-gfortran -O2
>>
>> Thanks!
>>
>> Best,
>> Aiman

--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
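For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of timing repeated MPI_Allreduce calls. It is not the benchmark Aiman used (that code is not in this thread); the message size of 1024 doubles and the 100 iterations are arbitrary assumptions.

#include <mpi.h>
#include <stdio.h>

#define COUNT 1024
#define ITERS 100

int main(int argc, char **argv)
{
    double in[COUNT], out[COUNT], t0, t1;
    int i, rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < COUNT; i++)
        in[i] = (double)rank;

    /* Warm up once, then time ITERS allreduces.  The barrier keeps
     * ranks from skewing the start of the timed region. */
    MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++)
        MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d ranks: %f s for %d allreduces\n", size, t1 - t0, ITERS);

    MPI_Finalize();
    return 0;
}

Running it at, say, 511, 512 and 513 ranks shows whether the power-of-two cutover described in the algorithm selection above is visible on a given machine.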
I have no idea. You may try to trace all events as said at
http://wiki.mpich.org/mpich/index.php/Debug_Event_Logging
From the trace log, one may find out something abnormal.

--Junchao Zhang

On Wed, Nov 26, 2014 at 4:25 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
> I disabled the whole firewall on those machines but still get the same
> problem: connection refused.
> I ran the program on another set of totally different machines that we
> have, but still the same problem.
> Any other thoughts on where the problem could be?
>
> Thanks.
>
> Amin Hassani,
> CIS department at UAB,
> Birmingham, AL, USA.
>
> On Wed, Nov 26, 2014 at 9:25 AM, Kenneth Raffenetti <raffenet at mcs.anl.gov> wrote:
>
>> The connection refused makes me think a firewall is getting in the way.
>> Is TCP communication limited to specific ports on the cluster? If so, you
>> can use this envvar to enforce a range of ports in MPICH.
>>
>> MPIR_CVAR_CH3_PORT_RANGE
>>     Description: The MPIR_CVAR_CH3_PORT_RANGE environment variable allows
>>     you to specify the range of TCP ports to be used by the process
>>     manager and the MPICH library. The format of this variable is
>>     <low>:<high>.  To specify any available port, use 0:0.
>>     Default: {0,0}
>>
>> On 11/25/2014 11:50 PM, Amin Hassani wrote:
>>
>>> Tried with the new configure too. Same problem :(
>>>
>>> $ mpirun -hostfile hosts-hydra -np 2 test_dup
>>> Fatal error in MPI_Send: Unknown error class, error stack:
>>> MPI_Send(174)..............: MPI_Send(buf=0x7fffd90c76c8, count=1,
>>> MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
>>> MPID_nem_tcp_connpoll(1832): Communication error with rank 1:
>>> Connection refused
>>>
>>> ===========================================================================
>>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> =   PID 5459 RUNNING AT oakmnt-0-a
>>> =   EXIT CODE: 1
>>> =   CLEANING UP REMAINING PROCESSES
>>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>> ===========================================================================
>>> [proxy:0:1 at oakmnt-0-b] HYD_pmcd_pmip_control_cmd_cb
>>> (../../../../src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed)
>>> failed
>>> [proxy:0:1 at oakmnt-0-b] HYDT_dmxu_poll_wait_for_event
>>> (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback
>>> returned error status
>>> [proxy:0:1 at oakmnt-0-b] main
>>> (../../../../src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error
>>> waiting for event
>>> [mpiexec at oakmnt-0-a] HYDT_bscu_wait_for_completion
>>> (../../../../src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of
>>> the processes terminated badly; aborting
>>> [mpiexec at oakmnt-0-a] HYDT_bsci_wait_for_completion
>>> (../../../../src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher
>>> returned error waiting for completion
>>> [mpiexec at oakmnt-0-a] HYD_pmci_wait_for_completion
>>> (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher
>>> returned error waiting for completion
>>> [mpiexec at oakmnt-0-a] main
>>> (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error
>>> waiting for completion
>>>
>>> Amin Hassani,
>>> CIS department at UAB,
>>> Birmingham, AL, USA.
>>>
>>> On Tue, Nov 25, 2014 at 11:44 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>>
>>>     So the error only happens when there is communication.
>>>
>>>     It may be caused by IB as you guessed before. Could you try to
>>>     reconfigure MPICH using "./configure --with-device=ch3:nemesis:tcp"
>>>     and try again?
>>>
>>>     --
>>>     Huiwei
>>>
>>>     > On Nov 25, 2014, at 11:23 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>     >
>>>     > Yes it works.
>>>     > output:
>>>     >
>>>     > $ mpirun -hostfile hosts-hydra -np 2 test
>>>     > rank 1
>>>     > rank 0
>>>     >
>>>     > Amin Hassani,
>>>     > CIS department at UAB,
>>>     > Birmingham, AL, USA.
>>>     >
>>>     > On Tue, Nov 25, 2014 at 11:20 PM, Lu, Huiwei <huiweilu at mcs.anl.gov> wrote:
>>>     > Could you try to run the following simple code to see if it works?
>>>     >
>>>     > #include <mpi.h>
>>>     > #include <stdio.h>
>>>     > int main(int argc, char** argv)
>>>     > {
>>>     >     int rank, size;
>>>     >     MPI_Init(&argc, &argv);
>>>     >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     >     printf("rank %d\n", rank);
>>>     >     MPI_Finalize();
>>>     >     return 0;
>>>     > }
>>>     >
>>>     > --
>>>     > Huiwei
>>>     >
>>>     > > On Nov 25, 2014, at 11:11 PM, Amin Hassani <ahassani at cis.uab.edu> wrote:
>>>     > >
>>>     > > No, I checked. Also I always install my MPIs in
>>>     > > /nethome/students/ahassani/usr/mpi. I never install them in
>>>     > > /nethome/students/ahassani/usr. So MPI files will never get there.
>>>     > > Even if I put /usr/mpi/bin in front of /usr/bin, it won't affect
>>>     > > anything. There has never been any mpi installed in /usr/bin.
>>>     > >
>>>     > > Thank you.

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss


Hi Bob,

Thanks for your suggestions.  Here are more tests.  We actually have three
clusters.

Cluster 1 and 2:  8 nodes, (2 processors, 4 cores/processor, no HT - total 8 threads)/node
Cluster 3:        8 nodes, (1 processor, 4 cores/processor, HT - total 8 threads)/node

We also have a standalone machine:  2 processors, 6 cores/processor, HT -
total 24 threads.

For one particular case:

Cluster 1 and 2 take 48 min to finish with 8 nodes, 8 threads/node, 60% CPU
usage; 53 min to finish with 3 nodes, 8 threads/node, 90% CPU usage.

Cluster 3 takes 227 min to finish with 8 nodes, 8 threads/node, 20% CPU
usage; 207 min to finish with 3 nodes, 8 threads/node, 50% CPU usage.

The standalone machine takes 82 min to finish with 24 threads, 100% CPU usage.

It looks like with 24 threads they should be pretty busy?  Could the above
phenomenon be a hardware issue more than software?

Qiguo

 

From: Bob Ilgner [mailto:bobilgner at gmail.com]
Sent: Thursday, October 23, 2014 1:11 AM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] CPU usage versus Nodes, Threads

Hi Qiguo,

From the results table it looks as if you are using a computationally sparse
algorithm, i.e. the problem may not lie with the inefficiency of your
communication between threads but just that your algorithm is not keeping
the processor busy enough for a large number of threads. The only way that
you will know for sure whether this is a comms issue or an algorithmic one
is to use a profiling tool, such as Vampir or Paraver.

With the profiling result you will be able to determine whether you need to
make algorithmic changes in your bulk processing and enhance comms as per
Huiwei's notes.

I would be interested to know what your profiling shows.

Regards, bob

 

On Thu, Oct 23, 2014 at 12:02 AM, Qiguo Jing <qjing at trinityconsultants.com> wrote:

Hi All,

We have a parallel program running on a cluster.  We recently found a case
which decreases the CPU usage and increases the run-time as the number of
nodes increases.  Below is the results table.

The particular run requires a lot of data communication between nodes.

Any thoughts about this phenomenon?  Or is there any way we can improve the
CPU usage when using a higher number of nodes?

 

Average CPU Usage (%)   Number of Nodes   Number of Threads/Node
        100                    1                   8
         92                    2                   8
         50                    3                   8
         40                    4                   8
         35                    5                   8
         30                    6                   8
         25                    7                   8
         20                    8                   8
         20                    8                   4

Thanks!


_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

Hi Bob,

Thanks for your suggestions.  Here are more tests.  We actually have three
clusters.

Cluster 1 and 2:  8 nodes, (2 processors, 4 cores/processor, no HT - total 8 threads)/node
Cluster 3:        8 nodes, (1 processor, 4 cores/processor, HT - total 8 threads)/node

We also have a standalone machine:  2 processors, 6 cores/processor, HT -
total 24 threads.

For one particular case:

Cluster 1 and 2 take 48 min to finish with 8 nodes, 8 threads/node, 60% CPU
usage; 53 min to finish with 3 nodes, 8 threads/node, 90% CPU usage.

Cluster 3 takes 227 min to finish with 8 nodes, 8 threads/node, 20% CPU
usage; 207 min to finish with 3 nodes, 8 threads/node, 50% CPU usage.

The standalone machine takes 24 min to finish with 24 threads, 100% CPU usage.

It looks like with 24 threads they should be pretty busy?  Could the above
phenomenon be a hardware issue more than software?

Qiguo

 

From: Bob Ilgner [mailto:bobilgner at gmail.com]
Sent: Thursday, October 23, 2014 1:11 AM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] CPU usage versus Nodes, Threads

Hi Qiguo,

From the results table it looks as if you are using a computationally sparse
algorithm, i.e. the problem may not lie with the inefficiency of your
communication between threads but just that your algorithm is not keeping
the processor busy enough for a large number of threads. The only way that
you will know for sure whether this is a comms issue or an algorithmic one
is to use a profiling tool, such as Vampir or Paraver.

With the profiling result you will be able to determine whether you need to
make algorithmic changes in your bulk processing and enhance comms as per
Huiwei's notes.

I would be interested to know what your profiling shows.

Regards, bob

 

On Thu, Oct 23, 2014 at 12:02 AM, Qiguo Jing <qjing at trinityconsultants.com> wrote:

Hi All,

We have a parallel program running on a cluster.  We recently found a case
which decreases the CPU usage and increases the run-time as the number of
nodes increases.  Below is the results table.

The particular run requires a lot of data communication between nodes.

Any thoughts about this phenomenon?  Or is there any way we can improve the
CPU usage when using a higher number of nodes?

 

Average CPU Usage (%)   Number of Nodes   Number of Threads/Node
        100                    1                   8
         92                    2                   8
         50                    3                   8
         40                    4                   8
         35                    5                   8
         30                    6                   8
         25                    7                   8
         20                    8                   8
         20                    8                   4

Thanks!


_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
Hi Qiguo,

From the results table it looks as if you are using a computationally sparse
algorithm, i.e. the problem may not lie with the inefficiency of your
communication between threads but just that your algorithm is not keeping
the processor busy enough for a large number of threads. The only way that
you will know for sure whether this is a comms issue or an algorithmic one
is to use a profiling tool, such as Vampir or Paraver.

With the profiling result you will be able to determine whether you need to
make algorithmic changes in your bulk processing and enhance comms as per
Huiwei's notes.

I would be interested to know what your profiling shows.

Regards, bob
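One low-tech way to get a first impression before setting up a full profiler is to bracket the communication calls with MPI_Wtime and compare communication time against total time per process. The following is only a hedged sketch of that idea, not code from this thread; the work() function and the MPI_Allreduce stand in for whatever the real application does per iteration.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Placeholder for the application's per-iteration computation. */
static void work(void) { usleep(1000); }

int main(int argc, char **argv)
{
    double t_comm = 0.0, t_total, t0, t1;
    double val = 1.0, sum;
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t_total = MPI_Wtime();
    for (i = 0; i < 100; i++) {
        work();                                   /* compute phase */
        t0 = MPI_Wtime();
        MPI_Allreduce(&val, &sum, 1, MPI_DOUBLE,  /* communication phase */
                      MPI_SUM, MPI_COMM_WORLD);
        t1 = MPI_Wtime();
        t_comm += t1 - t0;
    }
    t_total = MPI_Wtime() - t_total;

    printf("rank %d: %.1f%% of time in communication\n",
           rank, 100.0 * t_comm / t_total);

    MPI_Finalize();
    return 0;
}

If the communication fraction grows sharply with the node count, the time is going into messages (or into waiting at collectives because of load imbalance); if it stays small while CPU usage still drops, the profilers mentioned above are the right next step.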

On Thu, Oct 23, 2014 at 12:02 AM, Qiguo Jing <qjing at trinityconsultants.com> wrote:

Hi All,

We have a parallel program running on a cluster.  We recently found a case
which decreases the CPU usage and increases the run-time as the number of
nodes increases.  Below is the results table.

The particular run requires a lot of data communication between nodes.

Any thoughts about this phenomenon?  Or is there any way we can improve the
CPU usage when using a higher number of nodes?

 

Average CPU Usage (%)   Number of Nodes   Number of Threads/Node
        100                    1                   8
         92                    2                   8
         50                    3                   8
         40                    4                   8
         35                    5                   8
         30                    6                   8
         25                    7                   8
         20                    8                   8
         20                    8                   4

Thanks!


_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

loads and stores in the MPI_Win_lock_all epochs using MPI_Fetch_and_op (see
attached files). This version behaves very similarly to the original code and
also fails from time to time. Putting a sleep into the acquire busy loop
(usleep(100)) makes the code "much more robust" (a hack, I know, but it hints
at some underlying race condition?!). Let me know if you see any problems in
the way I am using MPI_Fetch_and_op in a busy loop. Flushing or syncing is
not necessary in this case, right?

All work is done with export MPIR_CVAR_ASYNC_PROGRESS=1 on mpich-3.2 and
mpich-3.3a2.

On Wed, Mar 8, 2017 at 4:21 PM, Halim Amer wrote:

> I cannot claim that I thoroughly verified the correctness of that code, so
> take it with a grain of salt. Please keep in mind that it is a test code
> from a tutorial book; those codes are meant for learning purposes, not for
> deployment.
>
> If your goal is to have a high performance RMA lock, I suggest you look
> into the recent HPDC'16 paper: "High-Performance Distributed RMA Locks".
>
> Halim
> www.mcs.anl.gov/~aamer
>
> On 3/8/17 3:06 AM, Ask Jakobsen wrote:
>
>> You are absolutely correct, Halim. Removing the test lmem[nextRank] == -1
>> in release fixes the problem. Great work. Now I will try to understand
>> why you are right. I hope the authors of the book will credit you for
>> discovering the bug.
>>
>> So in conclusion you need to remove the above mentioned test AND enable
>> asynchronous progression using the environment variable
>> MPIR_CVAR_ASYNC_PROGRESS=1 in MPICH (BTW I still can't get the code to
>> work in openmpi).
>>
>> On Tue, Mar 7, 2017 at 5:37 PM, Halim Amer wrote:
>>
>>>> detect that another process is being or already enqueued in the MCS
>>>> queue.
>>>
>>> Actually the problem occurs only when the waiting process already
>>> enqueued itself, i.e., the accumulate operation on the nextRank field
>>> succeeded.
>>>
>>> Halim
>>> www.mcs.anl.gov/~aamer
>>>
>>> On 3/7/17 10:29 AM, Halim Amer wrote:
>>>
>>>> In the Release protocol, try removing this test:
>>>>
>>>> if (lmem[nextRank] == -1) {
>>>>   If-Block;
>>>> }
>>>>
>>>> but keep the If-Block.
>>>>
>>>> The hang occurs because the process releasing the MCS lock fails to
>>>> detect that another process is being or already enqueued in the MCS
>>>> queue.
>>>>
>>>> Halim
>>>> www.mcs.anl.gov/~aamer
>>>>
>>>> On 3/7/17 6:43 AM, Ask Jakobsen wrote:
>>>>
>>>>> Thanks, Halim. I have now enabled asynchronous progress in MPICH
>>>>> (can't find something similar in openmpi) and now all ranks acquire
>>>>> the lock and the program finishes as expected. However, if I put a
>>>>> while(1) loop around the acquire-release code in main.c it will fail
>>>>> again at random and go into an infinite loop. The simple unfair lock
>>>>> does not have this problem.
>>>>>
>>>>> On Tue, Mar 7, 2017 at 12:44 AM, Halim Amer wrote:
>>>>>
>>>>>> My understanding is that this code assumes asynchronous progress.
>>>>>> An example of when the processes hang is as follows:
>>>>>>
>>>>>> 1) P0 finishes MCSLockAcquire()
>>>>>> 2) P1 is busy waiting in MCSLockAcquire() at
>>>>>>    do {
>>>>>>      MPI_Win_sync(win);
>>>>>>    } while (lmem[blocked] == 1);
>>>>>> 3) P0 executes MCSLockRelease()
>>>>>> 4) P0 waits on MPI_Win_lock_all() inside MCSLockRelease()
>>>>>>
>>>>>> Hang!
>>>>>>
>>>>>> For P1 to get out of the loop, P0 has to get out of
>>>>>> MPI_Win_lock_all() and execute its Compare_and_swap().
>>>>>>
>>>>>> For P0 to get out of MPI_Win_lock_all(), it needs an ACK from P1
>>>>>> that it got the lock.
>>>>>>
>>>>>> P1 does not make communication progress because MPI_Win_sync is not
>>>>>> required to do so. It only synchronizes private and public copies.
>>>>>>
>>>>>> For this hang to disappear, one can either trigger progress manually
>>>>>> by using heavy-duty synchronization calls instead of Win_sync (e.g.,
>>>>>> Win_unlock_all + Win_lock_all), or enable asynchronous progress.
>>>>>>
>>>>>> To enable asynchronous progress in MPICH, set the
>>>>>> MPIR_CVAR_ASYNC_PROGRESS env var to 1.
>>>>>>
>>>>>> Halim
>>>>>> www.mcs.anl.gov/~aamer
>>>>>>
>>>>>> On 3/6/17 1:11 PM, Ask Jakobsen wrote:
>>>>>>
>>>>>>> I am testing on x86_64 platform.
>>>>>>>
>>>>>>> I have tried to build both the mpich and the mcs lock code with -O0
>>>>>>> to avoid aggressive optimization. After your suggestion I have also
>>>>>>> tried to make volatile int *pblocked pointing to lmem[blocked] in
>>>>>>> the MCSLockAcquire function and volatile int *pnextrank pointing to
>>>>>>> lmem[nextRank] in MCSLockRelease, but it does not appear to make a
>>>>>>> difference.
>>>>>>>
>>>>>>> On suggestion from Richard Warren I have also tried building the
>>>>>>> code using openmpi-2.0.2 without any luck (however it appears to
>>>>>>> acquire the lock a couple of extra times before failing), which I
>>>>>>> find troubling.
>>>>>>>
>>>>>>> I think I will give up using local load/stores and will see if I
>>>>>>> can figure out how to rewrite it using MPI calls like
>>>>>>> MPI_Fetch_and_op as you suggest. Thanks for your help.
>>>>>>>
>>>>>>> On Mon, Mar 6, 2017 at 7:20 PM, Jeff Hammond wrote:
>>>>>>>
>>>>>>>> What processor architecture are you testing?
>>>>>>>>
>>>>>>>> Maybe set lmem to volatile or read it with MPI_Fetch_and_op rather
>>>>>>>> than a load. MPI_Win_sync cannot prevent the compiler from caching
>>>>>>>> *lmem in a register.
>>>>>>>>
>>>>>>>> Jeff
>>>>>>>>
>>>>>>>> On Sat, Mar 4, 2017 at 12:30 AM, Ask Jakobsen wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have downloaded the source code for the MCS lock from the
>>>>>>>>> excellent book "Using Advanced MPI" from
>>>>>>>>> http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples-advmpi/rma2/mcs-lock.c
>>>>>>>>>
>>>>>>>>> I have made a very simple piece of test code for testing the MCS
>>>>>>>>> lock but it works at random and often never escapes the busy
>>>>>>>>> loops in the acquire and release functions (see attached source
>>>>>>>>> code). The code appears semantically correct to my eyes.
>>>>>>>>>
>>>>>>>>> #include <stdio.h>
>>>>>>>>> #include <mpi.h>
>>>>>>>>> #include "mcs-lock.h"
>>>>>>>>>
>>>>>>>>> int main(int argc, char *argv[])
>>>>>>>>> {
>>>>>>>>>   MPI_Win win;
>>>>>>>>>   MPI_Init( &argc, &argv );
>>>>>>>>>
>>>>>>>>>   MCSLockInit(MPI_COMM_WORLD, &win);
>>>>>>>>>
>>>>>>>>>   int rank, size;
>>>>>>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>>
>>>>>>>>>   printf("rank: %d, size: %d\n", rank, size);
>>>>>>>>>
>>>>>>>>>   MCSLockAcquire(win);
>>>>>>>>>   printf("rank %d aquired lock\n", rank);   fflush(stdout);
>>>>>>>>>   MCSLockRelease(win);
>>>>>>>>>
>>>>>>>>>   MPI_Win_free(&win);
>>>>>>>>>   MPI_Finalize();
>>>>>>>>>   return 0;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> I have tested on several hardware platforms and mpich-3.2 and
>>>>>>>>> mpich-3.3a2 but with no luck.
>>>>>>>>>
>>>>>>>>> It appears that the MPI_Win_sync are not "refreshing" the local
>>>>>>>>> data or I have a bug I can't spot.
>>>>>>>>>
>>>>>>>>> A simple unfair lock like
>>>>>>>>> http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples-advmpi/rma2/ga_mutex1.c
>>>>>>>>> works perfectly.
>>>>>>>>>
>>>>>>>>> Best regards, Ask Jakobsen
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Hammond
>>>>>>>> jeff.science at gmail.com
>>>>>>>> http://jeffhammond.github.io/

_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
[Attachments: main.c, mcs-lock-fop.c, mcs-lock.h]
Hi Ruben,

You are right! My algorithm does what you described.

I will record the timestamp for each thread and every event. Thanks for your suggestions!

Qiguo

From: parasietje at gmail.com On Behalf Of Ruben Faelens
Sent: Thursday, October 23, 2014 10:58 AM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] CPU usage versus Nodes, Threads

Hi Qiguo,

You should try to collect performance statistics, especially regarding your specific nodes and what they are doing at every moment in time.

If I understand correctly, your algorithm does the following:
- Master thread: read in data, split it up into pieces, transfer pieces to slaves
- Slave thread: do calculation, transfer data back to master
- Master: recombine data, do calculation, split data back up, transfer pieces to slaves
- etc.

The reason you do not see linear performance scaling could be due to the following:
- The master thread recombining and splitting the data set may be responsible for a large part of the work (and therefore is the bottleneck).
- Work is not divided equally. A significant part of the time is spent waiting on one slave node that has a more difficult problem (takes longer) than the rest.
- There is a common dataset. I/O takes a larger part of the time when more slave nodes are used.

The only way to know for sure is to simply generate a log file that shows the time at which every process starts and ends a specific process step (a minimal sketch follows this message). Output the time when
- the slave starts receiving data
- it starts the calculation
- it starts sending results back
- it starts waiting for its next piece of data

This will clearly show you what each node is doing at each moment in time, and should identify the bottleneck.

/ Ruben
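A minimal way to produce such a log, assuming a plain MPI C code (the file and event names here are illustrative, not Qiguo's): write one line per event with the rank, the elapsed MPI_Wtime() and a label to a per-rank file, then merge and sort the files afterwards.

#include <mpi.h>
#include <stdio.h>

/* Append "rank  elapsed-seconds  event" to a per-rank log file.
 * 't0' is a reference time taken just after MPI_Init so the timestamps
 * of different ranks are roughly comparable. */
static void log_event(FILE *log, int rank, double t0, const char *event)
{
    fprintf(log, "%d %.6f %s\n", rank, MPI_Wtime() - t0, event);
    fflush(log);                       /* keep the log usable after a crash */
}

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char name[64];
    snprintf(name, sizeof(name), "timing.%d.log", rank);
    FILE *log = fopen(name, "w");
    double t0 = MPI_Wtime();

    /* Inside the real worker loop one would call, for example:
     *   log_event(log, rank, t0, "recv_start");
     *   log_event(log, rank, t0, "compute_start");
     *   log_event(log, rank, t0, "send_result");
     *   log_event(log, rank, t0, "wait_next");                        */
    log_event(log, rank, t0, "example_event");

    fclose(log);
    MPI_Finalize();
    return 0;
}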
On Thu, Oct 23, 2014 at 5:27 PM, Qiguo Jing <qjing at trinityconsultants.com> wrote:

Hi Bob,

Thanks for your suggestions. Here are more tests. We actually have three clusters.

Clusters 1 and 2: 8 nodes, (2 processors, 4 cores/processor, no HT - total 8 threads) per node
Cluster 3: 8 nodes, (1 processor, 4 cores/processor, HT - total 8 threads) per node

We also have a standalone machine: 2 processors, 6 cores/processor, HT - total 24 threads.

For one particular case:

Clusters 1 and 2 take 48 min to finish with 8 nodes, 8 threads/node, 60% CPU usage; 53 min to finish with 3 nodes, 8 threads/node, 90% CPU usage.

Cluster 3 takes 227 min to finish with 8 nodes, 8 threads/node, 20% CPU usage; 207 min to finish with 3 nodes, 8 threads/node, 50% CPU usage.

The standalone machine takes 82 min to finish with 24 threads, 100% CPU usage.

It looks like with 24 threads they should be pretty busy? Could the above phenomenon be a hardware issue more than a software one?

Qiguo

From: Bob Ilgner [mailto:bobilgner at gmail.com]
Sent: Thursday, October 23, 2014 1:11 AM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] CPU usage versus Nodes, Threads

Hi Qiguo,

From the results table it looks as if you are using a computationally sparse algorithm, i.e. the problem may not lie with the inefficiency of your communication between threads, but simply with the fact that your algorithm is not keeping the processor busy enough for a large number of threads. The only way that you will know for sure whether this is a comms issue or an algorithmic one is to use a profiling tool, such as Vampir or Paraver.

With the profiling result you will be able to determine whether you need to make algorithmic changes in your bulk processing and enhance comms as per Huiwei's notes.

I would be interested to know what your profiling shows.

Regards, bob

On Thu, Oct 23, 2014 at 12:02 AM, Qiguo Jing <qjing at trinityconsultants.com> wrote:

Hi All,

We have a parallel program running on a cluster. We recently found a case which decreases the CPU usage and increases the run-time as the number of nodes increases. Below is the results table.

The particular run requires a lot of data communication between nodes.

Any thoughts about this phenomenon? Or is there any way we can improve the CPU usage when using a higher number of nodes?

Average CPU Usage (%)   Number of Nodes   Number of Threads/Node
100                     1                 8
 92                     2                 8
 50                     3                 8
 40                     4                 8
 35                     5                 8
 30                     6                 8
 25                     7                 8
 20                     8                 8
 20                     8                 4

Thanks!
Good day to all,

I have run the point-to-point osu_latency test on two nodes 200 times. The following are the average times, in microseconds, for various message sizes:

1KB     84.8514 us
2KB     73.52535 us
4KB    272.55275 us
8KB    234.86385 us
16KB   288.88 us
32KB   523.3725 us
64KB   910.4025 us

From the above it looks like a 2KB message has less latency than 1KB, and 8KB has less latency than 4KB.

I was looking for an explanation of this behavior but did not find any.

1. MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE is set to 128KB, so none of the above message sizes uses the rendezvous protocol. Is there any partition inside the eager protocol (e.g. 0 - 512 bytes, 1KB - 8KB, 16KB - 64KB)? If yes, what are the boundaries for them? Can I log them with debug event logging?

Setup I am using:

- two nodes with Intel Core i7, one with 16GB memory, the other with 8GB
- mpich 3.2.1, configured and built to use nemesis tcp
- 1Gb Ethernet connection
- NFS is used for sharing
- osu_latency: uses MPI_Send and MPI_Recv
- MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE = 131072 (128KB)

Can anyone help me with that? Thanks in advance.

Best Regards,
Abu Naser
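For what it is worth, the eager threshold mentioned above can also be inspected at run time through the MPI_T tools interface, which avoids guessing what value the library actually picked up from the environment. The following is only a sketch; it assumes the MPICH build exposes the control variable under exactly the name used in the message above, and error handling is omitted.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Look up an MPICH control variable by name via MPI_T and print its
 * integer value.  The variable name is the one discussed above; whether
 * it is exposed depends on the MPICH build and channel. */
int main(int argc, char *argv[])
{
    int provided, ncvar;
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&ncvar);
    for (int i = 0; i < ncvar; i++) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &bind, &scope);
        if (strcmp(name, "MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE") == 0) {
            MPI_T_cvar_handle handle;
            int count, value;
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_read(handle, &value);
            printf("%s = %d\n", name, value);
            MPI_T_cvar_handle_free(&handle);
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}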
Hello Jeff,

Yes, I am using a switch, and other machines are also connected to that switch.

If I remove the other machines and just use my two nodes with the switch, will it improve the performance by 200 ~ 400 iterations?

Meanwhile I will give it a try with a single dedicated cable.

Thank you.

Best Regards,
Abu Naser

From: Jeff Hammond <jeff.science at gmail.com>
Sent: Wednesday, June 20, 2018 12:52:06 PM
To: MPICH
Subject: Re: [mpich-discuss] osu_latency test: why 8KB takes less time than 4KB and 2KB takes less time than 1KB?

Is the ethernet connection a single dedicated cable between the two machines, or are you running through a switch that handles other traffic?

My best guess is that this is noise and that you may be able to avoid it by running a very long time, e.g. 10000 iterations.

Jeff
Hello Min,


Thanks for the clarification. I will do the experiment.


Thanks.

Best Regards,

Abu Naser


From: Min Si <msi at anl.gov>
Sent: Wednesday, June 20, 2018 1:39:30 PM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] osu_latency test: why 8KB takes less time than 4KB and 2KB takes less time than 1KB?
 
Hi Abu,

I think Jeff means that you should run your experiment with more iterations in order to get stable results.
- Increase the iteration count of the for loop in each execution (I think the OSU benchmark allows you to set it).
- Run the experiments 10 or 100 times, and take the average and standard deviation.

If you see a very small standard deviation (e.g., <=5%), then the trend is stable and you might not see such gaps.

Best regards,
Min
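
To make that concrete, a stripped-down send/recv latency loop in the spirit of osu_latency might look like the sketch below. The warm-up count, iteration count, message size, and host names in the run line are arbitrary placeholders rather than the OSU defaults, and the per-iteration timings are kept so the average and standard deviation can be printed at the end:

/* pingpong.c - minimal latency sketch in the spirit of osu_latency.
 * Build: mpicc pingpong.c -o pingpong -lm
 * Run:   mpiexec -n 2 -hosts node1,node2 ./pingpong   (host names are placeholders)
 * WARMUP, ITERS and MSG_SIZE are arbitrary choices for illustration. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define WARMUP   1000
#define ITERS    50000
#define MSG_SIZE 8192          /* bytes, e.g. the 8KB case discussed above */

int main(int argc, char **argv)
{
    int rank;
    char *buf = calloc(MSG_SIZE, 1);
    double *t = malloc(ITERS * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* negative indices are warm-up iterations and are not recorded */
    for (int i = -WARMUP; i < ITERS; i++) {
        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        if (i >= 0)
            t[i] = (MPI_Wtime() - t0) * 1e6 / 2.0;   /* one-way latency in us */
    }

    if (rank == 0) {
        double sum = 0.0, var = 0.0, mean;
        for (int i = 0; i < ITERS; i++) sum += t[i];
        mean = sum / ITERS;
        for (int i = 0; i < ITERS; i++) var += (t[i] - mean) * (t[i] - mean);
        printf("%d bytes: avg %.2f us, stddev %.2f us over %d iterations\n",
               MSG_SIZE, mean, sqrt(var / (ITERS - 1)), ITERS);
    }

    free(t);
    free(buf);
    MPI_Finalize();
    return 0;
}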

On Wed, Jun 20, 2018 at 6:53 AM, Abu Naser <an16e at my.fsu.edu> wrote:

Good day to all,

I had run the point-to-point osu_latency test between two nodes 200 times. The following are the average times in microseconds for various message sizes:

1KB     84.8514 us
2KB     73.52535 us
4KB     272.55275 us
8KB     234.86385 us
16KB    288.88 us
32KB    523.3725 us
64KB    910.4025 us

From the above it looks like a 2KB message has less latency than 1KB, and 8KB has less latency than 4KB.

I was looking for an explanation of this behavior but did not find any.

  1. MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE is set to 128KB, so none of the above message sizes uses the rendezvous protocol. Is there any partition inside the eager protocol (e.g. 0 - 512 bytes, 1KB - 8KB, 16KB - 64KB)? If yes, what are the boundaries? Can I log them with debug-event-logging?

Setup I am using:

- two nodes with Intel Core i7, one with 16 GB memory, the other with 8 GB
- mpich 3.2.1, configured and built to use nemesis tcp
- 1 Gb Ethernet connection
- NFS is used for sharing
- osu_latency: uses MPI_Send and MPI_Recv
- MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE = 131072 (128KB)

Can anyone help me with that? Thanks in advance.

Best Regards,

Abu Naser
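
On the eager-boundary question above: one way to see which eager-related thresholds a given MPICH build actually exposes, without digging through the source, is the standard MPI_T control-variable interface. The sketch below is only an illustration (the buffer sizes and the "EAGER" substring filter are arbitrary choices); run it as a single process and it prints the name and description of every control variable whose name contains "EAGER":

/* list_eager_cvars.c - list MPI_T control variables whose name contains
 * "EAGER", using the standard MPI_T tools interface (MPI-3).
 * Build: mpicc list_eager_cvars.c -o list_eager_cvars
 * Run:   mpiexec -n 1 ./list_eager_cvars */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, ncvar;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&ncvar);
    for (int i = 0; i < ncvar; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype, &enumtype,
                            desc, &desc_len, &binding, &scope);
        if (strstr(name, "EAGER") != NULL)
            printf("%s\n    %s\n", name, desc);
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}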




Hello Min and Jeff,


Here are my experiment results. The default number of iterations in osu_latency for 0B - 8KB is 10,000. With that setting I ran osu_latency 100 times and found a standard deviation of 33 for the 8KB message size.

So I then set the iteration count to 50,000 and 100,000 for the 1KB - 16KB message sizes, ran osu_latency 100 times for each setting, and took the average and standard deviation.


Msg Size    Avg time in us      Avg time in us      Std deviation       Std deviation
in Bytes    (50K iterations)    (100K iterations)   (50K iterations)    (100K iterations)
1k          85.10               84.9                0.55                0.45
2k          75.79               74.63               5.09                4.44
4k          273.80              274.71              4.18                2.45
8k          258.56              249.83              21.14               28
16k         281.31              281.02              3.22                4.10



The standard deviation for the 8K message is so high that it is not really producing a consistent latency at all. That looks like the reason 8K appears to take less time than 4K.

Meanwhile, 2K has a standard deviation below 5, but the 1K latency timings are more densely clustered than the 2K ones. That is probably the explanation for the lower 2K latency.

Thank you for your suggestions.




Best Regards,

Abu Naser




Hello Min,


After compiling my mpich-3.2.1 with sock, when I try to run any program (including the OSU benchmarks or examples/cpi) across the two machines, I receive the following error:


Process 3 of 4 is on dhcp16194
Process 1 of 4 is on dhcp16194
Process 0 of 4 is on dhcp16198
Process 2 of 4 is on dhcp16198
Fatal error in PMPI_Bcast: Unknown error class, error stack:
PMPI_Bcast(1600)............................: MPI_Bcast(buf=0x7ffc1808542c, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1452).......................:
MPIR_Bcast(1476)............................:
MPIR_Bcast_intra(1249)......................:
MPIR_SMP_Bcast(1081)........................:
MPIR_Bcast_binomial(285)....................:
MPIC_Send(303)..............................:
MPIC_Wait(226)..............................:
MPIDI_CH3i_Progress_wait(242)...............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(698)..:
MPIDI_CH3_Sockconn_handle_connect_event(597): [ch3:sock] failed to connnect to remote process
MPIDU_Socki_handle_connect(808).............: connection failure (set=0,sock=1,errno=111:Connection refused)
MPIR_SMP_Bcast(1088)........................:
MPIR_Bcast_binomial(310)....................: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1600)........: MPI_Bcast(buf=0x7ffd9eeebdac, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1452)...:
MPIR_Bcast(1476)........:
MPIR_Bcast_intra(1249)..:
MPIR_SMP_Bcast(1088)....:
MPIR_Bcast_binomial(310): Failure during collective

I checked the MPICH FAQ and also the mpich discussion list. Based on that I have checked the following, and they are fine on my machines:

- the firewall is disabled on both machines
- I can do passwordless ssh between both machines
- /etc/hosts on both machines is configured with the proper IP addresses and names
- I have updated the library path and used an absolute path for mpiexec
- most importantly, when I configure and build mpich with tcp, it works fine

I think I am missing something but could not figure out what yet. Any help would be appreciated.


Thank you.






Best Regards,

Abu Naser


From: Min Si <msi at anl.gov>
Sent: Tuesday, June 26, 2018 12:54:29 PM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] osu_latency test: why 8KB takes less time than 4KB and 2KB takes less time than 1KB?

Hi Abu,

I think the results are stable enough. Perhaps you could also try the following tests and see if a similar trend exists:
- MPICH/socket (set `--with-device=ch3:sock` at configure)
- A socket-based pingpong test without MPI.

At this point, I could not think of any MPI-specific design for 2k/8k messages. My guess is that it is related to your network connection.

Min
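
For the second suggestion, a socket-based pingpong without MPI could be as small as the sketch below. The port number, message size, and iteration count are arbitrary placeholders, and TCP_NODELAY is set because Nagle's algorithm would otherwise distort small-message round-trip times; comparing its numbers with osu_latency at the same sizes helps separate raw network behavior from anything MPI-specific:

/* tcp_pingpong.c - minimal TCP round-trip latency test (no MPI).
 * Server:  ./tcp_pingpong server
 * Client:  ./tcp_pingpong client <server-hostname-or-ip>
 * PORT, MSG_SIZE and ITERS are arbitrary choices for illustration. */
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define PORT     18515
#define MSG_SIZE 8192
#define ITERS    10000

/* send or receive exactly len bytes */
static void xfer(int fd, char *buf, int len, int do_send)
{
    for (int done = 0; done < len; ) {
        int n = do_send ? send(fd, buf + done, len - done, 0)
                        : recv(fd, buf + done, len - done, 0);
        if (n <= 0) { perror("xfer"); exit(1); }
        done += n;
    }
}

int main(int argc, char **argv)
{
    char buf[MSG_SIZE];
    int one = 1;

    if (argc >= 2 && strcmp(argv[1], "server") == 0) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(PORT),
                                    .sin_addr.s_addr = INADDR_ANY };
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 1);
        int fd = accept(lfd, NULL, NULL);
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
        for (int i = 0; i < ITERS; i++) {      /* echo every message back */
            xfer(fd, buf, MSG_SIZE, 0);
            xfer(fd, buf, MSG_SIZE, 1);
        }
        close(fd); close(lfd);
    } else if (argc >= 3 && strcmp(argv[1], "client") == 0) {
        struct hostent *he = gethostbyname(argv[2]);
        if (!he) { fprintf(stderr, "unknown host %s\n", argv[2]); return 1; }
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(PORT) };
        memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            perror("connect"); return 1;
        }
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++) {
            xfer(fd, buf, MSG_SIZE, 1);
            xfer(fd, buf, MSG_SIZE, 0);
        }
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%d bytes: %.2f us one-way average over %d iterations\n",
               MSG_SIZE, us / ITERS / 2.0, ITERS);
        close(fd);
    } else {
        fprintf(stderr, "usage: %s server | client <host>\n", argv[0]);
        return 1;
    }
    return 0;
}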


--_000_BLUPR0501MB2003414CB97CA97A0242D0BC97430BLUPR0501MB2003_-- --===============7322407779089830927== Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss --===============7322407779089830927==-- From bogus@does.not.exist.com Thu Apr 17 12:32:09 2025 From: bogus@does.not.exist.com () Date: Thu, 17 Apr 2025 17:32:09 -0000 Subject: No subject Message-ID: as less latency than 4KB. I was looking for explanation of this behavior but did not get any. 1. MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE is set to 128KB. So none of the abov= e message size is using Rendezvous protocol. Is there any partition inside = eager protocol (e.g. 0 - 512 bytes, 1KB - 8KB, 16KB - 64KB)? If yes then wh= at are the boundaries for them? Can I log them with debug-event-logging? Setup I am using: - two nodes has intel core i7, one with 16gb memory another one 8gb - mpich 3.2.1, configured and build to use nemesis tcp - 1gb Ethernet connection - NFS is using for sharing - osu_latency : uses MPI_Send and MPI_Recv - MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE=3D 131072 (128KB) Can anyone help me on that? Thanks in advance. Best Regards, Abu Naser _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -- Jeff Hammond jeff.science at gmail.com http://jeffhammond.github.io/ _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss --_000_BLUPR0501MB2003DCD7FDB382061050A6B997430BLUPR0501MB2003_ Content-Type: text/html; charset="Windows-1252" Content-Transfer-Encoding: quoted-printable

Hello Min,


I have downloaded it from http://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1.tar.gz but it did not work. I have received almost the same error, except that this time there is no process information from my remote machine.

Previously I have received this:

Process 3 of 4 is on dhcp16194
Process 1 of 4 is on dhcp16194
Process 0 of 4 is on dhcp16198
Process 2 of 4 is on dhcp16198

With the new source code:

Process 0 of 4 is on dhcp16198
Process 2 of 4 is on dhcp16198

The entire error message is:

Process 0 of 4 is on dhcp16198
Process 2 of 4 is on dhcp16198
Fatal error in PMPI_Bcast: Unknown error class, error stack:
PMPI_Bcast(1600)............................: MPI_Bcast(buf=0x7ffd1ee145f0, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1452).......................:
MPIR_Bcast(1476)............................:
MPIR_Bcast_intra(1249)......................:
MPIR_SMP_Bcast(1081)........................:
MPIR_Bcast_binomial(285)....................:
MPIC_Send(303)..............................:
MPIC_Wait(226)..............................:
MPIDI_CH3i_Progress_wait(242)...............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(698)..:
MPIDI_CH3_Sockconn_handle_connect_event(597): [ch3:sock] failed to connnect to remote process
MPIDU_Socki_handle_connect(808).............: connection failure (set=0,sock=1,errno=111:Connection refused)
MPIR_SMP_Bcast(1088)........................:
MPIR_Bcast_binomial(310)....................: Failure during collective
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1600)........: MPI_Bcast(buf=0x7ffe2eeb90f0, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1452)...:
MPIR_Bcast(1476)........:
MPIR_Bcast_intra(1249)..:
MPIR_SMP_Bcast(1088)....:
MPIR_Bcast_binomial(310): Failure during collective

Again, if I configure the new source with tcp, it works fine.


Thank You.


Best Regards,

Abu Naser


From: Min Si <msi at anl.gov>
Sent: Monday, July 2, 2018 11:56:51 AM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] osu_latency test: why 8KB takes less time than 4KB and 2KB takes less time than 1KB?

Hi Abu,

Thanks for reporting this. Can you please try the latest release with ch3/sock and see if you still have this error?

Min

--_000_BLUPR0501MB2003DCD7FDB382061050A6B997430BLUPR0501MB2003_-- --===============0610270028680441889== Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss --===============0610270028680441889==-- From bogus@does.not.exist.com Thu Apr 17 12:32:09 2025 From: bogus@does.not.exist.com () Date: Thu, 17 Apr 2025 17:32:09 -0000 Subject: No subject Message-ID: as less latency than 4KB. I was looking for explanation of this behavior but did not get any. 1. MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE is set to 128KB. So none of the abov= e message size is using Rendezvous protocol. Is there any partition inside = eager protocol (e.g. 0 - 512 bytes, 1KB - 8KB, 16KB - 64KB)? If yes then wh= at are the boundaries for them? Can I log them with debug-event-logging? Setup I am using: - two nodes has intel core i7, one with 16gb memory another one 8gb - mpich 3.2.1, configured and build to use nemesis tcp - 1gb Ethernet connection - NFS is using for sharing - osu_latency : uses MPI_Send and MPI_Recv - MPIR_CVAR_CH3_EAGER_MAX_MSG_SIZE=3D 131072 (128KB) Can anyone help me on that? Thanks in advance. Best Regards, Abu Naser _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss -- Jeff Hammond jeff.science at gmail.com http://jeffhammond.github.io/ _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss _______________________________________________ discuss mailing list discuss at mpich.org To manage subscription options or unsubscribe: https://lists.mpich.org/mailman/listinfo/discuss --_000_BLUPR0501MB2003D13F739833B52E59DBB097430BLUPR0501MB2003_ Content-Type: text/html; charset="Windows-1252" Content-Transfer-Encoding: quoted-printable

Hello Min,


Now it works in some cases and not in others.

Cases where it worked:

- the application binary (e.g. cpi, osu_bw, osu_latency) is compiled with the other mpicc (generated when configured with tcp); the mpiexec generated when configured with sock can run it.

Cases not working:

- the application binary (e.g. cpi) is compiled with the mpicc generated when configured with sock; the mpiexec generated when configured with sock cannot run it and produces the same error message. [The library path was set.]
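
One way to rule out a mix-up between the two builds is to have the binary itself report which MPI library it is actually linked against at run time. MPI_Get_library_version is standard MPI-3; for MPICH the returned string typically includes the version and configure information, although the exact contents are implementation-defined. A minimal sketch:

/* which_mpi.c - print the MPI library version string so you can confirm
 * which build (tcp or sock) a binary is actually linked against.
 * Build: mpicc which_mpi.c -o which_mpi
 * Run:   mpiexec -n 1 ./which_mpi */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len;

    MPI_Init(&argc, &argv);
    MPI_Get_library_version(version, &len);
    printf("%s\n", version);
    MPI_Finalize();
    return 0;
}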


Thank you.



Best Regards,

Abu Naser


From: Min Si <msi at anl.gov>
Sent: Monday, July 2, 2018 2:10:23 PM
To: discuss at mpich.org
Subject: Re: [mpich-discuss] osu_latency test: why 8KB takes less time than 4KB and 2KB takes less time than 1KB?

Could you please try mpich-3.3b3?
http://www.mpich.org/static/downloads/3.3b3/mpich-3.3b3.tar.gz

Min

From Martin.Audet at cnrc-nrc.gc.ca  Wed Apr 16 11:13:28 2025
From: Martin.Audet at cnrc-nrc.gc.ca (Audet, Martin)
Date: Wed, 16 Apr 2025 16:13:28 +0000
Subject: [mpich-discuss] mpich 4.3.0 compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Message-ID: <82d94e0b0ff84c63904bf29bf548eac1@cnrc-nrc.gc.ca>

Hello mpich community,

When I try to compile mpich 4.3.0 configured with the --with-hcoll=/opt/mellanox/hcoll option, I get a compilation error because the hcoll_do_progress() function is defined with two arguments in hcoll_init.c but is called with only one in hcoll_rte.c!

Here is the error message I get:

src/mpid/common/hcoll/hcoll_rte.c: In function 'progress':
src/mpid/common/hcoll/hcoll_rte.c:58:33: warning: passing argument 1 of 'hcoll_do_progress' makes integer from pointer without a cast [-Wint-conversion]
   58 |     ret = hcoll_do_progress(&made_progress);
      |                             ^~~~~~~~~~~~~~
      |                             |
      |                             int *
In file included from ./src/mpid/ch4/netmod/include/../ucx/ucx_coll.h:11,
                 from ./src/mpid/ch4/netmod/include/../ucx/netmod_inline.h:15,
                 from ./src/mpid/ch4/netmod/include/netmod_impl.h:1589,
                 from ./src/mpid/ch4/include/mpidch4.h:448,
                 from ./src/mpid/ch4/include/mpidpost.h:10,
                 from ./src/include/mpiimpl.h:232,
                 from src/mpid/common/hcoll/hcoll_rte.c:6:
./src/mpid/ch4/netmod/include/../ucx/../../../common/hcoll/hcoll.h:42:27: note: expected 'int' but argument is of type 'int *'
   42 | int hcoll_do_progress(int vci, int *made_progress);
      |                       ~~~~^~~
src/mpid/common/hcoll/hcoll_rte.c:58:15: error: too few arguments to function 'hcoll_do_progress'
   58 |     ret = hcoll_do_progress(&made_progress);
      |           ^~~~~~~~~~~~~~~~~
In file included from ./src/mpid/ch4/netmod/include/../ucx/ucx_coll.h:11,
                 from ./src/mpid/ch4/netmod/include/../ucx/netmod_inline.h:15,
                 from ./src/mpid/ch4/netmod/include/netmod_impl.h:1589,
                 from ./src/mpid/ch4/include/mpidch4.h:448,
                 from ./src/mpid/ch4/include/mpidpost.h:10,
                 from ./src/include/mpiimpl.h:232,
                 from src/mpid/common/hcoll/hcoll_rte.c:6:
./src/mpid/ch4/netmod/include/../ucx/../../../common/hcoll/hcoll.h:42:5: note: declared here
   42 | int hcoll_do_progress(int vci, int *made_progress);
      |     ^~~~~~~~~~~~~~~~~

I used to compile mpich versions 3.4.x, 4.1.x, and 4.2.x configured with this option (--with-hcoll=) without any problems. It looks like some recent changes in the related files introduced a problem that slipped into 4.3.0 and makes compilation impossible.

Could it be fixed? Or could the --with-hcoll option be removed if it is no longer relevant? (I guess that if we use ch4:ucx, ucx may itself use hcoll internally to optimize collective operations when running in a hierarchical environment.)
Here are some details:

arch:  x86_64
OS:    RHEL 9.5 (up to date except kernel)
MOFED: 24.10-2.1.8.0-LTS
hcoll: 4.8.3230-1.2410068
ucx:   1.18.0-1.2410068

The complete configuration line:

./configure --with-device=ch4:ucx --with-hcoll=/opt/mellanox/hcoll --prefix=/work/software/x86_64/mpich/mpich-ch4_ucx-4.3.0 --with-xpmem --enable-g=none --enable-fast=all --enable-romio --with-file-system=ufs+nfs+lustre --enable-shared --enable-sharedlibs=gcc

Thanks,

Martin Audet

From zhouh at anl.gov  Wed Apr 16 11:17:04 2025
From: zhouh at anl.gov (Zhou, Hui)
Date: Wed, 16 Apr 2025 16:17:04 +0000
Subject: [mpich-discuss] mpich 4.3.0 compilation problem when using --with-hcoll=/opt/mellanox/hcoll
In-Reply-To: <82d94e0b0ff84c63904bf29bf548eac1@cnrc-nrc.gc.ca>
References: <82d94e0b0ff84c63904bf29bf548eac1@cnrc-nrc.gc.ca>

Hi Martin,

Could you try the patch in https://github.com/pmodels/mpich/pull/7047 ?

--
Hui
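
For anyone hitting the same build failure before the fix lands in a release: the diagnostics above show that hcoll.h now declares hcoll_do_progress(int vci, int *made_progress) while the call site in hcoll_rte.c still passes only the pointer. A hypothetical local workaround, assuming progress on the default VCI (index 0) is what is intended, is sketched below; it only illustrates the mismatch and is not necessarily what the upstream patch in PR 7047 does:

/* Sketch of a local edit around src/mpid/common/hcoll/hcoll_rte.c:58.
 * Illustration of the signature mismatch only; the real fix is the
 * upstream patch referenced above (PR 7047). */

/* old call, which no longer matches the declaration in hcoll.h:
 *     ret = hcoll_do_progress(&made_progress);
 * updated call, passing VCI 0 (an assumption about the intended index): */
ret = hcoll_do_progress(0, &made_progress);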
From Martin.Audet at cnrc-nrc.gc.ca  Wed Apr 16 13:54:27 2025
From: Martin.Audet at cnrc-nrc.gc.ca (Audet, Martin)
Date: Wed, 16 Apr 2025 18:54:27 +0000
Subject: [mpich-discuss] mpich 4.3.0 compilation problem when using --with-hcoll=/opt/mellanox/hcoll
In-Reply-To: <82d94e0b0ff84c63904bf29bf548eac1@cnrc-nrc.gc.ca>
Message-ID: <964461c365474f7da55922ade7c76007@cnrc-nrc.gc.ca>

Hello Hui,

I tried the patch and it works (it compiles at least).

Thanks for your very quick response!

Martin

From: Zhou, Hui
Sent: April 16, 2025 12:17 PM
To: discuss at mpich.org
Cc: Audet, Martin
Subject: EXT: Re: [mpich-discuss] mpich 4.3.0 compilation problem when using --with-hcoll=/opt/mellanox/hcoll
Hi Martin,

Could you try the patch in https://github.com/pmodels/mpich/pull/7047 ?

--
Hui