[mpich-discuss] Problem Running MPI on cluster

Md. Amjad Hossain mhossai2 at kent.edu
Wed Oct 22 02:04:07 CDT 2014


> Thank you so much Ken for your reply.
>
> Do I have to copy the executable file to all machines? What I am doing is
> coding, compiling and running on a single machine, and host_file contains
> the names of the other machines.
>
> I have run the command you gave me. It prints only the name of the
> machine I am executing the command on and then shows all the previous
> errors. Here are the outputs from running the command twice:
>
>
> [mhossain@md-lin-01 mpi_hello_world]$ /usr/lib64/mpich/bin/mpirun -n 4 -f host_file hostname
> md-lin-01.mcs.kent.edu
> [proxy:0:1@md-lin-02.mcs.kent.edu] launch_procs (./pm/pmiserv/pmip_cb.c:648): unable to change wdir to /home/mhossain/testMpi/mpi_hello_world (No such file or directory)
> [proxy:0:1@md-lin-02.mcs.kent.edu] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:893): launch_procs returned error
> [proxy:0:1@md-lin-02.mcs.kent.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:1@md-lin-02.mcs.kent.edu] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [mpiexec@md-lin-01.mcs.kent.edu] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
> [mpiexec@md-lin-01.mcs.kent.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec@md-lin-01.mcs.kent.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
> [mpiexec@md-lin-01.mcs.kent.edu] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
>
> [mhossain@md-lin-01 mpi_hello_world]$ /usr/lib64/mpich/bin/mpirun -n 4 -f host_file hostname
> md-lin-01.mcs.kent.edu
> [mpiexec@md-lin-01.mcs.kent.edu] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
> [mpiexec@md-lin-01.mcs.kent.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec@md-lin-01.mcs.kent.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
> [mpiexec@md-lin-01.mcs.kent.edu] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
>
>
> Is this a problem with the configuration or with the MPICH version?
>
> Regards
> Amjad
>
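On the "unable to change wdir" error above: the Hydra proxy on md-lin-02 tries to change into /home/mhossain/testMpi/mpi_hello_world, the directory mpirun was launched from on md-lin-01. That directory, and the executable inside it, therefore needs to exist at the same path on every node in host_file, unless the home directories are shared (for example over NFS). The following is a minimal sketch of staging the binary by hand. It assumes passwordless ssh and scp, as stated in the original post, and a host file with one hostname per line, optionally written as host:count. The function names stage_binary and run and the DRY_RUN switch are hypothetical, not part of MPICH.

```shell
#!/bin/sh
# Hypothetical helper (not part of MPICH): copy an MPI executable to the same
# absolute directory on every host listed in a Hydra host file, because each
# remote Hydra proxy changes into the launch directory before starting ranks.
# Assumes passwordless ssh/scp and one host name per line, optionally "host:count".
# Set DRY_RUN=1 to print the commands instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

stage_binary() {
    hostfile=$1
    binary=$2
    dir=$(pwd)
    me=$(hostname 2>/dev/null)
    while IFS= read -r line || [ -n "$line" ]; do
        host=${line%%#*}      # drop a trailing "# comment"
        host=${host%%:*}      # drop an optional ":count" suffix
        host=$(printf '%s' "$host" | tr -d '[:space:]')
        if [ -z "$host" ]; then
            continue          # blank or comment-only line
        fi
        # Skip the launching node (compared by short name): an scp onto the
        # very same path could truncate the executable.
        if [ "${host%%.*}" = "${me%%.*}" ]; then
            continue
        fi
        # -n and </dev/null stop ssh/scp from swallowing the host file on stdin.
        run ssh -n "$host" mkdir -p "$dir"
        run scp "$binary" "$host:$dir/" </dev/null
    done < "$hostfile"
}

# Example, from the directory that holds the executable:
# stage_binary host_file ./mpi_hello_world
```

Sourced into a shell and run from the directory that holds the executable, this should leave a copy at the same path on each remote node, after which the original mpirun command can be retried. The launching host is skipped by comparing short hostnames, so the sketch assumes host_file uses names rather than IP addresses.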

> Does your mpi_hello_world binary exist in the same directory on all the
> machines you are trying to run on? Can you try running this:
>
> /usr/lib64/mpich/bin/mpirun -n 4 -f host_file hostname
>
> If it outputs the names of the hosts in your hostfile, we can be
> confident that your mpirun and ssh setup is functioning correctly.
>
> Ken
>

> On 10/21/2014 12:09 AM, Md. Amjad Hossain wrote:
> > Hi, I am trying to run a simple hello world program on cluster nodes. I am
> > running it with the following command but getting errors:
> >
> > Command:  /usr/lib64/mpich/bin/mpirun -n 4 -f host_file ./mpi_hello_world
> >
> > errors:
> > [mpiexec@md-lin-01.mcs.kent.edu] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
> > [mpiexec@md-lin-01.mcs.kent.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> > [mpiexec@md-lin-01.mcs.kent.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
> > [mpiexec@md-lin-01.mcs.kent.edu] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
> >
> >
> > Before running the command I set the variables MPIRUN to the MPI
> > directory and MPI_HOSTS=host_file. The "host_file" lists four nodes, and
> > they can ssh to each other without a password.
> >
> > The MPICH version I am running is 3.0.4. The MPI code is attached.
> >
> > Could anyone help me solve this problem, please?
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mpich.org/pipermail/discuss/attachments/20141022/68c8253e/attachment.html>
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss