[mpich-discuss] MPI_LOOKUP_NAME getting wrong port back?

Alexander Rast alex.rast.technical at gmail.com
Wed Mar 21 09:10:38 CDT 2018


More information: this looks as if it may be a bug in hydra_nameserver. I
found the following:

If you start hydra_nameserver and run mpiexec for both my 'server' and
'client' groups, then *the first time*, provided everything is set up
correctly, the client retrieves the right port and connects.
However, if you run both 'server' and 'client' a second time after the MPI
application has exited, the server gets a new port ID and appears to store
it to the nameserver, BUT the client retrieves a port ID that is basically
the port ID the server had on its first run, plus some (nonrandom) garbage
at the end. The garbage appears to be a small slice of one of the text
strings corresponding to a port ID. It therefore seems that the nameserver
isn't (entirely) deleting the published port information when an MPI
application exits, and that it then does something strange when the same
service name is looked up for a different run.
If you kill the hydra_nameserver process, then when you run the
applications again with 'server' and 'client' they connect.
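
To make the symptom easier to spot in test output, here is a minimal
diagnostic sketch in plain C (no MPI calls). It assumes the
key#value$ layout seen in the port strings quoted in the original message
below; MPI itself treats port names as opaque, so that layout is only an
observation about this MPICH build. The function names are hypothetical
helpers, not part of MPICH.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the text that follows the last '$' in a port name.
 * In the strings observed here a healthy port name ends with '$',
 * so a non-empty result suggests trailing garbage (ASSUMPTION:
 * this layout is inferred from the output, not documented). */
const char *port_trailing_text(const char *port)
{
    assert(port != NULL);
    const char *last = strrchr(port, '$');
    return last ? last + 1 : port;
}

/* Extract the number in the "$port#<n>$" field, or -1 if the field
 * is missing or not numeric. */
long port_number(const char *port)
{
    assert(port != NULL);
    const char *field = strstr(port, "$port#");
    if (field == NULL)
        return -1;
    const char *digits = field + strlen("$port#");
    char *end;
    long value = strtol(digits, &end, 10);
    return (end == digits) ? -1 : value;
}

/* Return 1 if the name the client looked up is byte-for-byte the
 * name the server published, 0 otherwise. */
int port_names_match(const char *published, const char *looked_up)
{
    assert(published != NULL && looked_up != NULL);
    return strcmp(published, looked_up) == 0;
}
```

Calling port_names_match() on the string the server printed and the string
the client got back from MPI_LOOKUP_NAME flags a mismatch directly, and
port_trailing_text() shows whatever was appended after the final '$'.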

Incidentally, I got the following errors reported from what looks to be the
hydra_nameserver on one run, when the 'server' group executed
MPI_COMM_ACCEPT:

[cli_1]: PMIU_parse_keyvals: unexpected key delimiter at character 96 in cmd
[cli_1]: parse_keyvals failed -1

However, the 'client' group successfully connected in spite of those
messages.

On Wed, Mar 21, 2018 at 12:58 AM, Alexander Rast <
alex.rast.technical at gmail.com> wrote:

> I'm trying to test the MPI publish-connect methods and running into a
> situation where for some reason MPI_LOOKUP_NAME is returning the wrong
> port. You can see the test source and output in the attached zipfile.
>
> I'm testing this (on Ubuntu 16.04, MPICH 3.2) by opening 2 terminal
> windows and starting one group of 2 processes (the 'server' group) in one,
> and another group (the 'client' group) in another. Beforehand I've started
> the hydra name server using hydra_nameserver & (in the same terminal
> session as the server group). The exact command line for the server group is:
>
> mpiexec -nameserver DeepThought -n 1 ./MPI_Example_C_0 server : -n 1
> ./MPI_Example_C_1 server
>
> and for the client group:
>
> mpiexec -nameserver DeepThought -n 1 ./MPI_Example_C_0 client : -n 1
> ./MPI_Example_C_1 client
>
> If you look at the attached output files, you can see that the problem is
> straightforward: MPI_LOOKUP_NAME is returning the wrong port name for the
> server group. Rather surprisingly, too, it *always* returns the exact same
> port name for any run:
>
>  tag#0$description#DeepThought$port#50178$ifname#127.0.1.1$127.0.P
>
> even though the port name reported for the server varies: in this run it
> was
>
>  tag#0$description#DeepThought$port#33341$ifname#127.0.1.1$
>
> The clients are clearly performing the lookup and successfully querying
> the nameserver because if I don't run the server and simply run the clients
> they abort saying they couldn't find a published port, as expected.
>
> So something strange is going on. Has anyone run into this? Is there
> something I need to do otherwise?
>
> Thanks for any help.
>
> Alex Rast
>