[mpich-discuss] MPI_LOOKUP_NAME getting wrong port back?

Alexander Rast alex.rast.technical at gmail.com
Wed Mar 21 09:45:35 CDT 2018


Fixed. At least the nameserver lookup part of it. Those [cli_1] messages
remain, but the application now starts up and runs to completion without
errors on both sides.

The solution is to call MPI_Unpublish_name on the 'server' node that
published the name, before it exits. It's not immediately obvious from the
MPI documentation that you need to do this, and judging from several online
posts in various groups it isn't immediately obvious to others either, but
it makes sense. Anyway, maybe this note will be useful for posterity.
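
For reference, the shape of the sequence that works is roughly the
following. This is a stripped-down sketch rather than the actual test
source from the earlier zipfile: a single process on each side, and
'my_service' is just a placeholder service name.

    #include <mpi.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        char port_name[MPI_MAX_PORT_NAME];
        MPI_Comm inter;

        MPI_Init(&argc, &argv);

        if (argc > 1 && strcmp(argv[1], "server") == 0) {
            /* server: open a port, publish it, wait for the client */
            MPI_Open_port(MPI_INFO_NULL, port_name);
            MPI_Publish_name("my_service", MPI_INFO_NULL, port_name);
            MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                            &inter);

            /* ... exchange data over 'inter' ... */

            MPI_Comm_disconnect(&inter);
            /* the step that was missing: withdraw the published name
               before exiting, then close the port */
            MPI_Unpublish_name("my_service", MPI_INFO_NULL, port_name);
            MPI_Close_port(port_name);
        } else {
            /* client: look up the published name and connect to it */
            MPI_Lookup_name("my_service", MPI_INFO_NULL, port_name);
            MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                             &inter);

            /* ... exchange data over 'inter' ... */

            MPI_Comm_disconnect(&inter);
        }

        MPI_Finalize();
        return 0;
    }

Without the MPI_Unpublish_name call the first run still works; it's the
second and subsequent runs against the same nameserver that pick up the
stale port string.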

So now we're just down to those mysterious error messages from what looks
to be the nameserver.

On Wed, Mar 21, 2018 at 2:10 PM, Alexander Rast <
alex.rast.technical at gmail.com> wrote:

> More information: this looks as if it may be a bug in hydra_nameserver. I
> found the following:
>
> If you start up the hydra_nameserver, and do an mpiexec with both my
> 'server' and 'client' groups, *the first time*, if you've got things set up
> OK, the client retrieves the right port and connects.
> However, after the MPI application exits, if you then try to run it a
> second time on both 'server' and 'client', the server gets a new port ID
> and appears to store it to the nameserver, BUT the client retrieves a port
> ID that is basically the port ID the server had on its first run, plus
> some (nonrandom) garbage at the end. The garbage at the end appears to be
> some small slice of one of the text strings corresponding to a port ID. It
> seems therefore that the nameserver isn't (entirely) deleting the published
> port information when an MPI application exits. And then it's doing
> something strange when it tries to look up the same service name for a
> different run. If you kill the hydra_nameserver process, then when you run
> the applications again with 'server' and 'client' they connect.
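>
> In other words the workaround for now is to restart the nameserver
> between runs, something along the lines of (exact process management is
> up to you):
>
>     pkill hydra_nameserver
>     hydra_nameserver &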
>
> Incidentally, I got the following errors reported from what looks to be
> the hydra_nameserver on one run, when the 'server' group executed
> MPI_COMM_ACCEPT:
>
> [cli_1]: PMIU_parse_keyvals: unexpected key delimiter at character 96 in
> cmd
> [cli_1]: parse_keyvals failed -1
>
> However, the 'client' group successfully connected in spite of those
> messages.
>
> On Wed, Mar 21, 2018 at 12:58 AM, Alexander Rast <
> alex.rast.technical at gmail.com> wrote:
>
>> I'm trying to test the MPI publish-connect methods and running into a
>> situation where for some reason MPI_LOOKUP_NAME is returning the wrong
>> port. You can see the test source and output in the attached zipfile.
>>
>> I'm testing this (on Ubuntu 16.04, MPICH 3.2) by opening 2 terminal
>> windows and starting one group of 2 processes (the 'server' group) in one,
>> and another group (the 'client' group) in another. Beforehand I've started
>> the hydra name server using hydra_nameserver & (in the same terminal
>> session as the server group). The exact command line for the server group is:
>>
>> mpiexec -nameserver DeepThought -n 1 ./MPI_Example_C_0 server : -n 1
>> ./MPI_Example_C_1 server
>>
>> and for the client group:
>>
>> mpiexec -nameserver DeepThought -n 1 ./MPI_Example_C_0 client : -n 1
>> ./MPI_Example_C_1 client
>>
>> If you look at the attached output files, you can see that the problem is
>> straightforward: MPI_LOOKUP_NAME is returning the wrong port name for the
>> server group. Rather surprisingly, too, it *always* returns the exact same
>> port name for any run:
>>
>>  tag#0$description#DeepThought$port#50178$ifname#127.0.1.1$127.0.P
>>
>> even though the port name reported for the server varies: in this run it
>> was
>>
>>  tag#0$description#DeepThought$port#33341$ifname#127.0.1.1$
>>
>> The clients are clearly performing the lookup and successfully querying
>> the nameserver, because if I don't run the server and simply run the
>> clients, they abort saying they couldn't find a published port, as expected.
>>
>> So something strange is going on. Has anyone run into this? Is there
>> something I need to do otherwise?
>>
>> Thanks for any help.
>>
>> Alex Rast
>>
>
>