[mpich-discuss] Hybrid HPC system

Min Si msi at anl.gov
Tue Jan 24 12:45:04 CST 2017


It seems something is wrong with the process group ID. Can you try the 
execution again with MPICH debug messages enabled? You can enable debugging as follows:
- configure MPICH again with --enable-g=all, then make && make install
- before executing:
    mkdir -p log/
    export MPICHD_DBG_PG=yes
    export MPICH_DBG_FILENAME="log/dbg-%d.log"
    export MPICH_DBG_CLASS=ALL
    export MPICH_DBG_LEVEL=VERBOSE
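
For reference, the whole flow in one place would look roughly like this
(the install prefix and hostfile name below are only placeholders; adjust
them to your setup):
    ./configure --enable-g=all --prefix=/tmp/mpich/install
    make && make install
    # set the debug variables listed above, then run:
    mpiexec -np 2 -f hostfile ./helloworld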

Then, could you send me the output and the log files in log/?

Min

On 1/23/17 10:46 AM, Doha Ehab wrote:
> Hi Min,
>  I have attached the two config.log files, and here is the code:
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     int i = 0;
>
>     MPI_Init(&argc, &argv);      /* starts MPI */
>
>     /* Find out rank and size */
>     int world_rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>     int world_size;
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>
>     int number;
>     if (world_rank == 0) {
>         /* Rank 0 sends the number to every other rank */
>         number = -1;
>         for (i = 1; i < world_size; i++) {
>             MPI_Send(&number, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
>         }
>     } else {
>         MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
>                  MPI_STATUS_IGNORE);
>         printf("Process %d received number %d from process 0\n",
>                world_rank, number);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
>
> Regards,
> Doha
>
> On Sun, Jan 22, 2017 at 10:47 PM, Min Si <msi at anl.gov> wrote:
>
>     Hi Doha,
>
>     Can you please send us the config.log file for each MPICH build
>     and your helloworld source code? The config.log file should be
>     under your MPICH build directory where you executed ./configure.
>
>     Min
>
>     On 1/21/17 4:53 AM, Doha Ehab wrote:
>>     I have tried what you suggested in your previous e-mail:
>>
>>     1- I built MPICH for the CPU node and the ARM node.
>>     2- Uploaded the binaries to the same path on the two nodes.
>>     3- Compiled helloWorld (it sends a number from process zero to
>>     all other processes) for both nodes, then tried: mpiexec -np 2 -f
>>     <hostfile with mic hostnames> ./helloworld
>>
>>     I got this error:
>>      Fatal error in MPI_Recv: Other MPI error, error stack:
>>     MPI_Recv(200)................................:
>>     MPI_Recv(buf=0xbe9460d0, count=1, MPI_INT, src=0, tag=0,
>>     MPI_COMM_WORLD, status=0x1) failed
>>     MPIDI_CH3i_Progress_wait(242)................: an error occurred
>>     while handling an event returned by MPIDU_Sock_Wait()
>>     MPIDI_CH3I_Progress_handle_sock_event(554)...:
>>     MPIDI_CH3_Sockconn_handle_connopen_event(899): unable to find the
>>     process group structure with id <>
>>
>>     Regards,
>>     Doha
>>
>>
>>     On Wed, Nov 16, 2016 at 6:38 PM, Min Si <msi at anl.gov> wrote:
>>
>>         I guess you might need to put all the MPICH binaries (e.g.,
>>         hydra_pmi_proxy) in the same path on each node. I have
>>         executed MPICH on Intel MIC chips from the host CPU node,
>>         where the operating systems are different. What I did was:
>>         1. Build MPICH for both the CPU node and the MIC on the CPU
>>         node (you have done this step).
>>         2. Upload the MIC binaries to the same path on the MIC chip
>>         as on the CPU node. For example:
>>            - on the CPU node: /tmp/mpich/install/bin holds the CPU version
>>            - on the MIC:      /tmp/mpich/install/bin holds the MIC version
>>         3. Compile helloworld.c with the MIC version of mpicc.
>>         4. Execute on the CPU node: mpiexec -np 2 -f <hostfile with
>>         mic hostnames> ./helloworld
>>
>>         I think you should be able to follow step 2, but since your
>>         helloworld binary is also built for a different OS, you might
>>         want to put it in the same path on both nodes as well, just
>>         as we do for the MPICH binaries (see the sketch below).
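>>
>>         For illustration only, step 2 might look roughly like this
>>         (scp/ssh are just one way to copy, and "mic0" and the local
>>         build path are placeholders for your actual hostname and
>>         directories):
>>
>>            ssh mic0 mkdir -p /tmp/mpich/install
>>            scp -r /path/to/mic-build/install/* mic0:/tmp/mpich/install/
>>
>>         The key point is that each node ends up with its own native
>>         binaries under the identical path /tmp/mpich/install.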
>>
>>         Min
>>
>>
>>         On 11/16/16 8:29 AM, Kenneth Raffenetti wrote:
>>
>>             Have you disabled any and all firewalls on both nodes? It
>>             sounds like they are unable to communicate during initialization.
>>
>>             Ken
>>
>>             On 11/16/2016 07:34 AM, Doha Ehab wrote:
>>
>>                 Yes, I built MPICH-3 on both systems. I tried the
>>                 code on each node separately and it worked, and I
>>                 tried each node with other nodes that have the same
>>                 operating system and it worked as well. When I try
>>                 the code on the two nodes that have different
>>                 operating systems, no result or error message appears.
>>
>>                 Regards
>>                 Doha
>>
>>                 On Mon, Nov 14, 2016 at 6:25 PM, Kenneth Raffenetti
>>                 <raffenet at mcs.anl.gov> wrote:
>>
>>                     It may be possible to run in such a setup, but
>>                     it would not be recommended. Did you build MPICH
>>                     on both systems you are trying to run on? What
>>                     exactly happened when the code didn't work?
>>
>>                     Ken
>>
>>
>>                     On 11/13/2016 12:36 AM, Doha Ehab wrote:
>>
>>                         Hello,
>>                         I tried to run a parallel (Hello World) C
>>                         code on a cluster that has two nodes. The
>>                         nodes have different operating systems, so
>>                         the code did not work and no results were
>>                         printed.
>>                         How can I make such a cluster work? Are
>>                         there extra steps that should be done?
>>
>>                         Regards,
>>                         Doha
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

