[mpich-discuss] Hybrid HPC system
Min Si
msi at anl.gov
Tue Jan 24 12:45:04 CST 2017
It seems something is wrong with the process group id. Can you try the
execution again with MPICH debug messages enabled? You can enable debugging as below:
- configure MPICH again with --enable-g=all, then make && make install
- before executing:
mkdir -p log/
export MPICHD_DBG_PG=yes
export MPICH_DBG_FILENAME="log/dbg-%d.log"
export MPICH_DBG_CLASS=ALL
export MPICH_DBG_LEVEL=VERBOSE
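For example, the whole sequence would look something like this (the
--prefix path and the hostfile name "hosts" below are only placeholders
for your own setup):

   ./configure --prefix=/tmp/mpich/install --enable-g=all
   make && make install
   mkdir -p log/
   export MPICHD_DBG_PG=yes
   export MPICH_DBG_FILENAME="log/dbg-%d.log"
   export MPICH_DBG_CLASS=ALL
   export MPICH_DBG_LEVEL=VERBOSE
   mpiexec -np 2 -f hosts ./helloworld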
Then, can you send me the output and the log files in log/?
Min
On 1/23/17 10:46 AM, Doha Ehab wrote:
> Hi Min,
> I have attached the two config.log files, and here is the code:
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     int i = 0;
>
>     MPI_Init(&argc, &argv);    /* starts MPI */
>
>     /* Find out rank and size */
>     int world_rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>     int world_size;
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>
>     int number;
>     if (world_rank == 0) {
>         /* rank 0 sends a number to every other rank */
>         number = -1;
>         for (i = 1; i < world_size; i++) {
>             MPI_Send(&number, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
>         }
>     } else {
>         MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
>                  MPI_STATUS_IGNORE);
>         printf("Process %d received number %d from process 0\n",
>                world_rank, number);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
>
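> For reference, I compile it with the mpicc from each MPICH install and
> launch it the same way as before (the hostfile name "hosts" below is
> just a placeholder):
>
>     mpicc -o helloworld helloworld.c
>     mpiexec -np 2 -f hosts ./helloworld
>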
> Regards,
> Doha
>
> On Sun, Jan 22, 2017 at 10:47 PM, Min Si <msi at anl.gov> wrote:
>
> Hi Doha,
>
> Can you please send us the config.log file for each MPICH build
> and your helloworld source code? The config.log file should be
> under your MPICH build directory where you executed ./configure.
>
> Min
>
> On 1/21/17 4:53 AM, Doha Ehab wrote:
>> I have tried what you mentioned in the previous E-mail.
>>
>> 1- I have built MPICH for the CPU node and the ARM node.
>> 2- Uploaded the binaries to the same path on the 2 nodes.
>> 3- Compiled helloWorld (it sends a number from process zero to
>> all other processes) for both nodes. Then tried mpiexec -np 2 -f
>> <hostfile with mic hostnames> ./helloworld
>>
>> I got this error:
>> Fatal error in MPI_Recv: Other MPI error, error stack:
>> MPI_Recv(200)................................:
>> MPI_Recv(buf=0xbe9460d0, count=1, MPI_INT, src=0, tag=0,
>> MPI_COMM_WORLD, status=0x1) failed
>> MPIDI_CH3i_Progress_wait(242)................: an error occurred
>> while handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(554)...:
>> MPIDI_CH3_Sockconn_handle_connopen_event(899): unable to find the
>> process group structure with id <>
>>
>> Regards,
>> Doha
>>
>>
>> On Wed, Nov 16, 2016 at 6:38 PM, Min Si <msi at anl.gov> wrote:
>>
>> I guess you might need to put all the MPICH binaries (e.g.,
>> hydra_pmi_proxy) in the same path on each node. I have
>> executed MPICH on Intel MIC chips from the host CPU node
>> where the OSes are different. What I did was:
>> 1. build MPICH for both the CPU node and the MIC on the CPU node
>> (you have done this step).
>> 2. upload the MIC binaries to the same path on the MIC chip as on
>> the CPU node.
>> For example:
>> - on CPU node : /tmp/mpich/install/bin holds the CPU version
>> - on MIC      : /tmp/mpich/install/bin holds the MIC version
>> 3. compile helloworld.c with the MIC version of mpicc
>> 4. execute on the CPU node: mpiexec -np 2 -f <hostfile with mic
>> hostnames> ./helloworld
>>
>> I think you should be able to follow step 2, but since your
>> helloworld binary is also built for a different OS, you might
>> want to put it in the same path on the two nodes as well, just
>> as we do for the MPICH binaries.
>>
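>> For example, something along these lines (the MIC hostname "mic0",
>> the staging path for the MIC build on the CPU node, and the hostfile
>> "hosts" below are just placeholders for your own setup):
>>
>>     # step 2: put the MIC build at the same path on the MIC
>>     ssh mic0 mkdir -p /tmp/mpich
>>     scp -r /path/to/mic-build/install mic0:/tmp/mpich/install
>>     # step 3: build helloworld with the MIC mpicc, copy it to the same path too
>>     /path/to/mic-build/install/bin/mpicc -o helloworld helloworld.c
>>     scp helloworld mic0:$PWD/
>>     # step 4: launch both ranks from the CPU node with its own mpiexec
>>     /tmp/mpich/install/bin/mpiexec -np 2 -f hosts ./helloworld
>>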
>> Min
>>
>>
>> On 11/16/16 8:29 AM, Kenneth Raffenetti wrote:
>>
>> Have you disabled any and all firewalls on both nodes? It
>> sounds like they are unable to communicate during initialization.
>>
>> Ken
>>
>> On 11/16/2016 07:34 AM, Doha Ehab wrote:
>>
>> Yes, I built MPICH-3 on both systems. I tried the code on each
>> node separately and it worked, and I tried each node with other
>> nodes that have the same operating system and it worked as well.
>> When I try the code on the 2 nodes that have different operating
>> systems, no result or error message appears.
>>
>> Regards
>> Doha
>>
>> On Mon, Nov 14, 2016 at 6:25 PM, Kenneth Raffenetti
>> <raffenet at mcs.anl.gov> wrote:
>>
>> It may be possible to run in such a setup, but it would not be
>> recommended. Did you build MPICH on both systems you are trying
>> to run on? What exactly happened when the code didn't work?
>>
>> Ken
>>
>>
>> On 11/13/2016 12:36 AM, Doha Ehab wrote:
>>
>> Hello,
>> I tried to run a parallel (Hello World) C code on a cluster that
>> has 2 nodes. The nodes have different operating systems, so the
>> code did not work and no results were printed.
>> How can I make such a cluster work? Are there extra steps that
>> should be done?
>>
>> Regards,
>> Doha
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss
More information about the discuss mailing list