[mpich-discuss] Shared Memory Segfaults
Brian Cornille
bcornille at wisc.edu
Mon May 4 10:20:03 CDT 2015
Jeff,
I agree that the code I supplied has no interesting parallelism. It was simply an attempt to diagnose or replicate an error I was getting in a more complicated code, which also has shared-memory structures pointing to other shared-memory structures. The linked list was just a quick example of a similar structure to test what the issue could be. It ended up showing very similar error behavior to the original program, so I thought I would share it instead of the more complicated project. For the project I was working on, I wanted to create large shared-memory structures that would be read-only once the initial set-up was done. Some of these shared-memory structures were to have further shared-memory structures allocated within them, which appeared to generate the errors I was encountering. If you are interested in the original code that produced these errors, it is posted at:
https://github.com/bcornille/mpi-shared-ne506/tree/develop/src (mostly in geoms.c)
There are likely ways that I could restructure the code to work around this problem, but I was hoping to understand what was causing the very strange errors I was seeing.
I encountered errors when I tried to make assignments to shared-memory regions that were pointed to from other shared-memory regions. I am hoping that the error in the code I provided is the mirror image of the error in my original code.
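For reference, here is a minimal sketch of the pattern I mean; the Node type, the parent/child names, and the shmem_comm setup are illustrative, not the actual structures from geoms.c:

#include <mpi.h>

typedef struct Node {      /* illustrative, not the struct from geoms.c  */
    struct Node *next;     /* raw pointer into a different shared window */
    MPI_Win      next_win;
} Node;

int main(int argc, char **argv)
{
    MPI_Comm shmem_comm;
    MPI_Win  parent_win, child_win;
    Node    *parent, *child;
    MPI_Aint size;
    int      rank, disp;

    MPI_Init(&argc, &argv);
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmem_comm);
    MPI_Comm_rank(shmem_comm, &rank);

    /* rank 0 provides the storage; every rank shares the mapping */
    MPI_Win_allocate_shared(rank == 0 ? sizeof(Node) : 0, 1, MPI_INFO_NULL,
                            shmem_comm, &parent, &parent_win);
    MPI_Win_allocate_shared(rank == 0 ? sizeof(Node) : 0, 1, MPI_INFO_NULL,
                            shmem_comm, &child, &child_win);
    if (rank != 0) {       /* non-owners query rank 0's base address */
        MPI_Win_shared_query(parent_win, 0, &size, &disp, &parent);
        MPI_Win_shared_query(child_win, 0, &size, &disp, &child);
    }

    if (rank == 0) {
        parent->next     = child;     /* caveat: virtual addresses may
                                         differ across processes        */
        parent->next_win = child_win; /* caveat: MPI_Win handles are
                                         local to a process             */
    }
    MPI_Win_fence(0, parent_win);     /* all ranks now read parent->next */

    MPI_Win_free(&child_win);
    MPI_Win_free(&parent_win);
    MPI_Comm_free(&shmem_comm);
    MPI_Finalize();
    return 0;
}

(This is just the smallest version of "shared memory pointing at shared memory" I can write down, not a proposed fix.)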
Thanks and best,
Brian Cornille
On Sat, May 2, 2015 at 5:46 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
>
>I do not understand what this program is supposed to do.
>MPI_Win_allocate_shared is collective, as is MPI_Win_fence, so I don't
>see how any interesting parallelism emerges from this type of
>implementation of a linked-list.
>
>Are you just using the linked-list to manage a collection of shared
>memory windows (GMR does this in
>http://git.mpich.org/armci-mpi.git/blob/HEAD:/src/gmr.c, for example)?
>
>If you want to do a distributed linked-list using RMA, I recall there
>is a linked-list example in the MPI-3 spec or perhaps somewhere else
>(somebody on this list will remember, or I can look it up). And this
>example probably uses MPI_Win_create_dynamic and MPI_Win_attach, which
>means no interprocess load-store.
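>
>If it helps, the shape of that approach is roughly this (an illustrative
>sketch, not the spec's actual example; elem_t and its fields are made up):
>
>  typedef struct { MPI_Aint next_disp; int next_rank; int value; } elem_t;
>
>  MPI_Win win;        /* assumes MPI is already initialized */
>  elem_t *elem;
>  MPI_Aint disp;
>
>  MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>  MPI_Alloc_mem(sizeof(elem_t), MPI_INFO_NULL, &elem);
>  MPI_Win_attach(win, elem, sizeof(elem_t));
>  MPI_Get_address(elem, &disp);
>  /* exchange (rank, disp) pairs out of band, then traverse the list
>     with MPI_Get on win -- no interprocess load-store */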
>
>Jeff
>
>On Sat, May 2, 2015 at 4:06 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>> I can reproduce the segfault with the latest MPICH. In Clean_List(), I think
>> there is a data race, since all ranks in the shared-memory communicator
>> update the head node:
>>
>>     head->next_win = temp_win;
>>     head->next = temp_next;
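>>
>> One way to avoid such a race, as a sketch, would be to let a single rank
>> perform the stores and then synchronize collectively (shmem_rank here
>> stands in for whatever the test code calls the rank within the
>> shared-memory communicator):
>>
>>     if (shmem_rank == 0) {
>>         head->next_win = temp_win;
>>         head->next = temp_next;
>>     }
>>     MPI_Win_fence(0, head_win);   /* collective; publishes the stores */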
>> I tried to simplify Clean_List() further as follows,
>>
>> void Clean_List()
>> {
>>     Node *cur_node = head;
>>     MPI_Win cur_win = head_win;
>>     Node *next_node;
>>     MPI_Win next_win;
>>
>>     while (cur_node) {
>>         next_node = cur_node->next;
>>         next_win = cur_node->next_win;
>>         MPI_Win_free(&cur_win);
>>         cur_node = next_node;
>>         cur_win = next_win;
>>     }
>>
>>     head = tail = NULL;
>> }
>>
>> But I still hit the segfault. Under gdb, the segfault disappears. If I
>> comment out the call to Clean_List() in main(), the error also disappears.
>> I Cc'ed our local RMA expert Xin to see if she has new findings.
>>
>> --Junchao Zhang
>>
>> On Sat, May 2, 2015 at 10:26 AM, Brian Cornille <bcornille at wisc.edu> wrote:
>>>
>>> Hello,
>>>
>>> While working on a project that attempts to use MPI shared memory (from
>>> MPI_Win_allocate_shared), I began getting inconsistent segfaults in portions
>>> of the code that appeared to have no memory errors when investigated with
>>> gdb. I believe I have somewhat reproduced this error in a small code
>>> (attached) that creates a linked list of elements allocated in MPI shared
>>> memory.
>>>
>>> The attached program segfaults for me when run with more than one process.
>>> However, it will not segfault when run under gdb (e.g. mpirun -n 2 xterm -e
>>> gdb ./mpi_shmem_ll). I have done what I can to eliminate any apparent race
>>> conditions. Any help in this matter would be much appreciated.
>>>
>>> Thanks and best,
>>>
>>> Brian Cornille
>>>
>
>--
>Jeff Hammond
>jeff.science at gmail.com
>http://jeffhammond.github.io/
_______________________________________________
discuss mailing list discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss