[mpich-discuss] Shared Memory Segfaults

Jeff Hammond jeff.science at gmail.com
Mon May 11 17:40:21 CDT 2015


Thanks for the clarification.  The smallest possible reproducer is
always the best, from a bug report perspective :-)

Jeff

On Mon, May 4, 2015 at 8:20 AM, Brian Cornille <bcornille at wisc.edu> wrote:
> Jeff,
>
> I agree that the code I supplied has no interesting parallelism.  It was simply an attempt to diagnose or replicate an error I was getting in a more complicated code that also has shared memory structures pointing to other shared memory structures.  The linked list was just a quick example of a similar structure, to test what the issue could be.  It ended up showing error behavior very similar to the original program's, so I thought I would share it instead of the more complicated project.
> For the project I was working on, I wanted to create large shared memory structures that would be read-only once the initial set-up was done.  Some of these shared memory structures were to have further shared memory structures allocated within them, which appeared to generate the errors I was encountering.  If you are interested in the original code that produced these errors, it is posted at:
> https://github.com/bcornille/mpi-shared-ne506/tree/develop/src (mostly in geoms.c)
> There are likely ways I could restructure the code to work around this problem, but I was hoping to understand what was causing these very strange errors.
> I encountered errors when I tried to make assignments to shared memory regions that were pointed to from other shared memory regions.  I am hoping that the error in the code I provided is the mirror image of the error in my original code.
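>
> To make the kind of nesting concrete (a sketch only, with made-up names,
> not the actual project code): a structure in one shared window refers to
> data in another shared window.  Since each process may map a shared
> segment at a different virtual address, the sketch stores an offset
> instead of a raw pointer and rebuilds the pointer locally with
> MPI_Win_shared_query:
>
> #include <mpi.h>
>
> typedef struct {
>     MPI_Aint data_off;  /* offset into the data window, not a raw pointer */
>     int      n;         /* number of elements */
> } GeomHeader;
>
> /* Rebuild a pointer that is valid in the calling process's address
>    space.  data_win is the shared window holding the element payloads. */
> static double *resolve(MPI_Win data_win, const GeomHeader *hdr)
> {
>     double  *base;
>     MPI_Aint sz;
>     int      du;
>     /* ask where rank 0's segment is mapped in this process */
>     MPI_Win_shared_query(data_win, 0, &sz, &du, &base);
>     return base + hdr->data_off;
> }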
>
> Thanks and best,
> Brian Cornille
>
> On Sat, May 2, 2015 at 5:46 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
>>
>>I do not understand what this program is supposed to do.
>>MPI_Win_allocate_shared is collective, as is MPI_Win_fence, so I don't
>>see how any interesting parallelism emerges from this type of
>>implementation of a linked-list.
>>
>>Are you just using the linked-list to manage a collection of shared
>>memory windows (GMR does this in
>>http://git.mpich.org/armci-mpi.git/blob/HEAD:/src/gmr.c, for example)?
>>
>>If you want to do a distributed linked-list using RMA, I recall there
>>is a linked-list example in the MPI-3 spec or perhaps somewhere else
>>(somebody on this list will remember, or I can look it up).  And this
>>example probably uses MPI_Win_create_dynamic and MPI_Win_attach, which
>>means no interprocess load-store.
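>>
>>From memory, the dynamic-window pattern is roughly the sketch below
>>(check the spec's example for the real thing; comm and Node are
>>placeholders here):
>>
>>MPI_Win  win;
>>Node    *node = malloc(sizeof(Node));
>>MPI_Aint disp;
>>
>>MPI_Win_create_dynamic(MPI_INFO_NULL, comm, &win);  /* collective */
>>MPI_Win_attach(win, node, sizeof(Node));            /* local */
>>MPI_Get_address(node, &disp);
>>/* publish disp to the other ranks; they then access the node with
>>   MPI_Get/MPI_Put(..., owner_rank, disp, ..., win) */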
>>
>>Jeff
>>
>>On Sat, May 2, 2015 at 4:06 PM, Junchao Zhang <jczhang at mcs.anl.gov> wrote:
>>> I can reproduce the segfault with the latest MPICH.  In Clean_List(), I think
>>> there is a data race, since all members of the shared-memory communicator
>>> update the head node:
>>>             head->next_win = temp_win;
>>>             head->next = temp_next;
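>>>
>>> One way to avoid that race (a sketch; shm_comm/shm_rank stand for whatever
>>> communicator the windows were allocated on) is to let a single rank do the
>>> update and synchronize before anyone reads it:
>>>
>>> if (shm_rank == 0) {
>>>     head->next_win = temp_win;
>>>     head->next = temp_next;
>>> }
>>> MPI_Barrier(shm_comm);  /* MPI_Win_sync around the barrier may also be
>>>                            needed for load/store visibility */
>>>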
>>> I tried to simplify Clean_List() further as follows,
>>>
>>> void Clean_List()
>>> {
>>>     Node *cur_node = head;
>>>     MPI_Win cur_win = head_win;
>>>     Node *next_node;
>>>     MPI_Win next_win;
>>>
>>>     while (cur_node) {
>>>         /* read the links before freeing the window that backs this node */
>>>         next_node = cur_node->next;
>>>         next_win = cur_node->next_win;
>>>         /* collective; also releases the node's shared memory */
>>>         MPI_Win_free(&cur_win);
>>>         cur_node = next_node;
>>>         cur_win = next_win;
>>>     }
>>>
>>>     head = tail = NULL;
>>> }
>>>
>>> But I still hit the segfault.  Under gdb, the segfault disappears.  If I comment
>>> out the call to Clean_List() in main(), the error also disappears.
>>> I Cc'ed our local RMA expert Xin to see if she has new findings.
>>>
>>> --Junchao Zhang
>>>
>>> On Sat, May 2, 2015 at 10:26 AM, Brian Cornille <bcornille at wisc.edu> wrote:
>>>>
>>>> Hello,
>>>>
>>>>
>>>> In working on a project that is attempting to use MPI shared memory (from
>>>> MPI_Win_allocate_shared) I began getting inconsistent segfaults in portions
>>>> of the code that appeared to have no memory errors when investigated with
>>>> gdb.  I believe I have at least partially reproduced this error in a small
>>>> program (attached) that creates a linked list of MPI shared-memory-allocated
>>>> elements.
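>>>>
>>>> For reference, allocating one element with MPI_Win_allocate_shared
>>>> follows the usual allocate-and-query pattern (a generic sketch, not the
>>>> attached code verbatim; shm_comm is assumed to come from
>>>> MPI_Comm_split_type with MPI_COMM_TYPE_SHARED):
>>>>
>>>> Node    *node;
>>>> MPI_Win  node_win;
>>>> int      shm_rank;
>>>> MPI_Comm_rank(shm_comm, &shm_rank);
>>>>
>>>> /* collective: rank 0 supplies the storage, the rest allocate 0 bytes */
>>>> MPI_Win_allocate_shared(shm_rank == 0 ? sizeof(Node) : 0, 1,
>>>>                         MPI_INFO_NULL, shm_comm, &node, &node_win);
>>>> if (shm_rank != 0) {
>>>>     MPI_Aint sz;
>>>>     int      du;
>>>>     MPI_Win_shared_query(node_win, 0, &sz, &du, &node);
>>>> }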
>>>>
>>>>
>>>> The attached program segfaults for me when run with more than one process.
>>>> However, it will not segfault if run under gdb (e.g. mpirun -n 2 xterm -e gdb
>>>> ./mpi_shmem_ll).  I have done what I can to eliminate any apparent race
>>>> conditions.  Any help in this matter would be much appreciated.
>>>>
>>>>
>>>> Thanks and best,
>>>>
>>>> Brian Cornille
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>--
>>Jeff Hammond
>>jeff.science at gmail.com
>>http://jeffhammond.github.io/



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
_______________________________________________
discuss mailing list     discuss at mpich.org
To manage subscription options or unsubscribe:
https://lists.mpich.org/mailman/listinfo/discuss

