This is a description of the algorithms and data structures used in our conservative garbage collector. I expect the level of detail to increase with time. For a survey of GC algorithms, see for example Paul Wilson's excellent paper. For an overview of the collector interface, see here.
This description is targeted primarily at someone trying to understand the source code. It specifically refers to variable and function names. It may also be useful for understanding the algorithms at a higher level.
The description here assumes that the collector is used in default mode. In particular, we assume that it is used as a garbage collector, and not just a leak detector. We initially assume that it is used in stop-the-world, non-incremental mode, though the presence of the incremental collector will be apparent in the design. We assume the default finalization model, but the code affected by that is very localized.
The remaining sections describe the memory allocation data structures, and then the last 3 collection phases in more detail. We conclude by outlining some of the additional features implemented in the collector.
Most static data used by the allocator, as well as that needed by the rest of the garbage collector, is stored inside the _GC_arrays structure. This allows the garbage collector to easily ignore the collector's own data structures when it searches for root pointers. Other allocator and collector internal data structures are allocated dynamically with GC_scratch_alloc. GC_scratch_alloc does not allow for deallocation, and is therefore used only for permanent data structures.
The allocator allocates objects of different kinds. Different kinds are handled somewhat differently by certain parts of the garbage collector. Certain kinds are scanned for pointers, others are not. Some may have per-object type descriptors that determine pointer locations. Or a specific kind may correspond to one specific object layout. Two built-in kinds are uncollectable. One (STUBBORN) is immutable without special precautions. In spite of that, it is very likely that most C clients of the collector currently use at most two kinds: NORMAL and PTRFREE objects. The gcj runtime also makes heavy use of a kind (allocated with GC_gcj_malloc) that stores type information at a known offset in method tables.
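For illustration, the two kinds most commonly used by C clients, and one of the uncollectable kinds, are reachable through the documented gc.h entry points. The following is a minimal sketch, not an excerpt from the collector:

    #include <gc.h>

    int main(void)
    {
        GC_INIT();

        /* NORMAL kind: contents are scanned conservatively for pointers. */
        void **node = (void **)GC_MALLOC(2 * sizeof(void *));

        /* PTRFREE kind: contents are never scanned; appropriate for
         * strings, bitmaps, and other pointer-free data. */
        char *buf = (char *)GC_MALLOC_ATOMIC(1024);

        /* UNCOLLECTABLE kind: never reclaimed implicitly, so it must be
         * released explicitly. */
        void *table = GC_MALLOC_UNCOLLECTABLE(256);

        node[0] = buf;
        GC_FREE(table);
        return 0;
    }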
The collector uses a two-level allocator. A large block is defined to be one larger than half of HBLKSIZE, which is a power of 2, typically on the order of the page size.
Large block sizes are rounded up to the next multiple of HBLKSIZE and then allocated by GC_allochblk. Recent versions of the collector use an approximate best fit algorithm by keeping free lists for several large block sizes. The actual implementation of GC_allochblk is significantly complicated by black-listing issues (see below).
Small blocks are allocated in chunks of size HBLKSIZE. Each chunk is dedicated to only one object size and kind.
The allocator maintains separate free lists for each size and kind of object. Associated with each kind is an array of free list pointers, with entry freelist[i] pointing to a free list of size i objects. In recent versions of the collector, index i is expressed in granules, which are the minimum allocatable unit, typically 8 or 16 bytes. The free lists themselves are linked through the first word in each object (see the obj_link() macro).
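Schematically, allocation from such a free list pops the head object and follows the link stored in its first word. The sketch below uses hypothetical names and bounds; the real allocator keeps one such array per kind:

    #include <stddef.h>

    #define MAX_GRANULES 128   /* assumed bound, for illustration only */

    static void *freelist[MAX_GRANULES];

    /* Pop one object of the given granule count, or return NULL to send
     * the caller to the slow path. Free objects are linked through
     * their first word, as the obj_link() macro does in the real code. */
    static void *pop_small(size_t granules)
    {
        void *result = freelist[granules];
        if (result != NULL)
            freelist[granules] = *(void **)result;  /* follow the link */
        return result;
    }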
Once a large block is split for use in smaller objects, it can only be used for objects of that size, unless the collector discovers a completely empty chunk. Completely empty chunks are restored to the appropriate large block free list.
In order to avoid allocating blocks for too many distinct object sizes, the collector normally does not directly allocate objects of every possible request size. Instead, requests are rounded up to one of a smaller number of allocated sizes, for which free lists are maintained. The exact allocated sizes are computed on demand, but subject to the constraint that they increase roughly in geometric progression. Thus objects requested early in the execution are likely to be allocated with exactly the requested size, subject to alignment constraints. See GC_init_size_map for details.
The actual size rounding operation during small object allocation is implemented as a table lookup in GC_size_map, which maps a requested allocation size in bytes to a number of granules.
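A sketch of that fast-path lookup, with assumed table bounds and contents:

    #include <stddef.h>

    #define MAX_SMALL_BYTES 512   /* assumed small-object limit */

    /* A zero entry means the allocated size has not been computed yet,
     * which diverts the request to the slow path for initialization. */
    static unsigned char size_map[MAX_SMALL_BYTES + 1];

    static size_t request_to_granules(size_t bytes)
    {
        return (bytes <= MAX_SMALL_BYTES) ? size_map[bytes] : 0;
    }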
Both collector initialization and computation of allocated sizes are handled carefully so that they do not slow down the small object fast allocation path. An attempt to allocate before the collector is initialized, or before the appropriate GC_size_map entry is computed, will take the same path as an allocation attempt with an empty free list. This results in a call to the slow path code (GC_generic_malloc_inner), which performs the appropriate initialization checks.
In non-incremental mode, we make a decision about whether to garbage collect whenever an allocation would otherwise have failed with the current heap size. If the total amount of allocation since the last collection is less than the heap size divided by GC_free_space_divisor, we try to expand the heap. Otherwise, we initiate a garbage collection. This ensures that the amount of garbage collection work per allocated byte remains constant.
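The simplified rule can be sketched as follows; the helper functions are placeholders, and GC_free_space_divisor is the real tunable (its default is small, e.g. 3):

    #include <stddef.h>

    static void expand_heap(void)      { /* placeholder: grow the heap */ }
    static void start_collection(void) { /* placeholder: run a full GC */ }

    /* Sketch of the decision taken when an allocation would otherwise
     * fail with the current heap size. */
    static void on_allocation_failure(size_t bytes_allocd_since_gc,
                                      size_t heap_size,
                                      size_t free_space_divisor)
    {
        if (bytes_allocd_since_gc < heap_size / free_space_divisor)
            expand_heap();        /* little garbage expected: grow instead */
        else
            start_collection();   /* keeps GC work per byte constant */
    }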
The above is in fact an oversimplification of the real heap expansion and GC triggering heuristic, which also adjusts for the size of the root set and for certain kinds of fragmentation.
It has been suggested that this should be adjusted so that we favor expansion if the resulting heap still fits into physical memory. In many cases, that would no doubt help. But it is tricky to do this in a way that remains robust if multiple applications are contending for a single pool of physical memory.
At the beginning of the mark phase, all root segments (as described above) are pushed on the stack by GC_push_roots. (Registers and eagerly processed stack sections are processed by pushing the referenced objects instead of the stack section itself.) If ALL_INTERIOR_PTRS is not defined, then stack roots require special treatment. In this case, the normal marking code ignores interior pointers, but GC_push_all_stack explicitly checks for interior pointers and pushes descriptors for target objects.
The marker is structured to allow incremental marking. Each call to GC_mark_some performs a small amount of work towards marking the heap. It maintains explicit state in the form of GC_mark_state, which identifies a particular sub-phase. Some other pieces of state, most notably the mark stack, identify how much work remains to be done in each sub-phase. The normal progression of mark states for a stop-the-world collection is MS_INVALID (there may be accessible unmarked objects), then MS_PUSH_UNCOLLECTABLE (it suffices to push uncollectable objects and roots, and to mark everything reachable from them), then MS_ROOTS_PUSHED (once the mark stack is empty, all reachable objects will be marked), and finally MS_NONE (no unmarked accessible objects remain).
The fact that it performs only a small amount of work per call also allows it to be used as the core routine of the parallel marker. In that case it is normally invoked on thread-private mark stacks instead of the global mark stack. More details can be found in scale.html.
The marker correctly handles mark stack overflows. Whenever the mark stack overflows, the mark state is reset to MS_INVALID. Since there are already marked objects in the heap, this eventually forces a complete scan of the heap, searching for pointers, during which any unmarked objects referenced by marked objects are again pushed on the mark stack. This process is repeated until the mark phase completes without a stack overflow. Each time the stack overflows, an attempt is made to grow the mark stack. All pieces of the collector that push regions onto the mark stack have to be careful to ensure forward progress, even in case of repeated mark stack overflows. Every mark attempt results in additional marked objects.
Each mark stack entry is processed by examining all candidate pointers in the range described by the entry. If the region has no associated type information, then this typically requires that each 4-byte aligned quantity (8-byte aligned with 64-bit pointers) be considered a candidate pointer.
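That inner scan might be sketched as follows, with a placeholder validation helper:

    static void consider_candidate(void *p) { (void)p; /* placeholder */ }

    /* Examine every pointer-aligned word in [bottom, top) as a
     * candidate pointer; validation is described below. */
    static void scan_range(char *bottom, char *top)
    {
        char *p;
        for (p = bottom; p + sizeof(void *) <= top; p += sizeof(void *))
            consider_candidate(*(void **)p);
    }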
We determine whether a candidate pointer is actually the address of a heap block in several steps: the candidate is first checked against rough heap bounds, which all actual heap objects are known to satisfy, and is then mapped, via a two-level table indexed by the high-order address bits, to the header for the block that would contain it. A missing header identifies candidates that do not point into the garbage-collected heap.
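The lookup itself can be sketched as a two-level table walk. Everything below is illustrative; the real top-level structure is GC_top_index, and the field widths here are assumptions:

    #include <stdint.h>

    #define LOG_HBLKSIZE  12   /* assumed: 4K heap blocks */
    #define LOG_BOTTOM_SZ 10   /* assumed second-level table width */
    #define TOP_SZ (1u << 11)  /* assumed first-level table size */

    struct hblkhdr;            /* per-block header, contents omitted */
    struct bottom_index {
        struct hblkhdr *index[1 << LOG_BOTTOM_SZ];
    };
    static struct bottom_index *top_index[TOP_SZ];

    /* Map an address to its block header; NULL means the candidate does
     * not point into the garbage-collected heap. */
    static struct hblkhdr *find_header(void *p)
    {
        uintptr_t block = (uintptr_t)p >> LOG_HBLKSIZE;
        struct bottom_index *bi = top_index[(block >> LOG_BOTTOM_SZ) % TOP_SZ];
        if (bi == NULL)
            return NULL;
        return bi->index[block & ((1 << LOG_BOTTOM_SZ) - 1)];
    }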
At the end of the mark phase, mark bits for left-over free lists are cleared, in case a free list was accidentally marked due to a stray pointer.
This initial sweep pass touches only block headers, not the blocks themselves. Thus it does not require significant paging, even if large sections of the heap are not in physical memory.
Nonempty small object pages are swept when an allocation attempt encounters an empty free list for that object size and kind. Pages for the correct size and kind are repeatedly swept until at least one empty block is found. Sweeping such a page involves scanning the mark bit array in the page header, and building a free list linked through the first words in the objects themselves. This does involve touching the appropriate data page, but in most cases it will be touched only just before it is used for allocation. Hence any paging is essentially unavoidable.
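In outline (hypothetical names; the real reclaim code also clears objects, as described below, and special-cases common sizes):

    #include <stddef.h>

    /* Thread every unmarked object in one small-object page onto a free
     * list, linking through first words; returns the new list head. */
    static void *sweep_page(char *page, size_t obj_bytes,
                            const unsigned char *mark_bits,
                            size_t n_objs, void *free_list)
    {
        size_t i;
        for (i = 0; i < n_objs; i++) {
            if (!mark_bits[i]) {              /* unmarked: reclaim it */
                void *obj = page + i * obj_bytes;
                *(void **)obj = free_list;    /* link via first word */
                free_list = obj;
            }
        }
        return free_list;
    }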
Except in the case of pointer-free objects, we maintain the invariant that any object in a small object free list is cleared (except possibly for the link field). Thus it becomes the burden of the small object sweep routine to clear objects. This has the advantage that we can easily recover from accidentally marking a free list, though that could also be handled by other means. The collector currently spends a fair amount of time clearing objects, and this approach should probably be revisited.
In most configurations, we use specialized sweep routines to handle common small object sizes. Since we allocate one mark bit per word, it becomes easier to examine the relevant mark bits if the object size divides the word length evenly. We also suitably unroll the inner sweep loop in each case. (It is conceivable that profile-based procedure cloning in the compiler could make this unnecessary and counterproductive. I know of no existing compiler to which this applies.)
The sweeping of small object pages could be avoided completely at the expense of examining mark bits directly in the allocator. This would probably be more expensive, since each allocation call would have to reload a large amount of state (e.g. next object address to be swept, position in mark bit table) before it could do its work. The current scheme keeps the allocator simple and allows useful optimizations in the sweeper.
The collector makes an initial pass over the table of finalizable objects, pushing the contents of unmarked objects onto the mark stack. After pushing each object, the marker is invoked to mark all objects reachable from it. The object itself is not explicitly marked. This assures that objects on which a finalizer depends are neither collected nor finalized.
If in the process of marking from an object the object itself becomes marked, we have uncovered a cycle involving the object. This usually results in a warning from the collector. Such objects are not finalized, since it may be unsafe to do so. See the more detailed discussion of finalization semantics.
Any objects remaining unmarked at the end of this process are added to a queue of objects whose finalizers can be run. Depending on collector configuration, finalizers are dequeued and run either implicitly during allocation calls, or explicitly in response to a user request. (Note that the former is unfortunately both the default and not generally safe. If finalizers perform synchronization, it may result in deadlocks. Nontrivial finalizers generally need to perform synchronization, and thus require a different collector configuration.)
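For reference, a client registers a finalizer through the documented interface roughly as follows; the printf body is a stand-in for real cleanup code:

    #include <gc.h>
    #include <stdio.h>

    static void finalize_it(void *obj, void *client_data)
    {
        (void)client_data;
        printf("finalizing %p\n", obj);   /* runs once obj is unreachable */
    }

    int main(void)
    {
        GC_INIT();
        void *obj = GC_MALLOC(64);
        /* The last two arguments receive any previously registered
         * finalizer and its client data; passing NULL discards them. */
        GC_REGISTER_FINALIZER(obj, finalize_it, NULL, NULL, NULL);
        obj = NULL;        /* drop the last reference */
        GC_gcollect();     /* may queue, and later run, the finalizer */
        return 0;
    }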
The collector provides a mechanism for replacing the procedure that is used to mark through objects. This is used both to provide support for Java-style unordered finalization, and to ignore certain kinds of cycles, e.g. those arising from C++ implementations of virtual inheritance.
The most significant modification is that the collector always starts running in the allocating thread. There is no separate garbage collector thread. (If parallel GC is enabled, helper threads may also be woken up.) If an allocation attempt either requests a large object, or encounters an empty small object free list, and notices that there is a collection in progress, it immediately performs a small amount of marking work as described above.
This change was made both because we wanted to easily accommodate single-threaded environments, and because a separate GC thread requires very careful control over the scheduler to prevent the mutator from out-running the collector, and hence provoking unneeded heap growth.
In incremental mode, the heap is always expanded when we encounter insufficient space for an allocation. Garbage collection is triggered whenever we notice that more than GC_heap_size/(2 * GC_free_space_divisor) bytes of allocation have taken place. After GC_full_freq minor collections a major collection is started.
All collections initially run uninterrupted until a predetermined amount of time (50 msecs by default) has expired. If this allows the collection to complete entirely, we can avoid correcting for data structure modifications during the collection. If it does not complete, we return control to the mutator, and perform small amounts of additional GC work during those later allocations that cannot be satisfied from small object free lists. When marking completes, the set of modified pages is retrieved, and we mark once again from marked objects on those pages, this time with the mutator stopped.
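From the client's side, incremental mode is entered through the documented calls; GC_time_limit is the tunable behind the 50 msec default:

    #include <gc.h>

    int main(void)
    {
        GC_INIT();
        GC_enable_incremental();  /* incremental/generational collection */
        GC_time_limit = 50;       /* msecs of GC work per pause (the default) */

        for (int i = 0; i < 1000000; i++) {
            void *p = GC_MALLOC(64);  /* GC work is spread across such calls */
            (void)p;
        }
        return 0;
    }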
We keep track of modified pages using one of several distinct mechanisms, selected when the collector is built: pages can be write-protected and the resulting faults caught (MPROTECT_VDB), dirty bits can be read from the operating system, e.g. through /proc (PROC_VDB) or Windows' GetWriteWatch (GWW_VDB), the client can identify modified pages explicitly (MANUAL_VDB), or all pages can conservatively be treated as dirty (DEFAULT_VDB).
During the mark phase, the collector tracks "near misses", i.e. attempts to follow a "pointer" to just outside the garbage-collected heap, or to a currently unallocated page inside the heap. Pages that have been the targets of such near misses are likely to be the targets of misidentified "pointers" in the future. To minimize the future damage caused by such misidentifications, they will be allocated only to small pointer-free objects.
The collector understands two different kinds of black-listing. A page may be black-listed for interior pointer references (GC_add_to_black_list_stack) if it was the target of a near miss from a location that requires interior pointer recognition, e.g. the stack, or the heap if GC_all_interior_pointers is set. In this case, we also avoid allocating large blocks that include this page.
If the near miss came from a source that did not require interior pointer recognition, it is black-listed with GC_add_to_black_list_normal. A page black-listed in this way may appear inside a large object, so long as it is not the first page of a large object.
The GC_allochblk routine respects black-listing when assigning a block to a particular object kind and size. It occasionally drops (i.e. allocates and forgets) blocks that are completely black-listed, in order to avoid excessively long large-block free lists containing only unusable blocks. This would otherwise become an issue if there is low demand for small pointer-free objects.
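The placement rules can be summarized in a sketch (the predicates are hypothetical stand-ins for the collector's black-list queries):

    #include <stddef.h>

    static int normal_black_listed(size_t page) { (void)page; return 0; }
    static int stack_black_listed(size_t page)  { (void)page; return 0; }

    /* A normally black-listed page may not be the first page of a large
     * object; a stack-black-listed page may not appear anywhere in it. */
    static int usable_for_large_object(size_t first_page, size_t n_pages)
    {
        size_t i;
        if (normal_black_listed(first_page))
            return 0;
        for (i = 0; i < n_pages; i++)
            if (stack_black_listed(first_page + i))
                return 0;
        return 1;
    }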
In particular, it is very difficult for the collector to stop all other threads in the system and examine the register contents. This is currently accomplished with very different mechanisms for different Pthreads implementations. The Solaris implementation temporarily disables much of the user-level threads implementation by stopping kernel-level threads ("lwp"s). The Linux/HPUX/OSF1 and Irix implementations send signals to individual Pthreads and have them wait in the signal handler.
The Linux and Irix implementations use only documented Pthreads calls, but rely on extensions to their semantics. The Linux implementation linux_threads.c relies on only very mild extensions to the pthreads semantics, and already supports a large number of other Unix-like pthreads implementations. Our goal is to make this the only pthread support in the collector.
(The Irix implementation is separate only for historical reasons and should clearly be merged. The current Solaris implementation probably performs better in the uniprocessor case, but does not support thread operations in the collector. Hence it cannot support the parallel marker.)
All implementations must intercept thread creation and a few other thread-specific calls to allow enumeration of threads and location of thread stacks. This is currently accomplished with #define's in gc.h (really gc_pthread_redirects.h), or optionally by using ld's function call wrapping mechanism under Linux.
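The redirection amounts to macro definitions of roughly the following shape (schematic, in the spirit of gc_pthread_redirects.h rather than a verbatim excerpt):

    #include <pthread.h>

    /* The wrapper registers the new thread and its stack with the
     * collector before running the user's start routine. */
    int GC_pthread_create(pthread_t *new_thread,
                          const pthread_attr_t *attr,
                          void *(*start_routine)(void *), void *arg);

    #define pthread_create GC_pthread_create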
Recent versions of the collector support several facilities to enhance the processor-scalability and thread performance of the collector. These are discussed in more detail here. We briefly outline the approach to thread-local allocation in the next section.
The free list arrays associated with each thread are only used to satisfy requests for objects that are both very small, and belong to one of a small number of well-known kinds. These currently include "normal" and pointer-free objects. Depending on the configuration, "gcj" objects may also be included.
Thread-local free list entries contain either a pointer to the first element of a free list, or they contain a counter of the number of allocation granules, corresponding to objects of this size, allocated so far. Initially they contain the value one, i.e. a small counter value.
Thread-local allocation allocates directly through the global allocator if the object is of a size or kind not covered by the local free lists.
If there is an appropriate local free list, the allocator checks whether it contains a sufficiently small counter value. If so, the counter is simply incremented, and the global allocator is used. In this way, the initial few allocations of a given size bypass the local allocator. A thread that only allocates a handful of objects of a given size will not build up its own free list for that size. This avoids wasting space for unpopular object sizes or kinds.
Once the counter passes a threshold, GC_malloc_many is called to allocate roughly HBLKSIZE space and put it on the corresponding local free list. Further allocations of that size and kind then use this free list, and no longer need to acquire the allocation lock. The allocation procedure is otherwise similar to that for the global free lists. The local free lists are also linked using the first word in the object. In most cases this means allocation requires considerably less time.
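A sketch of the resulting fast path; all names are hypothetical except GC_MALLOC and GC_malloc_many, and the exact counter encoding is an assumption:

    #include <gc.h>
    #include <stdint.h>

    #define COUNTER_MAX 16   /* assumed counter threshold */

    /* An entry holding a small integer acts as a counter (it starts at
     * one); anything larger is the head of a thread-local free list. */
    static void *local_malloc(void **entry, size_t bytes)
    {
        uintptr_t v = (uintptr_t)*entry;
        if (v < COUNTER_MAX) {
            *entry = (void *)(v + 1);      /* keep counting */
            return GC_MALLOC(bytes);       /* still use the global allocator */
        }
        if (v == COUNTER_MAX) {
            /* Threshold reached: fetch a whole list of objects at once. */
            *entry = GC_malloc_many(bytes);
            if (*entry == NULL)
                return GC_MALLOC(bytes);   /* refill failed; fall back */
        }
        void *head = *entry;
        *entry = *(void **)head;           /* pop; linked via first word */
        return head;
    }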
Local free lists are treated by most of the rest of the collector as though they were in-use reachable data. This requires some care, since pointer-free objects are not normally traced, and hence a special tracing procedure is required to mark all objects on pointer-free and gcj local free lists.
On thread exit, any remaining thread-local free list entries are transferred back to the global free list.
Note that if the collector is configured for thread-local allocation, GC versions before 7 do not invoke the thread-local allocator by default. GC_malloc only uses thread-local allocation in version 7 and later.
For some more details see here, and the technical report entitled "Fast Multiprocessor Memory Allocation and Garbage Collection".
Comments are appreciated. Please send mail to gc@linux.hpl.hp.com (GC mailing list) or Hans.Boehm@hp.com.
This is a modified copy of a page written while the author was at SGI. The original was here.