
PREFACE: This blog post investigates a feature that is prevalent in modern languages yet usually glossed over because of its complex, black-box nature: Garbage Collection (GC). Be prepared for a long haul! This post is written for developers.
DISCLAIMER & NOTES
- I assume the reader has basic knowledge of GC.
- The word memory is used to mean the heap, unless explicitly stated otherwise.
- Hundreds of technical papers have been published on this one topic (GC); this post gives you the condensed essence of decades-long work. Bear with me, the post is occasionally theoretically winding, which is inevitable.
- Note that reference == pointer == cell == node == object in this post.
WHAT IS “GARBAGE”? (DANGLING VS GARBAGE)
The difference between dangling pointers and garbage might shed some light on our understanding. A dangling pointer is a reference to an object that no longer exists. Garbage is an object that can no longer be reached through any reference. There are two types of garbage: semantic and syntactic.
public class Stack {
    private Object[] elements;
    private int size;

    public Stack(int capacity) {
        elements = new Object[capacity];
    }

    public void push(Object e) {
        elements[size++] = e;
    }

    public Object pop() {
        // The slot is not cleared, so the array keeps a reference to the popped object.
        return elements[--size];
    }
}
*Source code borrowed from Wikipedia
In the above code, once you pop a particular element there is no way to re-access the popped item from outside the class, yet the array slot still references it, so the object lingers in runtime memory. That is semantic garbage: the program will never use the object again, but it remains reachable. Syntactic garbage, on the other hand, arises when an object becomes unreachable altogether, for example when a local object goes out of scope through block-level scoping and the global/root scope has no means to reach it any more; the garbage exists due to a syntactic access limitation. Languages like C/C++ have a means to de-allocate explicitly through commands, but modern managed languages need a GC to reclaim memory that is no longer accessible from the global or root reference.
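To make the distinction concrete, here is a small Java sketch of both cases (my own illustration, not part of the Wikipedia example; the class and method names are assumptions):

public class FixedStack {
    private Object[] elements;
    private int size;

    public FixedStack(int capacity) {
        elements = new Object[capacity];
    }

    public void push(Object e) {
        elements[size++] = e;
    }

    // Clearing the slot removes the stack's lingering reference, so the popped
    // object is no longer semantic garbage; once the caller drops its own
    // reference, the object becomes genuinely unreachable and the GC can reclaim it.
    public Object pop() {
        Object top = elements[--size];
        elements[size] = null;
        return top;
    }

    // 'temp' is reachable only through this local variable. When the method
    // returns, the stack frame is gone, so the StringBuilder becomes
    // unreachable from any root: syntactic garbage.
    static void syntacticGarbageExample() {
        StringBuilder temp = new StringBuilder("scratch work");
        temp.append("!");
    }
}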
Also, in languages like Java/C#, dangling pointers cannot occur because there is no mechanism to explicitly deallocate memory; the garbage collector may de-allocate memory, but only when the object is no longer reachable from any reference.
GUTSY EVOLUTION
- Reference Counting Algorithm: Imagine every managed heap cell carrying an additional header field, an integer reference count (refCount), that is automatically incremented when another object starts referencing the cell and automatically decremented when such a reference is deleted. There are a handful of situations in which a GC pass kicks in: it can be induced by the programmer, triggered by a large allocation of memory chunks, or forced when dynamic memory comes under critical pressure. When any of these generic cases is reached, the GC kicks in and reclaims the memory of garbage objects with refCount == 0. Note that the root here is the base pointer to the heap, not the application root pointer. It is quite an expensive operation, but the simplicity of the algorithm keeps it prevalent. One major drawback is that cyclic data structures (A pointing to B and, inversely, B pointing to A) are hard to reclaim, since their counts never drop to zero. One suggested solution is a hybrid approach: mark-and-sweep for cyclic structures, reference counting for everything else. There are innumerable variations of this approach, and they warrant separate work. (A toy reference-counting sketch appears after the mark-and-sweep discussion below.)
- Mark and Sweep / Compact Algorithm: This is quite popular in modern OO runtimes like the .NET runtime and the Java Virtual Machine. It goes like this: consider the whole of your application as a collection of hierarchical objects (each node is equivalent to an object) in a tree-like fashion, with an application-wide root reference as well as a base heap root reference. When any of the GC-inducing cases is reached (refer to the previous bullet for when GC kicks in), the algorithm traverses the whole heap from the roots and marks every object it can reach (like flipping a bit); whatever is left unmarked is unreachable garbage, ready for collection. A bit of a side track here:
Reachable vs. Unreachable: Every application stack holds pointer variables referring to connected hierarchical object structures on the heap. When such an object is alive, i.e., accessible through a root, it is called reachable; if it is not, it is called unreachable. As the GC traverses the nested tree structures (this process is called marking), it evaluates each node for its reachable or unreachable status.
The objects left unmarked after the traversal are the unreachable ones, and they are picked up for collection. Next is the sweeping stage, where their memory is reclaimed and released back to the managed pool for reuse. Sweeping can leave memory holes behind and hence end in heap fragmentation. To address the fragmentation issue there is another variation of this GC called "mark and compact", where the memory holes are compacted or consolidated by relocating or copying the existing live objects (which can be an overhead) and readjusting the pointers to refer to the new locations. The two sketches below illustrate reference counting and the mark/sweep phases.
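First, a toy Java sketch of the reference-counting idea from the first bullet above. The HeapCell class and its retain/release methods are my own illustration of the mechanism, not a real runtime API; an actual collector keeps the count in the object header and maintains it automatically.

import java.util.ArrayList;
import java.util.List;

// Toy model of a reference-counted heap cell (illustrative only).
class HeapCell {
    private int refCount;                          // the extra header field
    private final List<HeapCell> children = new ArrayList<>();

    void retain() { refCount++; }                  // another object now references this cell

    void addChild(HeapCell child) {
        children.add(child);
        child.retain();
    }

    // A reference to this cell was deleted. When the count hits zero the cell
    // is garbage: "free" it and release everything it points to in turn.
    void release() {
        if (--refCount == 0) {
            for (HeapCell child : children) {
                child.release();
            }
            children.clear();                      // simulate reclaiming the cell
        }
    }
}

If two cells reference each other, neither count ever reaches zero, which is precisely the cyclic-garbage drawback noted above.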
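Next, a minimal sketch of the mark and sweep phases: a traversal from the roots sets a mark bit on every reachable node, and the sweep discards whatever stayed unmarked. The GcNode/MarkSweep names are assumptions for illustration; this is not the actual .NET or JVM implementation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class GcNode {
    boolean marked;                 // the mark bit kept in the object header
    List<GcNode> references;        // outgoing pointers to other heap objects
}

class MarkSweep {
    // Mark phase: every node reachable from a root gets its bit set.
    static void mark(List<GcNode> roots) {
        Deque<GcNode> worklist = new ArrayDeque<>(roots);
        while (!worklist.isEmpty()) {
            GcNode node = worklist.pop();
            if (node.marked) continue;
            node.marked = true;
            if (node.references != null) {
                worklist.addAll(node.references);
            }
        }
    }

    // Sweep phase: unmarked nodes are unreachable garbage and are dropped;
    // survivors get their bit cleared for the next cycle.
    static void sweep(List<GcNode> heap) {
        Iterator<GcNode> it = heap.iterator();
        while (it.hasNext()) {
            GcNode node = it.next();
            if (node.marked) {
                node.marked = false;
            } else {
                it.remove();        // the managed pool can reuse this slot
            }
        }
    }
}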
Heuristic behind .NET GC: One GC solution or algorithm doesn't fit all real-world scenarios. We need to take the best of mark and sweep/compact/copying + generational + other heuristic parameters and apply them selectively to make the .NET runtime robust and powerful. It's always the integrated hybrid solution that is sustainable. Right!
Generational algorithms start with one monolithic heap segment, and based on the age/lifetime of objects living through successive GC iterations, objects get promoted incrementally from generation 0 through generation 2 and the LOH (large object heap). Each time a generation is created, a new memory segment/chunk is allocated. Within one single generation, the collector employs mark-sweep/compact/copy algorithms for reclamation as and when there is memory pressure. .NET, in particular, has tweaks (workstation/desktop vs. server GC, runtime version, etc.) that determine the size of the initial heap. .NET also has something called the LOH (large object heap), which is exempt from the iterative generational memory reclamation and is expected to live for a longer period. In .NET 1.1 and 2.0, an object of 85,000 bytes or more is considered a large object. The large object heap is collected when a generation 2 collection occurs, but it is not compacted. The reasoning behind having a special heap for large objects is that they are very expensive to move around; for large arrays in particular, it is very expensive to update all the references. A toy sketch of generational promotion follows.
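As a rough illustration of promotion (a toy model in Java, not how .NET actually manages its segments; the GenerationalToy name and the promotion threshold are my own assumptions), objects that survive a gen 0 collection age by one, and once old enough they move to the next generation:

import java.util.ArrayList;
import java.util.List;

// Toy model of generational promotion: survivors of a gen-0 collection age,
// and once old enough they are promoted to the older generation's segment.
class GenerationalToy {
    static final int PROMOTION_AGE = 1;      // assumed threshold, for illustration

    static class Obj {
        int age;
        boolean reachable;                   // pretend the mark phase already ran
    }

    List<Obj> gen0 = new ArrayList<>();
    List<Obj> gen1 = new ArrayList<>();

    void collectGen0() {
        List<Obj> survivors = new ArrayList<>();
        for (Obj o : gen0) {
            if (!o.reachable) continue;                 // unreachable: reclaimed
            o.age++;
            if (o.age > PROMOTION_AGE) gen1.add(o);     // promoted to the older generation
            else survivors.add(o);                      // stays in gen 0 for now
        }
        gen0 = survivors;
    }
}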
Inference: I have intentionally overlooked a few concepts, but note that this is not a thesis. It started out as a short one-page blog and ended up as a two-part blog series, and frankly, the topic deserves such treatment. We looked at the major classes of GC algorithms and dug a bit into the .NET runtime GC. There is no single solution out there that fits the bill; it's the hybrid solution that everyone embraces. Food for thought: delve into real-time GC algorithms well suited for embedded devices (iPhone / iWatch / Fitbit) that don't have a whole lot of memory or processing power compared to desktops.
I have not failed. I’ve just found 10,000 ways that won’t work. – Thomas A. Edison