When I was at university our very forward thinking lecturer (Dr Bacon as I remember) introduced us to the most amazing up-to-date programming environment: C on UNIX. Storing an item in a collection meant declaring an array, or if you are really being adventurous, using malloc to dynamically allocate memory. That was it. Back then it was practically state of the art. As well as making for the kind of excellent training that you curse at the time but realize 10 years later you should be eternally thankful for having been made to do it.
What a difference from today when the .NET Framework comes with so many collection classes out of the box that I bet most .NET developers don’t even know what to do with some of them. (Test of your knowledge: Do you know the difference between a Dictionary and a HashSet? WITHOUT Googling it!). In the C++ world, you’re equally well blessed, as the standard template library has long since gone mainstream and boasts an equally rich set of choices.
Now here’s a question for you: What’s the fundamental difference between LinkedList<T> and – well, basically everything else in System.Collections.Generics?
Think about it before you read on.
You just read on without stopping to think, didn’t you. I saw you.
The clue is in the signature of LinkedList’s various Add methods. Look at one of them:
public LinkedListNode<T> AddFirst(T value)
It returns a LinkedListNode<T>. And that LinkedListNode is crucial. Its purpose is to know where it is in the list, and without it, moving around the list is a bit harder (though still possible).
Linked List (with Nodes)
And that LinkedListNode is the difference we’re looking for. All the other collections in System.Generics will let you add an item without anything other than the collection itself needing to know that that item is in the collection (or at least: Anything that’s visible outside the collection. I don’t care about private implementation details). Your item can sit there all pure and happy and oblivious to the fact that not only is it no longer a collection-virgin, but it maybe it now even sits rather promiscuously inside 3 or 4 different collections. All at the same time (gasp).
But linked lists are different. The whole point of a linked list is that it allows efficient insertion, deletion, and navigation, all by having each object store references to its neighbours in the list. And that means an object can’t be in the list unless it knows it’s there. Now of course at this point you’re dying to tell me that I’m slightly lying and really the thing that knows its inside a linked list is merely the LinkedListNode instance. That node also contains a reference to the underlying object and the underlying object still doesn’t know that it’s in the list. That’s fair enough: That class design gives excellent separation of concerns (separating the need to know that where you are in the list from everything else about the underlying object). But it doesn’t change the fact that something connected with the object has to know that it is in the list and where it is.
So that’s one way that I personally categorise collections. You have collections where the members don’t need to know they are in the collection, and collections where they do. Offhand I don’t think anyone’s given a name to those categories, but my personal terminology is that I call the things whose purpose is to know where they are nodes, and I call collections like LinkedList nodal collections.
Of course it may have occurred to you that a categorization that puts one collection on the side and everything else sitting together seems a tad one-sided. Well sorry, I wasn’t aiming for an even split, I was looking for something meaningful that impacts how I code, and in particular how I design my own collection classes. And yes I often find that the MS classes, rich though they are, don’t fulfil my needs and I need to write new collections. Things like:
- A dictionary that can store multiple values with the same key.
- A single-valued dictionary that allows quick lookup of key from value as well as value from key.
- A graph, (in the mathematical sense, like the diagram. Yes, some of my code requires very complicated data structures!).
- Or of course that common favourite, the tree.
A graph. Probably requires nodes!
And one thing that I’ve never read in any best practices book, but which I’ve found repeatedly is that if I design any collection in such a way that objects need to know they are in the collection, my code complexity in the classes that consume that collection seems to practically quadruple. There’s various factors at work there. You get bidirectional relationship madness because the collection needs to know who its nodes are, AND the nodes need to know who the collection is, added to the restriction that the nodes are often the kind of object that really ought to be immutable. And that adds a memory footprint because there are a lot of nodes (each one needs its own data structure to figure out where it is too). You get the division of responsibilities between node and collection: Do you go to the node to add a new object or the collection? It all gets fuzzy, and of course everything related to the collection is a generic class too, which doesn’t really add complexity, but sure makes the code harder to read. You quickly end up with one unholy mess.
In short. My golden rule for collections is:
Don’t design your collection as a nodal collection that needs a public associated node class unless you really have to.
I suspect someone in the Microsoft .NET team independently figured this out long ago because almost all of their collection classes have apparently been designed not to require public nodes.
Of course avoiding nodal collections isn’t always possible. A linked list would simply not be able to function effectively without nodes that know they are in the list. And the same is probably true of things like graphs and trees. But if I think I need a node class: I always look more carefully first.