Cross-Machine Goo Objects

This is a rough record of various conversations and online discussions about how goo will work in a multi-machine environment, circa February 1999.

Redefining OID

Quick summary of existing practice:  We expect that on any given goo engine, the database of created objects and classes will easily grow past the amount of available memory on the machine.  So instead of keeping all objects in memory, they're stored in a database on disk, with a memory cache of the frequently used objects.  Instead of using C pointers to track object links (an object that contains a reference to another object), we use a 128-bit OID (Object ID), which is used to look up entries in the database.

Every object has a unique OID, so I just borrowed the concept of a UUID (Universally Unique ID) from a draft RFC Microsoft submitted to the IETF a year or so ago (yes, I did just use the terms "RFC" and "Microsoft" in the same phrase) -- though I modified Goo's UUID/OID to remove Windows-centric timestamp behavior and make some of the math easier for me.  An OID is roughly based on the machine's ethernet address, the current time in milliseconds, and a counter.

Since we're not using proper UUIDs, there's no reason not to scrap them and redefine the OID completely.  Greg therefore suggested creating a GWID (Goo World ID) which is unique to each machine running Goo.  This would likely be a 32 or 64-bit hash of the ethernet address, the current time, and some other machine-dependent info.  It's important that the GWID be unique for each machine.

The OID can then be redefined to be composed of the GWID, timestamp, and counter.  With a proper table of (hostname, GWID) pairs, you could then always identify the creator of an object: extract the GWID from the OID, and look up the hostname for that GWID.  The OID would of course still be unique.
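As a rough sketch of that composition, the OID can be treated as one big integer with the GWID in the high bits.  The 64/48/16 bit split below is an assumption for illustration, not a settled layout:

```python
GWID_BITS, TIME_BITS, COUNTER_BITS = 64, 48, 16  # split is a guess; totals 128

def make_oid(gwid, timestamp_ms, counter):
    """Pack an OID as (gwid | timestamp | counter) in one 128-bit integer."""
    assert gwid < (1 << GWID_BITS)
    assert timestamp_ms < (1 << TIME_BITS)
    assert counter < (1 << COUNTER_BITS)
    return (gwid << (TIME_BITS + COUNTER_BITS)) \
         | (timestamp_ms << COUNTER_BITS) \
         | counter

def creator_gwid(oid):
    """Extract the creator's GWID from the high bits of an OID."""
    return oid >> (TIME_BITS + COUNTER_BITS)
```

With this layout, extracting the creator is a single shift, which is all the (hostname, GWID) lookup needs.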

Owning an Object

Each object would now have a "creator" machine (embedded in the OID) and an "owner" machine (the machine that's currently storing that object in its local database).  For most objects (I'd say at least 90-95%), these will be the same machine: rooms, objects associated with those rooms, short-range NPCs [non-player characters: objects that traverse rooms and interact with players as if they were robotically controlled players] and the like would remain on the machine that created them.

However, NPCs that wander over the entire gameboard, and objects carried by players who are wandering across machines -- these things will end up "living" on a machine that didn't create them.  For example, an NPC that enters a portal and moves from Bob's gameboard to Jim's gameboard will end up being serialized, and transferred from Bob's goo engine to Jim's goo engine.

When this happens, the "creator" machine for the object will record the "forwarding address" (GWID) -- it will be responsible for keeping track of where its objects have wandered off to.  When we want to find an object, we can extract the GWID of the creator from the OID of the object and contact that host.  The creator host will then direct us to the current owner, and we can talk to the owner.

We should keep a rather long-lived cache of this information, so that we're not always bugging the creator host when it doesn't currently own the object -- if the object moves, and we talk to the old owner, it will just shrug and say "I don't own that object".  In that case, we clear the cache entry, and go back to the creator host and ask for the owner once again.  Naturally, when an object is moved from one machine to another, the creator host must be updated about who's holding it.
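The lookup-with-cache dance could look something like this sketch.  The RPC stubs (`ask_creator`, `ask_host_owns`) are hypothetical stand-ins for whatever wire protocol goo ends up using:

```python
class ObjectLocator:
    """Resolve an OID to its current owner: check a long-lived cache,
    fall back to asking the creator host (extracted from the OID)."""

    def __init__(self, ask_creator, ask_host_owns):
        self.cache = {}                      # oid -> owner GWID
        self.ask_creator = ask_creator       # RPC stub: (creator, oid) -> owner
        self.ask_host_owns = ask_host_owns   # RPC stub: (host, oid) -> bool

    def find_owner(self, oid, creator):
        owner = self.cache.get(oid)
        if owner is not None and self.ask_host_owns(owner, oid):
            return owner
        # Cache miss, or the cached owner shrugged ("I don't own that") --
        # drop the stale entry and ask the creator for the forwarding address.
        self.cache.pop(oid, None)
        owner = self.ask_creator(creator, oid)
        self.cache[oid] = owner
        return owner
```

The key property is that a stale cache entry costs one wasted RPC (the shrug), after which we fall back to the creator.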


When trying to find an object, you now have two points of failure: the creating machine can be down, or the owning machine can be down.  This is just a fact of net life, so we have to cope with it.  When a machine goes "down", it could be temporary or permanent.  It could be for 10 seconds, or it could be forever (hydrogen bomb).

If the creator machine is down, but some other machine currently owns the object, the object is reachable only by those who have the owner information cached.  For everyone else, the object has effectively popped out of existence.  Attempts to reach the object will time out (or instantly fail).  In this case, the goo engine should put the object on a "blacklist" with exponential backoff.  For example, after the first failed attempt, blacklist the object for 5 minutes.  If the object is accessed by goo code after 5 minutes, try again, and if it fails again, blacklist for 10 minutes, and so on.  Eventually the object will return, or the blacklist timeout will reach some magic length (a week?) where we are effectively not even checking anymore -- the object is dead.
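A minimal sketch of that backoff blacklist, with the 5-minute base and one-week cap taken from the numbers above (both of which are guesses in the first place):

```python
import time

class AwolBlacklist:
    """Exponential-backoff blacklist for unreachable objects:
    5 minutes after the first failure, doubling each time, up to a week."""

    BASE = 5 * 60              # first blacklist period, in seconds
    DEAD = 7 * 24 * 60 * 60    # the "magic length": effectively dead

    def __init__(self, clock=time.time):
        self.clock = clock
        self.entries = {}      # oid -> (retry_at, current_timeout)

    def should_try(self, oid):
        entry = self.entries.get(oid)
        return entry is None or self.clock() >= entry[0]

    def record_failure(self, oid):
        entry = self.entries.get(oid)
        timeout = self.BASE if entry is None else min(entry[1] * 2, self.DEAD)
        self.entries[oid] = (self.clock() + timeout, timeout)

    def record_success(self, oid):
        self.entries.pop(oid, None)    # the object came back; forgive it
```

Note that the blacklist is purely local per-engine state; no coordination with other hosts is needed.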

When goo code tries to access an object that can't be reached in this way, we call it AWOL (from US Army term: Absent Without Leave).  After the timeout, the goo engine will trigger an exception (ObjectAWOL) on the calling code.  Smart code can catch this exception and either continue normally, or take evasive action.  If the exception isn't caught, the goo code snippet will exit and the player will see a rude error message.
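From the goo-code side, this could surface as an ordinary catchable exception.  The class and method names here are hypothetical, and the example is deliberately trivial:

```python
class ObjectAWOL(Exception):
    """Raised when a referenced object can't be reached (is AWOL)."""

def describe(obj_ref):
    # "Smart" code catches ObjectAWOL and continues normally; if it
    # didn't, the snippet would exit and the player would see a rude
    # error message.
    try:
        return obj_ref.call("description")
    except ObjectAWOL:
        return "Something seems to be missing here."
```

The evasive action doesn't have to be clever -- even a canned fallback string beats a raw error in the player's face.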

If the creator machine is alive and well, but the owner machine is down, a similar situation occurs.  We get the forwarding address from the creator, but have trouble contacting the owner.  Pretty much everything else happens as described above.  The difference here is that if the owner machine is down, nobody can access the object; if only the creator machine is down, some people can access the object while it's still in their cache, but most people can't.  Eventually the GWID of the creator will disappear from the (hostname, GWID) table, and the owner machine should do a periodic scan of its database to extinguish any orphaned objects like that.  This check wouldn't need to be done very often, and you might be able to get away with forgetting to do it at all, though you'd risk collecting unusable cruft objects.


Player character objects should always remain on the server they were created on.  Weird things could happen if the player's server couldn't reach its own player character, but other people could.  If a player character is on the gameboard in a different machine, and that machine drops off the net, the player character can be whisked away to its "home" room, or if that's gone too, to a "safe" room on the player's server.

Classes are a weird case.  Whenever an object is owned by a machine, that machine will need to have a copy of the class.  Possibly it can just make a copy of the class, and then periodically check the creator machine to see if it needs to get an update of the class definition.  The authoritative class definition should always remain on the creator machine for that class, with other machines just using temporary copies -- that is, class definitions don't move around like objects.  If the creator machine goes down, each machine that owns an object of that class will still have a copy of the class definition, so things should be able to hobble along.  (Class definitions have an OID just like objects, and that OID will of course have the GWID in it also.)

Lazy Object Migration

Object migration is when an object moves from one host to another.  All we've really agreed on is that players and rooms never migrate -- figuring out when/if other objects should migrate has turned out to be a really tricky question.  My first idea was that objects would just migrate when they changed "owners", but the game may sometimes lack intrinsic knowledge of what "owns" what, and guessing gets too confusing.

Greg had the idea that you could migrate players, and then just recursively migrate any objects that the player had references to -- but you run into issues of when to stop.  For example, Robey and Scott might have walkie-talkies that they use to communicate with each other.  When Robey walks to a different host, his walkie-talkie goes with him.  But his walkie-talkie has a reference to Scott's, and moving Scott's walkie-talkie doesn't make much sense.  Other cases get even more complex in an ever-downward spiral.

Greg's second proposal is "lazy migration".  As players and NPCs move around, no objects migrate at first.  But as I keep using my flashlight, it eventually behooves the flashlight's owner to migrate it over to my host.  The easiest threshold I could think of is N consecutive RPC calls to the object (easy for goo to track, since any RPC call from another host [or the local host] would reset the count).  Note that this means some objects might never migrate, either because the player/NPC isn't using it much, or because other pieces of the game are using it just as much.  For example, my walkie-talkie might never migrate if Scott talks to me about as often as I talk to him, since Scott's host will be referring to my walkie-talkie whenever Scott talks, and my host will be referring to it whenever I talk.
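The consecutive-call counter is cheap to track.  A sketch, with the threshold N left as a tunable guess:

```python
class MigrationCounter:
    """Track consecutive RPC calls per object; after N consecutive calls
    from the same remote host, suggest migrating the object to that host.
    A call from any other host (or the local host) resets the streak."""

    def __init__(self, threshold=20):    # N is a tunable guess
        self.threshold = threshold
        self.streaks = {}                # oid -> (last_caller_gwid, count)

    def record_call(self, oid, caller, local_host):
        last, count = self.streaks.get(oid, (None, 0))
        count = count + 1 if caller == last else 1
        self.streaks[oid] = (caller, count)
        # Only suggest migrating to a *remote* host that keeps calling.
        return caller != local_host and count >= self.threshold
```

The walkie-talkie scenario falls out naturally: alternating callers keep resetting each other's streak, so the threshold is never reached.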

Object migration should definitely be initiated by the game (not by the goo engine), so the game can use a little bit of "smarts" or "sanity checking" on object migration.  It could, at the very least, make sure that players and rooms never migrated unless it wanted to, and it could make sure objects in a player's possession (or a room's possession) were migrated to that player or room (if we think that's a good idea).  The ambiguous cases could be handled through the lazy migration scheme.

Nimby Garbage Collection

GC across a distributed space is apparently an unsolved problem, which is not an encouraging thing (to say the least).  But I think we can cope with it by using reference counts.  Any link (reference) to another object increments the target object's refcount.  If the link is thrown away, the refcount is decremented.  If the link is just passed to a new machine, nothing special is done.

Since the game should be able to track down all local-host objects by traversing its map of rooms and player list, a traversal-type garbage collector should mostly work.  It will easily detect objects that are obviously still in use, and we can set those aside as "in use" and concentrate on the rest -- the "questionable" pile.

The "questionable" pile is questionable because of the corkscrew that a distributed game adds: Two objects might appear to be a closed loop, but they may be referenced by (living) objects on another host.  So we need another phase to figure this out.  I think this can be done with a quadratic algorithm that sort of linearly searches each item in the "questionable" pile and determines how many inter-references there are.  For example, if an object has a refcount of 3, but there are only 2 objects in the "questionable" pile that have references to it, then one reference is from off-site.  We mark these objects as "nimby" (explained below) and traverse their references, moving those chains over to the "in use" pile.  Anything left over is a local closed-loop (all refcounts are accounted for locally) and can be discarded.

"Nimby" objects are thus objects that are "in use" by some other host, but not used by anything locally.  Therefore the local host has no need for them, and wants to get rid of them as soon as possible.  It has no way of knowing which remote host holds a reference to this object chain, but the next time an RPC comes in for one of these objects, goo will see that it's a nimby object, and it will proceed to evict the object chain (ie, that object, and a recusive descent of all references), and migrate it to the remote host that sent the RPC.  That's why they're called "nimby" -- they're marked for eviction as soon as we can find a victim to migrate them to, because we definitely don't want them.

[ In the 80s and 90s in America, garbage disposal became a big problem.  Dumps and landfills (and toxic waste dumps) were rapidly filling up, but it was hard to open new ones, because nobody wanted a bunch of trash dumped in their city/state/etc -- "Not In My BackYard" was used to refer to this game of trash hot potato, later abbreviated to NIMBY. ]

A possible loophole is that you could have a closed loop (objects that only refer to each other, but aren't referred to by anything active in the game) that spans more than one host.  Maybe an object on A has a reference to (and is referred to by) an object on B.  They each have a refcount of 1, with no local references, meaning they appear to be "in use" remotely, so they're each marked nimby -- but they'll never come into play.  In this case, the garbage collector should do one final step.  If, while collecting, it finds a nimby object that was already marked as a nimby on a previous cycle, it should decide (based on a random number) whether to do a no-op RPC on any objects that the nimby refers to.  If it decides to, that RPC may trigger remote nimby objects to be evicted and migrated over to the local host, and on the next round of garbage collection, the whole thing will appear as a local closed loop, and get thrown away.
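The random "poke" pass could be sketched like so.  Everything here (the helper names, the 50% coin) is illustrative, not decided:

```python
import random

def gc_nimby_pass(nimbies, already_marked, refs, host_of, local_host,
                  rpc_noop, coin=random.random):
    """Second-pass nimby check: for a nimby that was already marked on a
    previous GC cycle, randomly decide to 'ping' its cross-host references.
    The remote side treats any RPC on a nimby as a cue to evict its chain
    toward the caller, so dead loops coalesce onto one host over time."""
    for oid in nimbies:
        if oid not in already_marked:
            continue                   # first sighting: just leave it marked
        if coin() < 0.5:               # random decision, so not every host pokes
            for target in refs.get(oid, []):
                if host_of(target) != local_host:
                    rpc_noop(target)   # may trigger eviction/migration to us
```

The randomness just keeps two hosts from perpetually pinging each other on the same cycle; eventually one of them wins the coin flip first and inherits the whole loop.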

Nimby Example / Justification

To keep it simple, imagine you have a closed "dead" loop that only spans two hosts.  Each host will figure out the loop is nimby, because at least one object on each host will have an unaccounted-for refcount (ie, a refcount that stems from being referred to by an object on another host).  So each object in the loop will be marked as a nimby.

On the next iteration, the GC will see the nimbies, see that they're already marked, and it will decide at random to basically poke them to see if they squirm.  It follows all the references in them, looking for a reference to an object on another host.  Since this is a closed loop, at least one object on each host does contain a reference to an object on the other host.  The GC will then do a no-op RPC to that object, sort of "pinging" it.

Recall that any RPC on a nimby object causes that host to evict the nimby object (and any nimby object chain of referrals it's connected to) to the host that did the RPC.  So whichever host does the "ping" first will get the other half of the dead loop dumped in its lap.  But this is a good thing!  Because on the next GC cycle, the entire loop is on the local host, easily determined to be dead, and wiped.

You can expand this to more than two hosts -- it'll just happen in stages as various pieces of the loop coalesce onto the unlucky machines that do the pinging (determined at random).  The dead loops will never resplinter -- only coalesce -- so you can prove that eventually you'll be down to the entire loop sitting on one host.

What about non-loops?  A chain of objects on host A refers to a chain of objects on host B, but not vice-versa.  Well, since A refers to B, but nobody else refers to A, the chain on A will be considered "dead" on the first GC cycle, and wiped.  The wiping of A will involve removing the refcount from the object on B, and then B will be a dead chain too.  This is actually much easier and quicker than a loop, if you think about it.  Only loops cause the "pinging" to need to occur.