Computers, Science, Technology, Xen Virtualization, Hosting, Photography, The Internet, Geekdom And More

A better approach to distributed IPC

Posted on September 29, 2008

There are two golden rules to follow when packaging your own OS distribution:

  • Don’t mangle upstream code
  • Make it work while observing rule #1

When you're making a grid OS, it's hard not to break those rules. Let's take, for instance, a scenario where you have two computers and need to live migrate a virtual machine from computer A to computer B. In this scenario, something central keeps track of which virtual machines live where.

You fire up your handy grid shell and type:

livemigrate database-server from xennode1 to xennode2

Now, things get interesting if xennode1 is in Japan and xennode2 is in Nevada. We’re not going to get into the semantics of packaging up and shipping the virtual block devices from node1 to node2. We’re just going to examine the coordination that is required to pull it off.

The steps below illustrate the process:

  1. Ensure xennode2 has adequate resources to receive database-server
  2. Ensure the user has rights to put database-server on xennode2
  3. Copy block devices from xennode1 to xennode2 (if needed)
  4. Tell database-server on xennode1 to swapoff
  5. Tell xennode2 we need swap ready for database-server
  6. Verify the integrity of the copied block devices (if needed)
  7. Push database-server from xennode1 to xennode2
  8. Re-arrange networking on database-server if needed
  9. Tell database-server on xennode2 to swapon
  10. Ensure database-server is alive and well on xennode2
  11. Update central control so database-server is associated with xennode2
  12. At any failure, revert (if needed) the process and inform the user

Holy twelve step program Batman! :)
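The interesting part of that twelve step program is step 12: every step needs a matching way to back it out. A minimal sketch of that rollback discipline, treating each step as a (do, undo) pair — all the step functions and the `MigrationError` name here are stand-ins, not real Xen tooling:

```python
class MigrationError(Exception):
    """Raised by any step that cannot complete (hypothetical)."""
    pass

def live_migrate(vm, src, dst, steps):
    """Run each (do, undo) step in order; on failure, revert the
    completed steps in reverse order and report the error (step 12)."""
    done = []
    try:
        for do, undo in steps:
            do(vm, src, dst)
            done.append(undo)
    except MigrationError as err:
        for undo in reversed(done):
            undo(vm, src, dst)          # unwind what already happened
        print("migration of %s failed: %s" % (vm, err))
        return False
    return True

# Example: the second step fails, so the first step's undo runs.
def push_fails(vm, src, dst):
    raise MigrationError("push to %s failed" % dst)

steps = [
    (lambda vm, s, d: print("reserve resources on %s" % d),
     lambda vm, s, d: print("undo: release reservation on %s" % d)),
    (push_fails, lambda vm, s, d: None),
]
live_migrate("database-server", "xennode1", "xennode2", steps)
```

The point is that the coordinator, not the individual nodes, owns the undo list — which is exactly the coordination problem the rest of this post is about.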

Even when working with just one machine .. sometimes POSIX signals just don't cut it. It's up to the programmer to catch those signals and use them for a pre-defined purpose. Some signals (the always-fatal ones) can not be caught or handled at all. That means you're SOL when it comes to informing something central that you died .. you just hope that central discovers you're missing in the next five minutes and adjusts.
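You can see both halves of that in a few lines: SIGTERM is catchable, so a handler gets one last chance to phone home, while SIGKILL can never be caught — the kernel refuses the handler outright (the `notified` dict stands in for whatever "tell central" actually means):

```python
import os
import signal

notified = {"central": False}

def on_term(signum, frame):
    # Catchable signal: last chance to tell something central we're dying.
    notified["central"] = True

signal.signal(signal.SIGTERM, on_term)      # fine: SIGTERM can be caught
os.kill(os.getpid(), signal.SIGTERM)        # handler runs, central is told

try:
    signal.signal(signal.SIGKILL, on_term)  # always fatal, never catchable
    kill_catchable = True
except (OSError, RuntimeError, ValueError):
    kill_catchable = False                  # the kernel refuses the handler
```

So a process killed with SIGKILL (or taken out by a power failure) says nothing on the way down — which is why central has to poll for missing nodes at all.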

This means coordination between two xennodes connected via a Gig-E WAN is going to be (at best) frustrating. node2 has to inform node1 of success or failure. If the migration failed, node2 also has to inform the controller (and perhaps node1) about what went wrong.

Patching the Xen userland tools to do more will result in a plethora of merge conflicts to resolve every time you update. Doing so also means you can no longer call it ‘Xen’. The answer? Write your own userland tools or replicate something like the HelenOS IPC in Linux userspace. The latter is actually easier.

Every ‘grid’ service would register itself with central and set up an answerbox (basically a forwarded phone number that consistently points back to the service) and poll it. This would let the grid services on node1 / node2 and central talk, then communicate via primitive signals on each side respectively.
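In-process, the answerbox idea is just a registry of pollable mailboxes. A toy sketch (the `Answerbox` / `Central` names and the message format are made up for illustration — the real HelenOS answerbox is a kernel IPC structure, not a Python queue):

```python
import queue

class Answerbox:
    """The 'forwarded phone number': messages land here, the owner polls."""
    def __init__(self):
        self._q = queue.Queue()

    def deliver(self, msg):
        self._q.put(msg)

    def poll(self):
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None     # nothing waiting; caller tries again later

class Central:
    """Registry mapping grid service names to their answerboxes."""
    def __init__(self):
        self._boxes = {}

    def register(self, service, box):
        self._boxes[service] = box

    def send(self, service, msg):
        self._boxes[service].deliver(msg)

central = Central()
node1 = Answerbox()
central.register("xennode1", node1)
central.send("xennode1", ("swapoff", "database-server"))
print(node1.poll())     # -> ('swapoff', 'database-server')
```

The polling side never blocks, so a node's grid service can interleave mailbox checks with its real work — the same property the non-blocking sockets below the wire give you across machines.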

TCP/IP is a bit of a bottleneck, but if you use non-blocking sockets with a very small payload, the payoff is well worth it. No need to thread or over-allocate on dom0 just to remain responsive. Really, all you're sending is a UUID and some brief arguments .. securing it via a random Blowfish secret makes it safe and fast. At worst, the migration would take 3 seconds instead of 2 after the vbds are handled. If people can put up with iSCSI, they can put up with something this simplistic and useful (I'm not knocking iSCSI so save your hate mail for someone who deserves it).
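The "UUID plus brief arguments" wire format fits in a single small datagram, so a sketch of the non-blocking side is tiny. This uses UDP on localhost for illustration and omits the Blowfish layer entirely; the verb names are invented:

```python
import select
import socket
import uuid

def make_message(verb, *args):
    # The whole payload: a UUID plus brief arguments, nothing more.
    return " ".join([uuid.uuid4().hex, verb] + list(args)).encode()

# One non-blocking socket per node; no threads, dom0 stays responsive.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))
recv_sock.setblocking(False)

send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.sendto(
    make_message("livemigrate", "database-server", "xennode1", "xennode2"),
    recv_sock.getsockname())

# Poll with a short timeout instead of blocking in recvfrom().
ready, _, _ = select.select([recv_sock], [], [], 1.0)
if ready:
    data, peer = recv_sock.recvfrom(512)
    msg_id, verb, *args = data.decode().split()
```

A few hundred bytes per message is why the overhead stays in the "3 seconds instead of 2" range: the coordination traffic is noise next to shipping the block devices.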

Sometimes experimental operating systems offer the best leads to practical solutions on stable operating systems. So perhaps gripcd will soon be born to run at an annoyingly high priority. Something like this running on a grid interconnected by Equinix .. or some other dark fiber would be remarkably fast.

Or, I could just write a bunch of ugly scripts and use SSH :P

