Many months ago, my manager S asked about a “health check service” for “Project: Sippy-Cup [1].” Something that operations could query to see if my component was still up and running. I rejected the idea of embedding a web server in the component as complete overkill (and really, any embedded web server would swamp the amount of code that actually does the useful work in my component, which just processes one SIP (Session Initiation Protocol) message).
So I did the simplest thing that could possibly work [2]: a simple UDP (User Datagram Protocol) service. It accepts a packet with the string “STATUS” and replies with “OKAY.” It was only a few lines of code, and with netcat [3] I figured it would be a simple matter for operations to do a health check.
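For illustration, here is a minimal sketch of such a UDP responder in Python. It is not the actual code from the component (whose language isn't stated here); the function names, the port in the usage note, and the choice to silently ignore anything other than “STATUS” are all assumptions.

```python
import socket
from typing import Optional


def handle_datagram(payload: bytes) -> Optional[bytes]:
    """Return the reply for one datagram, or None to stay silent."""
    # Tolerate a trailing newline, since tools like netcat often add one.
    if payload.strip() == b"STATUS":
        return b"OKAY"
    return None  # assumption: unknown requests are ignored


def serve_once(sock: socket.socket) -> None:
    """Receive a single datagram and answer it if it is a health check."""
    payload, peer = sock.recvfrom(512)
    reply = handle_datagram(payload)
    if reply is not None:
        sock.sendto(reply, peer)


def run_server(host: str, port: int) -> None:
    """Answer health checks forever on the given (hypothetical) address."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            serve_once(sock)
```

Calling `run_server("127.0.0.1", 9999)` (the port is made up) would start it; any UDP client that sends the string “STATUS” to that address should get “OKAY” back.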
It seems that UDP is too confusing for operations to deal with, so I changed the underlying protocol to TCP (Transmission Control Protocol). It's a bit more complicated to support as I now have to listen and accept connections, but then it should be even easier for operations to handle it with netcat. The protocol still accepts a string of “STATUS” and returns “OKAY”.
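The extra work on the TCP side (listen, accept, then read and reply) can be sketched roughly as follows, again in Python and again as an illustration rather than the real code. The function names, the five-second read timeout, and the assumption that the whole request arrives in a single `recv` are mine.

```python
import socket


def make_listener(host: str, port: int) -> socket.socket:
    """Create a TCP socket that is bound and listening for health checks."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(5)
    return listener


def serve_one_connection(listener: socket.socket) -> None:
    """Accept one connection, answer a STATUS request, then close it."""
    conn, _peer = listener.accept()
    with conn:
        conn.settimeout(5.0)  # assumption: don't wait forever on a silent client
        try:
            # Assumes the short request arrives in one read.
            request = conn.recv(512)
        except socket.timeout:
            return
        if request.strip() == b"STATUS":
            conn.sendall(b"OKAY")
```

A server would call `serve_one_connection` in a loop on the socket returned by `make_listener`. Closing the connection after each reply keeps the exchange to one request and one answer per connection, which is about all a health check needs.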
And it's still apparently too much for operations to deal with. Operations actually asked if they could send a SIP message, and I was like, Wow! If it's easier for you guys to send a SIP message for a health check, more power to you! But my manager nixed that idea and we stuck with the current TCP version, which he feels is the simplest thing that could work.
I'm not sure what operations is actually doing. My manager mentioned that my component was failing the health check, yet when I checked, it was fine (using netcat of course). Yet the logs were filled with errors (“recvfrom: Bad file number” and “poll: Invalid argument”), probably from all the failed attempts by operations to do a health check.
I did ask operations what is sent and how often. What they're sending is right, but they're asking “Areyoustillup?Whyhaven'tyouansweredme?Areyoup?Areyouup?McFly!McFly!Answerme!” before my component has a chance to even answer. I think they're a bit too aggressive. They don't.
Sigh.
[2] http://c2.com/cgi/wiki?DoTheSimplestThingThatCouldPossiblyWork