Here’s the jar to run in prod and here’s the pdf on troubleshooting it.

This still happens in 2017. It may make sense if you are shipping software to run outside your organization. But in an online/SaaS environment, this falls apart after a couple of months.

Here are some of our team’s guidelines for bringing new components online.

Deployed with Chef, no exceptions

This holds even if the Chef cookbook is just a collection of execute directives. Deploying through Chef adds the new server to our “on-line” inventory, where it is picked up by the monitoring system. It also gets the correct security profiles and other org-wide base configurations.

Any configuration management system, such as Puppet or Ansible, is fine too. I am biased towards Chef Server because of my previous experience and its very good search capabilities, for example: knife search "role:worker AND colo:en1"

Process has to be supervised

The process running the application (Java/Node.js) has to be supervised with daemontools, runit, or systemd. Everything crashes, so the idea is not to debate code quality here. A supervisor reduces the impact of a process crash and removes the need for manual intervention to bring the process back up in the middle of the night. Other parts of the ops stack will catch the crash and log it for later analysis.
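To make the idea concrete, here is a minimal sketch of what a supervisor does at its core: run the process, and start it again when it dies. It is an illustration only, not how daemontools, runit, or systemd are implemented. The function name `supervise` and the `max_restarts` limit are assumptions made so the demo terminates; a real supervisor keeps going indefinitely.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, backoff_seconds=0.1):
    """Run cmd and restart it whenever it exits with a non-zero status.

    Returns the list of exit codes observed, one per run.
    max_restarts exists only so this sketch stops; it is not a real-world default.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.call(cmd)
        exit_codes.append(code)
        if code == 0:
            break  # clean exit, nothing to bring back up
        # A real setup would also record the crash for later analysis.
        time.sleep(backoff_seconds)
    return exit_codes


if __name__ == "__main__":
    # Hypothetical worker that crashes on every start.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crashing, max_restarts=2))
```

In practice you would not write this yourself; the point is that the restart loop lives outside the application, so a crash at 3 a.m. does not need a human.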

/alive

If it’s a web service, it should have an endpoint that can be used to check whether it’s ‘up’. I could use top and check the logs, but this is faster and can be done externally.
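A rough sketch of such an endpoint and an external check, using only the Python standard library. The handler, the plain-text "OK" body, and the helper names `start_server` and `is_alive` are assumptions for illustration; your framework will have its own way of registering a route.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class AliveHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /alive reports success; everything else is a 404 in this demo.
        if self.path == "/alive":
            status, body = 200, b"OK"
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def start_server(port=0):
    """Start the demo service in a background thread (port 0 picks a free port)."""
    server = HTTPServer(("127.0.0.1", port), AliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def is_alive(host, port, path="/alive", timeout=2.0):
    """External check: True only if the endpoint answers with HTTP 200 in time."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", path)
        status = conn.getresponse().status
        conn.close()
        return status == 200
    except (OSError, http.client.HTTPException):
        return False


if __name__ == "__main__":
    server = start_server()
    print(is_alive("127.0.0.1", server.server_address[1]))
    server.shutdown()
```

The check treats a refused connection, a timeout, and a non-200 answer the same way: the service is not ‘up’.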

Connected with Self-Healer aka Boss

My colleague (Pierre Belanger) maintains the awesome boss utility. It runs from cron, checks the state of a process, and attempts to heal it if needed. The healing is mostly a brutal restart, but it’s very effective.

For example, if your Ruby process is running but not listening on its port, it will be restarted. If it doesn’t respond to a heartbeat within a given amount of time, it will be restarted.

The boss utility is smart. It won’t blindly start things that have been intentionally stopped, or disrupt sequencing during deploys. It sits between our monitoring system and the process-to-be-monitored. The boss will try to heal a bad process, and if it can’t, the monitoring system takes over and gets a human.
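The decision flow described above can be sketched as a single cron-style pass. This is not boss’s actual code or interface; the function `heal_once`, the stop-marker file, and the `restart` and `escalate` callbacks are all hypothetical names chosen for illustration.

```python
import os
import socket


def port_is_listening(host, port, timeout=1.0):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def heal_once(host, port, stop_marker, restart, escalate):
    """One pass of a self-healer.

    Returns 'skipped', 'healthy', 'healed' or 'escalated'.
    stop_marker is a hypothetical file that operators or deploy scripts
    create when the process is down on purpose.
    """
    if os.path.exists(stop_marker):
        return "skipped"  # intentionally stopped, do not fight the deploy
    if port_is_listening(host, port):
        return "healthy"
    restart()  # the brutal but effective fix
    if port_is_listening(host, port):
        return "healed"
    escalate()  # healing failed: hand over to monitoring, which gets a human
    return "escalated"
```

A heartbeat check would slot in next to `port_is_listening` in the same way: one more condition that, when it fails, leads to a restart before anyone is paged.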

A bundled health check script

The script is tightly coupled with the application logic. Say you had a web service that returned the md5sum of a string. The script would keep requesting the md5sum of a known string and checking the answer.
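For the md5sum example, a bundled check might look roughly like the sketch below. The name `check_md5_service`, the probe string, and the `fetch_digest` callable (whatever code actually calls your service) are assumptions; the idea is only that the script knows the right answer in advance and compares.

```python
import hashlib


def check_md5_service(fetch_digest, probe="health-check"):
    """Return True if the service's digest for probe matches a local MD5.

    fetch_digest is a hypothetical callable that sends the probe string
    to the service and returns the hex digest it answered with.
    """
    expected = hashlib.md5(probe.encode("utf-8")).hexdigest()
    try:
        actual = fetch_digest(probe)
    except Exception:
        return False  # any failure to answer counts as unhealthy
    return isinstance(actual, str) and actual.strip().lower() == expected


if __name__ == "__main__":
    # Stand-in for the real service call, for demonstration only.
    def fake_service(text):
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    print(check_md5_service(fake_service))
```

Because the check exercises real application logic rather than just a port, it catches the case where the process is up and listening but returning wrong answers.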

The application developers own it, but the OpsTeam helps author the first few versions.

Should the monitoring system do the health check? I don’t like that approach. If the application logic changes, you need to make sure the release cycle is in sync with the monitoring system. By bundling the checker with the runnable code itself, the release cycle dependency is gone. Also, keeping the health check and healer close to the process, within the box, makes monitoring more effective.

But containers?

(Docker) containers with schedulers (Mesos/Swarm) can take care of some of these problems. They provide a different level of abstraction. For example, instead of a process being restarted by cron, the Docker daemon would spin up a new container to replace a dead, non-responsive, or unhealthy one.

Conclusion

Runbooks in the SaaS world need to evolve to keep up with complexity and quick release cycles. A (markdown) doc should focus on the ‘why’, and the ‘how’ should be handled with code.