When Things Go South


“If you want to make God laugh, tell him about your plans.” Woody Allen

Happy Friday the 13th.

The single incontrovertible fact in anything is that things are going to go wrong. Oh, you can try to plan for it. You can develop resiliency, elasticity, scalability and all the other ty’s you want, but things are still going to go wrong. Hypervisors are going to fail, deployments are going to botch things even though CI said they wouldn’t, message queues are going to go haywire, and things are going to crash.

Brent Chapman gave a great talk at O’Reilly’s 2008 Velocity conference titled “Incident Command for IT: What We Can Learn from the Fire Department.” After looking at this presentation, our team started using the Incident Command System (ICS) anytime things went south. At first, ICS can seem a little heavyweight, especially in the heat of the moment. You might honestly think, “Do we need to use ICS for this? Let’s just figure out why the message queue isn’t draining!” And sometimes, that’s the right thinking - not everything deserves a full-blown incident commander and incident report. But when things go south - like nausea-inducing, holy *&@^# south - that’s when ICS is valuable. We’ve found that ICS is just the way we tackle problems. Not all problems, just ones of the holy *&@^# kind.

Since 2008, we’ve simply operated in this fashion, not knowing if anyone else was doing what we were doing. In 2012, John Allspaw’s post on Demystifying Site Outages was a great data point. It’s long and it’s detailed, but it’s a transparent record of what happened, why it happened and why it won’t happen again. And more recently I read how Heroku handles incident response, which is another great data point and really ICS in action.

If you’re not doing ICS, try it. And if you don’t know where to start with ICS, here’s a TL;DR:

  • Organize
  • Coordinate
  • Communicate
  • Manage by Objective
  • Provide Constant Updates (even if the update is that there is no update)
