Craig's Blog

PowerOutage(Runbook)

I want to write about Standard Operating Procedures (SOPs) and Runbooks. These are critical components for every system, yet they are often overlooked or done poorly. Some people combine the two, while others skip them entirely. In this post, I’ll make the case for why they are important, and towards the end, I’ll provide a humorous example that you can use as a reference to create your own. This isn’t a comprehensive guide that explains all the nuances, but it should provide enough information to get you started and motivate you to look up more examples as needed.

An SOP is a detailed, written set of instructions that outlines the step-by-step processes for performing a specific task or activity. SOPs are designed to ensure consistency, quality, and compliance. They serve as reference guides, providing a standardized approach to carrying out operations on a system, which is particularly useful in high-stress situations, such as being paged in the middle of the night. On the other hand, a Runbook is a compilation of documented procedures and operations that are specific to a particular system at a specific point in time. Runbooks provide detailed instructions for executing routine tasks, handling tickets, or performing maintenance and troubleshooting activities. Runbooks commonly refer to multiple SOPs to resolve a specific problem.

The concept of SOPs can be traced back to the early 20th century when industrialization and the rise of mass production necessitated the standardization of processes to ensure consistent quality and efficiency. SOPs gained further prominence in the mid-20th century, particularly in industries like manufacturing, healthcare, and aviation, where adherence to strict protocols was crucial for safety and regulatory compliance. Runbooks, on the other hand, have their roots in the IT industry, emerging in the 1980s and 1990s as computer systems and networks became more complex. As organizations increasingly relied on technology, the need for documented procedures to manage and maintain these systems became apparent.

A well-crafted SOP should be clear, concise, and easy to follow. It should include a detailed step-by-step breakdown of the procedure, along with any necessary prerequisites, roles and responsibilities, and relevant safety or regulatory guidelines. Visuals, such as diagrams, flowcharts, or annotated screenshots, can be highly beneficial in clarifying complex steps. Additionally, SOPs should be regularly reviewed and updated to reflect any changes in processes, regulations, or industry best practices. On the other hand, SOPs should avoid ambiguity, excessive technical jargon, or assumptions about the reader’s prior knowledge. They should not include unnecessary background information or irrelevant details that could confuse or distract the user. A sanity check is that a project manager or a development manager should be able to follow the instructions without asking questions.

A well-designed Runbook should provide a comprehensive and organized set of instructions for managing specific IT systems or infrastructure. It should include detailed procedures for routine tasks, such as system backups, software updates, and configuration changes, as well as troubleshooting steps for common incidents or failures. Runbooks should incorporate relevant diagrams, screenshots, and links to supporting documentation or knowledge base articles. Additionally, Runbooks should reference applicable SOPs for executing specific tasks, ensuring consistency and adherence to established protocols. They should not include unnecessary background information or irrelevant details that could distract from the core procedures. A good starting point is that there should be a Runbook entry for every individual ticket. If you find that multiple tickets point to the same Runbook steps, consider combining those tickets or splitting up the Runbook steps.
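That last rule of thumb is easy to check mechanically. Here is a minimal sketch in Python, assuming you can export a mapping from ticket type to the Runbook entry it points at. The function name `audit_runbook_coverage` and the ticket and entry names in the usage example are hypothetical, made up for illustration.

```python
from __future__ import annotations

from collections import defaultdict


def audit_runbook_coverage(
    ticket_to_entry: dict[str, str | None],
) -> tuple[list[str], dict[str, list[str]]]:
    """Return (tickets with no Runbook entry, entries shared by several tickets)."""
    # Tickets that map to None have no Runbook entry yet.
    missing = sorted(t for t, entry in ticket_to_entry.items() if entry is None)

    # Group tickets by the Runbook entry they point at.
    tickets_by_entry: dict[str, list[str]] = defaultdict(list)
    for ticket, entry in ticket_to_entry.items():
        if entry is not None:
            tickets_by_entry[entry].append(ticket)

    # Entries used by more than one ticket are candidates for merging the
    # tickets or splitting the Runbook steps.
    shared = {
        entry: sorted(tickets)
        for entry, tickets in tickets_by_entry.items()
        if len(tickets) > 1
    }
    return missing, shared


if __name__ == "__main__":
    # Hypothetical ticket types and Runbook entries.
    example = {
        "disk-full": "Free Disk Space",
        "disk-90-percent": "Free Disk Space",
        "cert-expiring": None,
    }
    missing, shared = audit_runbook_coverage(example)
    print("No Runbook entry:", missing)
    print("Shared entries:", shared)
```

A shared entry is not automatically wrong. It is just a prompt to ask whether the tickets are really the same problem.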

Here is a fictional example of a Runbook/SOP combination. I live in Boulder, and for some reason (cough, Xcel, cough), we keep losing power. The other day, I was not home when we lost power, and I got a call asking what to do. I jokingly thought to myself: I need to put a Runbook on the fridge. Well, here is an abbreviated one. While contrived, it is functionally similar to many Runbooks and SOPs that we use for software systems.

Runbook:

  • Power Outage
  • First, confirm the power is out.
      ◦ Follow SOP “Verify Power is Out” to validate the outage.
  • Second, check if others’ power is out as well to determine the blast radius.
      ◦ Look outside and observe if your neighbors’ lights are on or off.
      ◦ Check local utility company’s website or social media for outage updates.
  • If it’s only your power out:
      ◦ Follow SOP “Reset Power by Resetting Breaker” to attempt a local fix.
  • If everyone’s power is out:
      ◦ Follow SOP “Fire Up Generator” to restore power temporarily.

SOP:

  • Verify Power is Out
      ◦ Check the lights in multiple circuits by flipping switches on and off.
      ◦ Go outside and look at the neighborhood. Are your neighbors’ lights on or off?
      ◦ If you have a gadget that runs on batteries (e.g., a flashlight or radio), check if it’s still working.
      ◦ Unplug and replug a device to see if it’s receiving power.
  • Reset Power by Resetting Breaker
      ◦ Locate the control panel (outside on the west side of the house).
      ◦ Identify the tripped breaker(s); they’ll be in the “Off” position. It’s okay if you cannot identify them.
      ◦ Flip the tripped breaker(s) to the “Off” position, then back to the “On” position. When in doubt, flip them all.
      ◦ Make sure to flip the power control breaker at the top of the panel.
      ◦ Wait a few minutes and check if power has been restored.
      ◦ If not, repeat the process or call an electrician for assistance.
  • Fire Up Generator
      ◦ Ensure there’s enough gas in the generator tank (refill if necessary).
      ◦ Locate the generator’s disconnect switch and flip it to the “Off” position.
      ◦ Press the start button on the generator and wait for it to warm up.
      ◦ Once the generator is running smoothly, flip the disconnect switch to the “On” position.
      ◦ Confirm that power has been restored to essential circuits.
      ◦ Remember to run the generator in a well-ventilated area and monitor fuel levels.
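To show how the Runbook and the SOPs above relate, here is a minimal Python sketch of the Runbook as a decision function that returns which SOPs to follow, in order. The function name, its two boolean inputs, and the "power is not actually out" branch are my own assumptions for illustration. The SOP titles are the ones from the list above.

```python
# SOP titles copied from the SOP list; the step text stays in the SOPs themselves.
KNOWN_SOPS = {
    "Verify Power is Out",
    "Reset Power by Resetting Breaker",
    "Fire Up Generator",
}


def power_outage_runbook(power_is_out: bool, neighbors_are_out: bool) -> list[str]:
    """Return the SOP titles to follow, in order, for the observed situation."""
    # Step 1: always confirm the outage first.
    plan = ["Verify Power is Out"]
    if not power_is_out:
        # Assumption: if the power turns out to be on, no further SOP applies.
        return plan

    # Step 2: the blast radius decides which SOP comes next.
    if neighbors_are_out:
        plan.append("Fire Up Generator")
    else:
        plan.append("Reset Power by Resetting Breaker")

    # A Runbook should only reference SOPs that actually exist.
    unknown = [title for title in plan if title not in KNOWN_SOPS]
    if unknown:
        raise KeyError(f"Runbook references missing SOPs: {unknown}")
    return plan


if __name__ == "__main__":
    print(power_outage_runbook(power_is_out=True, neighbors_are_out=False))
```

The point of the split is that the Runbook only holds the decisions, while each SOP holds the steps. You can rewrite the generator SOP without touching the Runbook, as long as the title it references still exists.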

In today’s fast-paced and complex software environment, the importance of having well-documented Standard Operating Procedures (SOPs) and Runbooks cannot be overstated. These essential tools provide you and your team with a standardized approach to executing critical tasks, handling incidents, and maintaining operational efficiency. By following detailed, step-by-step instructions, you can minimize the risk of errors, ensure compliance with regulations, and maintain consistency across various processes. SOPs and Runbooks not only serve as valuable resources for training and knowledge transfer but also act as a safety net during high-stress situations, enabling teams to respond swiftly and effectively without having to reason through complex procedures. By investing in the development and maintenance of these documented procedures, you can pave the way for long-term success and save yourself from the pain of being woken up at 2 AM, trying to remember how that system you built 3 years ago works.
