Failure is not an option


Mission-critical technology projects are those where failure or an unplanned outage is likely to result in severe loss of revenue, reputation, or customers, and in some instances even loss of life. Even a brief unplanned outage of a mission-critical system can be devastating. In certain situations and markets, failure is measured in millions of dollars of unrecoverable loss per minute. Even in small to mid-size companies, an outage can cost over $40,000/hr in lost productivity, sales, reputation, and customer confidence. The availability concepts outlined here can therefore benefit companies of all sizes.

How can we prevent failures? To answer that, consider the following list of failure causes.

  • Hardware failure – Failure of power supplies, disk, memory, CPU, support infrastructure, or entire systems
  • Software failure – Bugs in software, operating system issues, hardware incompatibilities, configuration problems
  • Network failure – Equipment failure, routing problems, wiring problems, firewall configuration
  • Facilities failure – Failed or insufficient air conditioning, heating, or electrical service
  • Data center failure – Weather, disaster, flood, severed connectivity
  • Supply problems – Lack of available equipment to repair or replace failed items quickly enough, insufficient fuel for backup power generators
  • Human factors – Misconfiguration, mistakes, errors, and sabotage

Redundancy is key to maintaining hardware availability. We want to avoid single points of failure, to ensure that the failure of one part of an infrastructure does not result in the failure of the rest. This applies to entire data centers, network connections, disks, hardware, applications, and support infrastructure. The more critical the infrastructure, the more redundancy and expense are required to maintain its availability at an acceptable level.
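The effect of redundancy on availability can be estimated with some simple arithmetic. The sketch below is illustrative only: it assumes component failures are independent (which shared power, software bugs, or human error can easily violate), and the function names are made up for this example. The $40,000/hr figure is the one quoted earlier in this article.

```python
HOURS_PER_YEAR = 8760.0


def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components where any one is enough.

    Assumes failures are independent, which is optimistic in practice.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - component_availability) ** copies


def chained_availability(availabilities) -> float:
    """Availability of components that must ALL work (each is a single point of failure)."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of outage per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_outage_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly outage cost at a flat hourly cost."""
    return downtime_hours_per_year(availability) * cost_per_hour


if __name__ == "__main__":
    single = 0.99
    paired = redundant_availability(single, copies=2)
    print(f"one server:  {downtime_hours_per_year(single):.1f} h/yr down")
    print(f"two servers: {downtime_hours_per_year(paired):.2f} h/yr down")
    print(f"cost at $40,000/hr, one server: ${annual_outage_cost(single, 40_000):,.0f}")
```

Under these assumptions, a second server takes a 99% available service to roughly 99.99%, while chaining two 99% components that must both work drops the total to about 98%. That is why each single point of failure matters.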

Beyond hardware redundancy, software resilience is also required. This includes security, monitoring, testing, verification, bug fixes, and load balancing. Running several instances of the software concurrently helps meet your requirements, but getting to that point requires careful planning and implementation. Done incorrectly, the very redundancy that should increase availability can itself cause outages. An example is a web application that is load balanced but keeps session information local to each individual server. When a server fails, the sessions open on it are lost and cannot be recovered.
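The session problem is easy to reproduce in a toy model. The following sketch is not a real web stack: the Server and LoadBalancer classes are hypothetical, and a plain dictionary stands in for whatever external session store (a database or cache, for example) a real deployment would use.

```python
class Server:
    """A web server that keeps shopping carts in whatever session store it is given."""

    def __init__(self, name, session_store):
        self.name = name
        self.sessions = session_store
        self.alive = True

    def add_to_cart(self, session_id, item):
        cart = self.sessions.setdefault(session_id, [])
        cart.append(item)
        return list(cart)


class LoadBalancer:
    """Sticky routing, with failover to the first healthy server."""

    def __init__(self, servers):
        self.servers = servers

    def preferred(self, session_id):
        # Deterministic stand-in for a sticky-session hash.
        return self.servers[sum(session_id.encode()) % len(self.servers)]

    def route(self, session_id, item):
        healthy = [s for s in self.servers if s.alive]
        if not healthy:
            raise RuntimeError("no healthy servers")
        target = self.preferred(session_id)
        if not target.alive:
            target = healthy[0]
        return target.add_to_cart(session_id, item)


def cart_after_failover(shared_store):
    """Add one item, fail the server holding the session, then add another item."""
    shared = {}
    servers = [Server(f"web{i}", shared if shared_store else {}) for i in range(2)]
    balancer = LoadBalancer(servers)
    balancer.route("alice", "book")
    balancer.preferred("alice").alive = False  # the server holding the session fails
    return balancer.route("alice", "pen")


if __name__ == "__main__":
    print("local sessions: ", cart_after_failover(shared_store=False))
    print("shared sessions:", cart_after_failover(shared_store=True))
```

With local sessions the cart comes back as ['pen'], because the book was lost with the failed server. With the shared store it comes back as ['book', 'pen']. The load balancer behaved identically in both runs; only where the session data lived changed the outcome.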

Outages due to unintentional human factors can be reduced in a few ways. First, change management procedures should enable stakeholders to review proposed modifications and provide input to the process. If this step is skipped, a change to one part of the company can adversely affect another portion without proper notification. Redundancy can also be used in a few ways here. A development/test environment that enables testing changes without affecting production service provides the capability to discover unforeseen failure conditions. Redundant testing environments reduce production outages significantly, and can be implemented by nearly any size company for a relatively low cost.

Redundant production environments are the next step. These enable staged deployments, where updates are deployed to part of the environment and tested there to validate that they work properly before the entire environment is changed. Production deployments should always have a rollback plan that restores the environment to its last known functional state. This might be as simple as backing up the files to be changed, or it might involve provisioning systems and scripts that deploy changes. Revision control systems are often useful for tracking, testing, and promoting changes from development through production, with the ability to roll back to a known good deployment in case of problems.
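A staged deployment with rollback can be outlined in a few lines. This is a minimal sketch under simplifying assumptions: the "environment" is just a dictionary mapping host names to version strings, and staged_rollout and the health_check callback are hypothetical names, not part of any particular deployment tool.

```python
def staged_rollout(hosts, deployed, new_version, health_check, canary_count=1):
    """Deploy new_version to a few hosts first, then the rest.

    `deployed` maps host name -> version and is modified in place.
    If any stage fails its health check, every host is restored to the
    versions recorded before the rollout began, and False is returned.
    """
    known_good = dict(deployed)  # the rollback plan: remember the working state
    canaries, remainder = hosts[:canary_count], hosts[canary_count:]

    for stage in (canaries, remainder):
        for host in stage:
            deployed[host] = new_version
        if not all(health_check(host, deployed[host]) for host in stage):
            deployed.clear()
            deployed.update(known_good)  # roll back everything touched so far
            return False
    return True


if __name__ == "__main__":
    hosts = ["web1", "web2", "web3"]
    environment = {host: "1.0" for host in hosts}

    # Hypothetical check: pretend version 1.2 is broken on every host.
    def check(host, version):
        return version != "1.2"

    print(staged_rollout(hosts, environment, "1.1", check), environment)
    print(staged_rollout(hosts, environment, "1.2", check), environment)
```

In the second call the canary host fails its check, so the remaining hosts never receive the bad version and the canary is restored to 1.1. In a real environment the health check would exercise the application itself, and the rollback step would redeploy from backups or revision control.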

To prevent sabotage, security is critical. Good security procedures take a layered approach, where compromise of one layer does not compromise the overall system. Each layer should take a “least privilege” philosophy, where individuals only have access to as much as they require to do their job. Security begins with physical security. Are the systems, and supporting infrastructure, secured from physical access by those who should not have it? Are your data center physical security procedures effective? I’ve seen instances where data center doors were locked, but the doors were installed with hinges exposed to the unsecured side, enabling anyone with a screwdriver to easily gain access to this supposedly secure facility. Other facilities have had biometric locks with vault doors, only to have easy access into the “secure” facility through a raised floor.

Security doesn’t stop at physical security. Networks, systems, and software should be installed with least privilege as well. For especially critical data, requiring multiple approvers to review an access request before the data is unlocked limits the possibility of an individual using data inappropriately. The approval process adds another check, turning misuse from an individual action into one that requires collusion.
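The multiple-approver idea can be expressed as a small rule: data stays locked until enough distinct people, none of them the requester, have signed off. The class below is a hypothetical illustration of that rule, not a real access control system. The names and the default of two approvals are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AccessRequest:
    """A request that unlocks only after enough distinct approvals."""

    requester: str
    resource: str
    required_approvals: int = 2  # assumed default for this sketch
    approvers: set = field(default_factory=set)

    def approve(self, approver: str) -> None:
        # Self-approval would turn the check back into an individual action.
        if approver == self.requester:
            raise PermissionError("requesters cannot approve their own request")
        self.approvers.add(approver)  # a set, so repeat approvals count once

    @property
    def unlocked(self) -> bool:
        return len(self.approvers) >= self.required_approvals


if __name__ == "__main__":
    request = AccessRequest(requester="dana", resource="payroll-db")
    request.approve("lee")
    request.approve("lee")      # the same approver again changes nothing
    print(request.unlocked)     # still locked
    request.approve("sam")
    print(request.unlocked)     # two distinct approvers: unlocked
```

Storing approvers in a set is the design choice that matters here: one person approving twice still counts once, so unlocking the data really does require a second person.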

Having procedures to stop denial of service attacks is another aspect of effective security and of maintaining service availability for legitimate users. These may consist of using firewall capabilities, asking a network provider to block the rogue traffic upstream, updating software, or making infrastructure changes.

Facilities also need to be sufficient to support the infrastructure. The data center needs to have sufficient power and cooling, including backup power and cooling. Common failure points are uninterruptible power supplies with batteries that are insufficient or too old to handle the load, generators that cannot provide sufficient power, equipment that is not attached to circuits with backup power, and HVAC systems that do not provide enough cooling, especially when a unit fails.

Maintaining data in multiple geographical data center locations can result in latency issues, where one facility has newer data than another. Depending on the locations and distances involved, the speed of light alone may impose latencies that exceed acceptable limits. There are ways to ensure data consistency, but they come at the cost of performance. Redundant data centers should also avoid sharing geographical risks, so that a disaster in one area does not take down all facilities.
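The speed of light limit is easy to quantify. The sketch below computes a lower bound on round-trip time between two sites. It assumes light in optical fiber travels at roughly two-thirds of its vacuum speed and that the cable follows a straight path; real routes are longer and add equipment delays, so actual latency will be higher. The function name and the 4,000 km distance are illustrative.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 2 / 3  # approximate; an assumption of this sketch


def min_round_trip_ms(distance_km: float,
                      velocity_factor: float = FIBER_VELOCITY_FACTOR) -> float:
    """Lower bound on round-trip time over a straight link of distance_km."""
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_PER_S * velocity_factor)
    return 2 * one_way_seconds * 1000.0


if __name__ == "__main__":
    # Two data centers about 4,000 km apart (an illustrative distance).
    print(f"{min_round_trip_ms(4000):.1f} ms minimum round trip")
```

For sites about 4,000 km apart, this gives a floor of roughly 40 ms per round trip. A design that waits for the remote site to confirm every write pays at least that much on each write, which is the performance trade-off that consistency between sites involves.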

These are just a few examples of the issues faced while developing resilient, highly available infrastructures. This document is by no means comprehensive. It is important to have a trustworthy, competent consultant with experience at multiple sites to help improve your mission-critical information architecture. I offer consulting services that deliver comprehensive recommendations specific to your infrastructure and availability needs.

This article was written by Doug Spencer, a technical and business consultant who helps companies utilize technology to improve business operations. Doug’s experience spans many industries, company sizes, and technologies. A public example of Doug’s results is his suggestion and implementation of infrastructure changes that enabled the United States Postal Service to approximately double its online sales annually, while sharply reducing recurring operating costs and improving the availability of online services. Revenue to USPS from its online offerings now exceeds $650M annually, with peak days exceeding $4.7M. Doug has delivered similar successes for many private sector companies, and works with companies of all sizes.

Doug helps companies realize their potential by using his experience to increase revenue and reduce costs. Doug can be found on LinkedIn at http://www.linkedin.com/in/dougspencer