Security Disciplines for Objective 3: Detection and Recovery
3-4. Disaster Recovery and Business Continuity
Description
A disaster is any event that can cause a significant disruption in operational or computer processing capabilities for a period of time. Disasters can include the loss of a critical file, the rapid spread of a virus, a denial-of-service attack, the loss of a network segment or critical link, or loss of an entire facility or personnel from a fire or bomb. Although the probability of a major disaster is remote, the consequences of an occurrence could be catastrophic, both in terms of operational impact and public image. Disasters have an uncanny habit of occurring at the most inconvenient times, damaging equipment and materials one can least afford to lose.
Disaster recovery focuses on handling the immediate emergency, whereas business continuity takes effect after a disaster and focuses on getting the critical business functions operational and eventually restored to full capabilities. Together, they cover what to do, beginning with the emergency response; continuing through crisis management, prioritized business operations recovery, and detailed recovery; and ending with full business restoration. Knowing what needs to be done before, during, and after a disaster can prevent panic, reduce the extent of the damage, and help in a coordinated recovery effort.
Purpose
The purposes of disaster recovery and business continuity plans are to prevent serious impact, to avoid disruption of services, and to coordinate the recovery tasks so that normal business operations may resume as quickly as possible. Plans are different from one organization to another because risks vary widely, as do the organizational priorities and goals. There is also a wide range of alternatives available in both method and technology. These alternatives vary in rigor (i.e., the security assurance level or the degree of protection that they provide) and cost. In general, rigor and cost are directly proportional—the more rigorous a method, the more it costs. The information system owner should look to methods that provide as high a level of assurance as possible within cost constraints.
Principles
- The amount of time and effort put into a plan should reflect the value of the information or service provided by the organization and the amount of effort required if the system had to be rebuilt from scratch. It is normally much more cost-effective to prevent or minimize damage than to repair it after the fact.
- The disaster recovery plan should address procedures such as employee safety, emergency services notifications, family and employee notifications, operational communications, identification of key personnel, emergency authorizations, power and hardware recovery, media backup and recovery, and maintaining event logs.
- The business continuity portion should address procedures such as manpower recovery, alternative business processing methods, administration and operations, budget for replacements and/or insurance, customer service, identification of key vendors, office supplies, public affairs, and premise recovery. Nontechnical management should own and control the business continuity plan in order to ensure proper funding.
- The plan should be practiced and tested. A failed test of the plan still provides valuable information about the organization and where changes should be made. Testing is also an invaluable tool for training personnel on how they should react in an emergency.
- No matter how good a plan is when first finished, it will almost immediately become out of date. Constant review and updating are required to keep the plan pertinent and useful.
Policies
Once an organization decides on an approach for disaster recovery and business continuity, the policies for that approach should be documented. Documented policy ensures the consistent and comprehensive application of disaster recovery throughout the information enterprise. It should identify scope, methods, standards, and organizational and individual responsibilities. The reader may refer to the following documents for examples of disaster recovery and business continuity policy statements:
- Massachusetts Institute of Technology, Emergency Response System, http://mit.edu; search on “emergency response system.”
- Massachusetts Institute of Technology Business Continuity Plan, http://web.mit.edu/security/www/pubplan.htm.
Best Practices
Disaster Recovery Team—A team needs to be assembled that will respond in the event of a disaster. This team should include a member of management, members of the technology unit that will perform the assessment and recovery, representatives from facilities, and members from the information user community to determine what level of recovery is needed and to verify when recovery is complete. The team takes an active part in developing the plan and carrying it out in the event of a disaster.
Threat/Risk Assessment—A threat is anything that can adversely affect the operation of an organization; e.g., a fire, natural disaster, virus, bomb, or strike. The threat assessment is the process of formally identifying the nature of the threats and the degree of damage each can do to an organization. This includes damage to all assets, including, but not limited to, personnel, facilities, computer systems, and reputation.
The risk assessment takes the threats identified for the organization, assesses the adequacy of the controls in place, determines the expected loss for each threat, and then establishes whether that loss is acceptable to system operations. It also recommends changes to controls to improve the current security protection. Steps include the following:
- Assess the current computing and communications environment, including personnel practices, physical security, operating procedures, backup plans, systems development and maintenance, database security, data and voice communications security, systems security and access control, application controls, security administration, insurance, and personal computers. Inventory all equipment, and make a list of the vendors.
- Define all critical information needed to operate. Retention schedules, federal mandate, state law, or business needs will define this subset of data. Note the location of all critical information. Depending on the criticality of the information, either backups or safe storage containers should be considered. Store backups of critical information off-site.
- Define critical personnel, equipment, facilities, and single points of failure. Try for redundancy, or make arrangements to quickly replace these assets. Potential sources of failure include network, hardware, software, malicious attack, physical damage to the facility, and loss of personnel.
- Assess the insurance needs of the organization and the budget required to purchase replacements.
- Assess any dependencies on critical partners. Utilities, vendors, customers, and building partners are examples.
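The expected-loss step of the risk assessment can be sketched numerically. The following is a minimal illustration, assuming the common convention of estimating annual expected loss as the estimated yearly frequency of a threat multiplied by the estimated loss per occurrence. The threat names, the figures, and the acceptable_loss threshold are hypothetical and are not taken from this guide.

```python
from dataclasses import dataclass


@dataclass
class Threat:
    name: str
    occurrences_per_year: float  # estimated frequency (hypothetical figure)
    loss_per_occurrence: float   # estimated cost of one event (hypothetical figure)

    @property
    def expected_annual_loss(self) -> float:
        return self.occurrences_per_year * self.loss_per_occurrence


def rank_threats(threats, acceptable_loss):
    """Return (name, expected loss, acceptable?) tuples, largest loss first."""
    ranked = sorted(threats, key=lambda t: t.expected_annual_loss, reverse=True)
    return [
        (t.name, t.expected_annual_loss, t.expected_annual_loss <= acceptable_loss)
        for t in ranked
    ]


threats = [
    Threat("virus outbreak", 2.0, 15_000),
    Threat("facility fire", 0.01, 2_000_000),
    Threat("extended power outage", 0.5, 8_000),
]
report = rank_threats(threats, acceptable_loss=10_000)
for name, loss, ok in report:
    status = "acceptable" if ok else "needs mitigation"
    print(f"{name}: {loss:,.0f} per year, {status}")
```

A ranking like this only orders the threats; whether a given loss is acceptable remains a management decision, and losses such as damage to reputation are hard to express as a single number.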
Business Impact Analysis (BIA)—Complete a BIA to identify the critical processes and functions of the organization.
- Set priorities for restoration based on the overall impact by looking at the interdependencies of the departments within the organization.
- Determine maximum acceptable losses, and define the window of time available to resume operations. The analysis will then define the restoration timeline and the possible need to use alternate facilities in different scenarios.
- List resources required to restore those critical functions identified in the BIA. This should include the hardware, software, documentation, facilities, personnel, and outside support needed for recovery. Different strategies could be formed for short-term, intermediate-term, and long-term outages.
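The priority-setting step of the BIA can be sketched as an ordering problem. This is a minimal illustration, assuming each business function has been given a maximum acceptable downtime in hours and a list of the functions it depends on. The function names and hours are hypothetical. Functions are restored only after their dependencies, and among those ready to restore, the one with the tightest window goes first.

```python
import heapq


def restoration_order(functions):
    """Order functions so dependencies come first, then by tightest window.

    `functions` maps a name to (max_downtime_hours, [names it depends on]).
    """
    remaining = {name: set(deps) for name, (_, deps) in functions.items()}
    dependents = {name: [] for name in functions}
    for name, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(name)

    # Start with functions that depend on nothing, keyed by allowed downtime.
    ready = [(functions[n][0], n) for n, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        for child in dependents[name]:
            remaining[child].discard(name)
            if not remaining[child]:
                heapq.heappush(ready, (functions[child][0], child))
    if len(order) != len(functions):
        raise ValueError("circular dependency between business functions")
    return order


bia = {
    "network": (4, []),
    "payroll": (72, ["database"]),
    "database": (8, ["network"]),
    "customer service": (24, ["network"]),
}
plan = restoration_order(bia)
print(plan)
```

This sketch does not pass urgency up the dependency chain: a function with a loose window that supports an urgent one may need its own window tightened by hand.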
Mitigation of Risks—Mitigate risks identified in the risk assessment by implementing new procedures and providing redundancy wherever possible. This includes cross-training personnel on other job duties as well as making plans for extra hardware and backup software.
Store electronic media in protective jackets or media boxes. Consider purchasing data safes (fire-resistant safes specially designed to protect magnetic media from damage caused by magnetism, fire, heat, water, and airborne contaminants such as smoke and dust). A water vacuum or roll of plastic can be extremely useful when dealing with a water leak or malfunctioning sprinkler system.
Power is critical to computing environments. It is common to provide protection of computing equipment through UPS systems, connection to two different power grids, and the use of diesel generators.
Hardware Redundancy—The following techniques are used to provide hardware redundancy:
- Disk Mirroring—Disk mirroring is the duplication of data from one hard disk to another. Mirrored drives operate in tandem, constantly storing and updating the same files on each hard disk. Should one disk fail, the file server issues an alert and continues operating. Because both disks share one controller, a controller failure may deny access to both disks.
- Disk Duplexing—This is similar to disk mirroring except each drive has its own controller circuitry. Should one disk or controller fail, the file server issues an alert and continues operating.
- Disk Arrays—These enable the administrator to replace a failed drive while the server is still running, and users can continue operating. The system automatically copies redundant data on the file server to the new disk.
- Hot Backup—Two file servers operate in tandem, and data is duplicated on the hard disks of the two servers. This is like disk mirroring but is across two servers instead of one. If one server fails, the other automatically assumes all operations without any outage.
- Cold Site—A cold site is an emergency facility containing a heating, ventilating, and air conditioning (HVAC) system and cabling, but not computers. When outsourcing, evaluate providers on high availability and disaster tolerance. Such arrangements may be informal (as a reciprocal agreement) or formal (a separate recovery site or a contract with a third-party provider). Cold sites are generally cheaper than hot sites. They should be a reasonable distance away from the main facility to prevent the same disaster from destroying their capabilities as well as the primary facility. Also, they should not be overextended in the number of organizations for which they provide this service. In a massive disaster, all of the organizations will want the facility at the same time.
- Hot Site—A hot site is an off-site facility contracted to have compatible systems ready to restore an organization’s backups and run them as if in their own facility. Hot sites contain computers, backup data, and communication equipment. Written agreements should be signed if contracting with another unit for alternate processing of critical systems in the event of a disaster. Again, they should be a reasonable distance away from the main facility to prevent the same disaster from destroying their capabilities as well as those of the primary facility. They should not be overextended in the number of organizations for which they provide this service.
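The difference between disk mirroring and disk duplexing described above comes down to whether the two disks share a controller. The following is a minimal sketch of that reasoning, not a model of any particular product; the controller identifiers and the volume_available helper are hypothetical.

```python
def volume_available(disks, controllers):
    """Return True if at least one healthy disk sits behind a healthy controller.

    `disks` is a list of (disk_healthy, controller_id) tuples;
    `controllers` maps a controller_id to True (healthy) or False (failed).
    """
    return any(healthy and controllers[cid] for healthy, cid in disks)


mirrored = [(True, "c0"), (True, "c0")]  # two disks, one shared controller
duplexed = [(True, "c0"), (True, "c1")]  # each disk has its own controller

controller_failure = {"c0": False, "c1": True}
print(volume_available(mirrored, controller_failure))  # False
print(volume_available(duplexed, controller_failure))  # True

one_disk_failed = [(False, "c0"), (True, "c0")]
print(volume_available(one_disk_failed, {"c0": True}))  # True
```

Both arrangements survive a single disk failure; only duplexing survives the loss of a controller.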
Software Redundancy—There are several different types of data backups. Determine the level and frequency of backups (e.g., daily incremental backups with weekly full backups). Consideration should be given to using more than one technique to better ensure the information gets backed up promptly.
- Full Backups—A full backup copies all files on a hard disk to a tape or other storage medium. Full backups are used for total system recovery and are often done once a week.
- Differential Backups—These are done only for the files that have been changed or added since the last full backup. Earlier versions of these files are replaced in each new differential backup. Differential backups are often done nightly.
- Incremental Backups—These are completed only for the files that have changed or been added to a system since the last backup and are often done whenever work is finished on the computer. These backups use less storage space and are faster to run. They are generally used to aid in the recovery of old versions of files and the restoration of file integrity when files become corrupted.
- Off-Site Storage—At least two copies of server backups should be made. One copy is kept on-site to restore files. The second backup should be stored off-site, or an electronic tape vaulting service should be used. A mutual agreement should be signed with the off-site facility to ensure that it provides the security needed to protect the information at the same level as that provided by the primary facility. Fire protection, air conditioning, heating, moisture control, availability, and other security factors should be considered. Regularly scheduled delivery of the backup media will help ensure the backups are available when needed. Backup and recovery functions should be limited to the administrator and alternate.
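The backup types above differ in what must be retrieved at restore time. The following is a minimal sketch, assuming the definitions given in this section: a differential holds every change since the last full backup, and an incremental holds only the changes since the most recent backup of any kind. The labels and the restore_chain helper are hypothetical, and real backup products may track changes differently.

```python
def restore_chain(backups):
    """Return the backups needed, oldest first, to restore the latest state.

    `backups` is a chronological list of (label, kind) tuples, where kind is
    "full", "differential", or "incremental".
    """
    full_positions = [i for i, (_, kind) in enumerate(backups) if kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available to restore from")
    last_full = full_positions[-1]

    chain = [backups[last_full]]
    for backup in backups[last_full + 1:]:
        if backup[1] == "differential":
            # A differential covers everything since the full backup, so it
            # supersedes any earlier differentials and incrementals.
            chain = [backups[last_full], backup]
        elif backup[1] == "incremental":
            chain.append(backup)
    return chain


history = [
    ("sun", "full"),
    ("mon", "incremental"),
    ("tue", "incremental"),
    ("wed", "differential"),
    ("thu", "incremental"),
]
print(restore_chain(history))
```

With incrementals alone, every backup since the last full one is needed for a restore, which is one reason to combine techniques and to keep the whole chain available off-site.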
Plan Development—Procedures should be documented for various types of disasters, such as fire, flood, extended power outages, bomb threats, chemical spills, and loss of personnel. This phase also includes the implementation of changes to current procedures to help prevent disasters and to support recovery strategies and vendor negotiations with recovery services or off-site storage. Individual responsibilities for members of the Disaster Recovery Team should be defined, and recovery standards are also developed at this stage.
The first priority should always be the safety of personnel. Escape routes and evacuation procedures should be documented and made clear to all personnel, and the availability of adequate medical and first-aid supplies should be ensured.
Testing the Plan—Practice and test the plan. Set up a mock disaster, and work through the plan to discover its weaknesses and make necessary changes. Routinely perform restorations from the various kinds of backups (full, incremental, or differential) to ensure they will work when needed. Plans tested less than once a year will probably not support critical business requirements.
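A routine restoration test can include an automated comparison of the restored files against a reference copy. The following is a minimal sketch using SHA-256 checksums from the Python standard library. The verify_restore helper and the directory layout are hypothetical, and the sketch assumes the reference copy has not changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored copy."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for original in sorted(source_dir.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(source_dir)
        restored = restored_dir / relative
        if not restored.is_file() or file_digest(original) != file_digest(restored):
            problems.append(str(relative))
    return problems
```

An empty list means every file in the reference copy was restored intact. On a live system the source files change after the backup runs, so a more realistic test would compare the restored files against checksums recorded at backup time.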
Plan Maintenance—Regularly review the plan once it is complete. The information within the plan constantly changes. Critical functions, telephone numbers, and job duties change. Even organizational priorities and goals may change.
References
- National Technical Information Service, Gaithersburg, MD: U.S. Department of State, Washington, DC, Domestic Disaster Recovery Plan for PCs, OIS, and Small VS Systems, see http://www.ntis.gov/search/product.asp?ABBR=PB90265240&starDB=GRAHIST.
- Federal Emergency Management Agency. Emergency Management Guide for Business and Industry: A Step-by-Step Approach to Emergency Planning, Response, and Recovery for Companies of All Sizes. Washington, DC: FEMA, 1993. Order from: FEMA Publication Distribution Center, Post Office Box 2012, Jessup, MD 20794. Telephone: (800) 480-2520.
- National Archives and Records Administration, Office of Records Administration. Vital Records and Records Disaster Mitigation and Recovery. College Park, MD: NARA, 1996. Available from: Publications and Distribution Staff (NECD) RM. G-9, National Archives, Washington, DC.


