Friday, November 30, 2012

SOP Friday: Disaster Recovery - Simple Restores


What is a disaster? 

For a business, it's pretty much anything that causes a widespread stoppage of business. That might mean a server down, or Exchange down, or the company's commerce web site down. In this article we're going to look at a specific scenario.

Most small businesses only need to deal with three basic types of business interruption: 1) Utility outage (power outage, internet outage); 2) Server crash (the failure of some critical function on the primary server renders it unusable); or 3) Building-wide disaster (fire, hazmat spill). A fourth, less likely scenario would be a region-wide disaster. This would include major floods, major hurricanes, major earthquakes, etc.

Please Note: When you create disaster recovery procedures, they must exist as a set of written documents. Everyone in our business wants to put everything in an electronic format. Please don’t argue with me on this. This is disaster preparedness, fergoodnessake. Let’s say the emergency is a power outage. Or a failed hard disk. Or a fire. Or a flood. Or a hazmat spill that requires the neighborhood to be evacuated.

Get the picture? In an emergency of almost any kind, the kind that would require you to implement this plan, you won’t have access to the plan in electronic format! It is totally fine to have the plan exist primarily in electronic format, but there should be a printed version for the day the lights are out and the water is rising.


What is a Simple Restore?


There are many kinds of disaster recovery. In the most complicated and expensive, the goal is to completely recreate the failed network in every detail. This doesn't work well in small businesses. If your old system had a three-year-old server and 50 machines ranging from Windows 7 to XP, you are very unlikely to recreate that environment exactly.

A second type of disaster recovery focuses on Active Directory. The goal is to bring back the server, with all the data and all of the network SIDs for machines and users. It does not require the exact hardware that existed before, but focuses on the most important pieces of the network, which could easily be on new hardware.

The third most common type of disaster recovery involves simply "getting your data back." That includes total server recovery in some cases. Two examples of this are very common: recovering from a failed hard drive and restoring an entire server from tape or disk. These are simple restores.


Rule #1: Go Slow!

Disaster recoveries are always stressful. No matter how good your system is or how often you've tested it, there's always a chance that something went wrong with the last backup. Or the backup software won't load.

And then there's human error. You might overwrite newer data with older data. You might not write-protect a restore medium, and it might get written to as soon as you mount it because the next backup job is waiting for it. Stuff happens.
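As one illustration of guarding against the "newer data overwritten by older data" mistake, here is a minimal Python sketch for a file-level restore. It assumes the backup is a plain copy of files on disk and that modification times are trustworthy. The function name restore_file is made up for this example, and real backup software will have its own options for this.

```python
from pathlib import Path
import shutil


def restore_file(backup_file: Path, target: Path) -> bool:
    """Copy backup_file over target unless target is newer.

    Returns True if the file was restored, False if it was skipped
    because the live copy has a later modification time.
    """
    if target.exists() and target.stat().st_mtime > backup_file.stat().st_mtime:
        # The live file is newer than the backup: leave it alone.
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    # copy2 preserves the backup's timestamps on the restored file.
    shutil.copy2(backup_file, target)
    return True
```

A skipped file is a prompt to stop and think, not a final answer. Note it in your log and decide by hand which copy the client really needs.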

Think. Think. Think. Go slow. Be careful. Be methodical. Know what you know. And once you completely master the technology in front of you and the resources available to you, your chances of success go way up.


Rule #2: Make a Plan

Please do not show up, slap in a new hard drive and then start to think "What should I do first?" You've already done first. Or maybe you did second or third and skipped first altogether.

A plan - a checklist - is critical to success because it guarantees that you will do everything you need to do, and in the order you need to do it. It also makes sure you don't skip the "small" steps that make life easier in the long run. For example, we like to label hard drives before we put them in the system. This allows us to note exactly where they were as we take them out. No matter how much switching and swapping takes place, we can get the system back to where it was when we showed up on the scene.

You also want to mark the old (bad) hard drive. We take a large permanent marker and make a big X across the top of the drive. Then we write BAD and the date it was removed from service.

If you are using a cloud-based D.R. system, how do you notify them that you need to execute the D.R.? And how do you actually bring that data down to an empty hard drive? What's the plan? What's the checklist?
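To make the "in the order you need to do it" idea concrete, here is a minimal Python sketch of a checklist that refuses to let you skip ahead. The Checklist class and the step names are illustrative only, drawn from the examples in this article. They are not a complete restore procedure, and the printed copy is still the one you rely on when the power is out.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Checklist:
    """An ordered checklist: steps must be completed in sequence."""
    steps: list[str]
    done: list[str] = field(default_factory=list)

    def next_step(self) -> str | None:
        """Return the next step to perform, or None when finished."""
        if len(self.done) < len(self.steps):
            return self.steps[len(self.done)]
        return None

    def complete(self, step: str) -> None:
        """Mark a step done; refuse if it is not the next one in order."""
        expected = self.next_step()
        if expected is None:
            raise ValueError("Checklist is already finished")
        if step != expected:
            raise ValueError(f"Out of order: do {expected!r} before {step!r}")
        self.done.append(step)

    def finished(self) -> bool:
        """True once every step has been completed."""
        return len(self.done) == len(self.steps)


# Hypothetical steps for a failed-drive restore (illustrative only).
restore_checklist = Checklist(steps=[
    "Start the TSR log",
    "Confirm no backup job is waiting for a medium",
    "Label every drive with its position",
    "Mark the failed drive BAD with the date",
    "Install the replacement drive",
    "Restore data from the correct medium",
    "Update this checklist",
])
```

Calling restore_checklist.complete("Install the replacement drive") before the earlier steps raises an error, which is exactly the "you've already done first" mistake the checklist exists to prevent.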


Rule #3: Use a TSR Log
(and Document Everything You Do)

A Troubleshooting and Repair (TSR) log should be started as soon as you begin logging time on the service request. You should make a note with each important action step you take, and you should make a note at least once every 15 minutes. The TSR log will help you document everything you do and will help you troubleshoot in case something goes wrong.
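If you keep the TSR log electronically, the 15-minute rule is easy to check after the fact. The Python sketch below is one possible shape for such a log, not a description of any particular PSA or ticketing tool. The TSRLog class and the service request number are hypothetical, and it assumes notes are entered in chronological order.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# From the rule above: make a note at least once every 15 minutes.
MAX_GAP = timedelta(minutes=15)


@dataclass
class TSRLog:
    service_request: str
    entries: list[tuple[datetime, str]] = field(default_factory=list)

    def note(self, text: str, when: datetime | None = None) -> None:
        """Append a timestamped note (defaults to the current time)."""
        self.entries.append((when or datetime.now(), text))

    def gaps(self) -> list[tuple[datetime, datetime]]:
        """Return consecutive entry times that are more than 15 minutes apart."""
        times = [t for t, _ in self.entries]
        return [(a, b) for a, b in zip(times, times[1:]) if b - a > MAX_GAP]

    def render(self) -> str:
        """Format the log as plain text suitable for printing."""
        lines = [f"TSR log for {self.service_request}"]
        lines += [f"{t:%Y-%m-%d %H:%M}  {text}" for t, text in self.entries]
        return "\n".join(lines)


# Example with a made-up service request number.
log = TSRLog("SR-1001")
start = datetime(2012, 11, 30, 9, 0)
log.note("On site. Server will not boot; drive 1 reports failure.", start)
log.note("Labeled all drives. Marked drive 1 BAD.", start + timedelta(minutes=10))
log.note("Replacement drive installed. Starting restore.", start + timedelta(minutes=40))
print(log.render())
print("Gaps over 15 minutes:", len(log.gaps()))
```

The gap report is only a review aid. The habit of writing the note at the moment you take the action is what actually helps you troubleshoot.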


These three rules will give you a high level of success - and confidence in your process.

Simple restores are just as stressful as any other disaster recovery. They are "simple" only in the fact that you're doing two basic things: 1) Fix the system, and 2) Reload the data. But those two things can be rather big and troublesome on their own, so "simply" doing them might not be simple at all.

The other thing that's nice about simple restores is that you can hand them off to lesser-qualified technicians.

Gulp.

That's right. Once you have the checklist down, you can have it executed by anyone. This is a bit scary until you realize how good it makes YOU. You have to be crystal clear about making sure they understand the dangers and the basic process. They need to verify that a backup job is not waiting for a medium. They need to label all the drives, including the new one and the old one. They need to be able to identify the correct media to restore from.

When you can guide them through the pitfalls and pleasures of data restore while they're on site and you are not, then you know you have a great checklist.


Rule #4: Improve Your Process

When you are finished with any "emergency" data recovery, you should take stock of what went wrong - or could have gone better. Did you know how to do a bare metal restore with this specific backup system? If a medium was bad, did you know the next best restore point, where it was located, and have it available?

Lots of people (myself included) preach that you do not have a backup until you test your restore. Similarly, you do not have a disaster recovery until you test your D.R. That means you have to learn each system you have in place well enough to actually execute a recovery. The time to learn this is before disaster strikes, not after.
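Knowing "the next best restore point, where it was located" is really a matter of keeping a small inventory. Here is a minimal Python sketch of that idea. The RestorePoint fields and the sample locations are assumptions made for illustration. In particular, "verified" here means a test restore from that medium has succeeded, which is the standard this article argues for.

```python
from __future__ import annotations
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RestorePoint:
    taken: date       # date the backup was taken
    location: str     # where the medium lives
    verified: bool    # True if a test restore from this medium succeeded
    available: bool   # True if the medium can be reached right now


def restore_order(points: list[RestorePoint]) -> list[RestorePoint]:
    """Usable restore points, newest first; unusable ones are skipped."""
    usable = [p for p in points if p.verified and p.available]
    return sorted(usable, key=lambda p: p.taken, reverse=True)


# Hypothetical inventory for one client.
inventory = [
    RestorePoint(date(2012, 11, 29), "tape in drive (bad medium)", True, False),
    RestorePoint(date(2012, 11, 28), "tape in server room safe", True, True),
    RestorePoint(date(2012, 11, 23), "offsite weekly tape", True, True),
    RestorePoint(date(2012, 11, 30), "untested disk image", False, True),
]

for point in restore_order(inventory):
    print(point.taken, point.location)
```

The first line printed is the restore point to reach for, and the second is the fallback if that medium also turns out to be bad.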

So whatever needs improvement, make a note. Whatever you needed and didn't have on hand, make a note. Add all these notes to the TSR log and then use them to update the checklist. What's the last item on the checklist? "Update this checklist."


Hurricanes, floods, and fires are real. So are hard drive crashes. You need to know what you need to know before you need it!


Comments welcome.

- - - - -



About this Series

SOP Friday - or Standard Operating Procedure Friday - is a series dedicated to helping small computer consulting firms develop the right processes and procedures to create a successful and profitable consulting business.

Find out more about the series, and view the complete "table of contents" for SOP Friday at SmallBizThoughts.com.

- - - - -

Next week's topic: Hiring vs. Outsourcing

:-)

Still the best Quick-Start Guide to Managed Services: 


by Karl W. Palachuk 

