Friday, December 06, 2024

The Absolute Truths about Backups! - Lessons Learned

Lessons Learned Episode 8 - The Absolute Truths about Backups!

In previous installments, I mentioned some of the backup systems I'd worked with and managed. They varied from old "mainframe" systems to Novell, Windows, and Unix. The biggest and scariest were truly massive projects - executed every business day.


With the old HP 3000 system, we did a 57-tape reel-to-reel backup five nights a week. That was a complete backup. And the system was extremely stable . . . until a hard drive failed. I told most of that story in Episode two. 

The other monster backup was the massive corporation with 5,000 desktops spread across two buildings on a huge campus. Fiber ran between the buildings, and each building backed up the servers in the other building. I literally don't know how many tapes that took. I managed the team that did the backups. And it was a large array of DAT-72 tapes every night.

Both of those backups were designed, built, and funded well. I liken them to NASA projects. They simply cannot fail. Lives were not at stake, but the value of the data was unfathomable. And part of the zero-failure strategy was an acceptance of the First Absolute Truth about Backups: Something's going to fail. It's probably not the tape. It better not be the design. And that leaves the hardware, the software, and the people. Between the three of these, you can bet that the people fail a lot more than the hardware or software.

Interestingly, over the course of about four years, I don't recall ANY failed backups. With one true crisis (failed hard drive), the "current" backup was in progress during the failure and therefore was an incomplete backup. 

The Second Absolute Truth about Backups: You need multiple restore points. This is related to a belief that will change your perspective for the better: Assume that the most recent backup is unusable. This might be because of hardware failure, an incomplete backup, software failure, media failure, or human error. You don't know what it is. But if you assume the most recent backup will be bad when you design your backup system, you will work to make sure that the backup before that does the job.
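Here is a minimal sketch of that mindset in Python. It is only an illustration: the `RestorePoint` class, the `newest_usable` function, and the sample labels are all made up for this example, and the "is it usable?" check stands in for whatever real verification you run. The point is that the restore logic walks backward through restore points and never assumes the newest one is good.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class RestorePoint:
    """Hypothetical record of one backup run."""
    taken_on: date
    label: str


def newest_usable(points: Iterable[RestorePoint],
                  is_usable: Callable[[RestorePoint], bool]) -> Optional[RestorePoint]:
    """Return the most recent restore point that passes the usability check."""
    for point in sorted(points, key=lambda p: p.taken_on, reverse=True):
        if is_usable(point):
            return point
    return None  # nothing usable: this is the "no backup" case


points = [
    RestorePoint(date(2024, 12, 3), "tuesday"),
    RestorePoint(date(2024, 12, 4), "wednesday"),
    RestorePoint(date(2024, 12, 5), "thursday"),  # most recent, assumed bad
]
bad_labels = {"thursday"}  # stand-in for a real verification result

chosen = newest_usable(points, lambda p: p.label not in bad_labels)
print(chosen.label)  # wednesday
```

If you only keep one restore point, the function has nowhere to fall back to, and that is exactly the design this truth warns against.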

Murphy's law is in full force with backups. When you do lots and lots of backups at each client, and have lots and lots of clients, then "one in a million" things happen regularly. And when you're talking about millions of files, they quickly become hundreds of millions of files. STUFF happens. You might have an incomplete backup on the current backup, a hardware failure on the one before that, a software failure on the one before that, human error on the one before that, etc. STUFF happens. 
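To put rough numbers on "one in a million," here is a small Python sketch. The one-in-a-million rate and the file counts are illustrative assumptions, not measurements, and the math assumes each file is independent of the others, which real systems won't honor exactly.

```python
def chance_of_at_least_one(p_per_item: float, items: int) -> float:
    """Probability that at least one of `items` independent items goes wrong."""
    return 1.0 - (1.0 - p_per_item) ** items


def expected_events(p_per_item: float, items: int) -> float:
    """Average number of items that go wrong."""
    return p_per_item * items


ONE_IN_A_MILLION = 1e-6  # illustrative rate, not a measured one

for files in (1_000_000, 100_000_000):
    chance = chance_of_at_least_one(ONE_IN_A_MILLION, files)
    average = expected_events(ONE_IN_A_MILLION, files)
    print(f"{files:>11,} files: {chance:.1%} chance of a problem, "
          f"about {average:.0f} expected")
```

Under these assumptions, a million files gives you better-than-even odds of hitting the rare event at least once, and a hundred million files makes it a near certainty.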

As I took on more clients, I noticed a trend over and over again. About half of them did not have a working backup - whether they knew it or not. Some had no backup and knew it. Some thought they had a backup, but the system didn't work. Many of these switched out the media (tape or hard drive) every day, unaware that the backup either failed or never finished. And so . . .

The Third Absolute Truth about Backups: There is no such thing as a "set it and forget it" backup. And the best way to rephrase that is, "If you don't test your backups, you don't have a backup!" The worst current iteration is the modern BDR system. You're told to never think about it. And that's one of the main reasons people get into trouble. Nothing works perfectly forever. Entropy works every day.

I said that about half of the clients I met did not have a working backup. Over time, that estimate has grown stronger and stronger. Today, after seeing more than a thousand backup systems, I believe this is a rule you can rely on. And the sad truth is, it absolutely doesn't have to be that way.

This leads to my strong belief that testing backups is the single most important thing you do in your business. Really. No matter what else happens, you can get a company back in business if you have a good, tested backup.
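One way to make a test restore more than "it looked fine" is to compare what came back against the original, file by file. The sketch below is one possible approach in Python, not a description of any particular backup product. The function names and the sample files are invented for illustration, and it assumes you can restore to a scratch folder next to a known-good source.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {p.relative_to(root).as_posix(): file_digest(p)
            for p in sorted(root.rglob("*")) if p.is_file()}


def compare_restore(source: Path, restored: Path) -> Dict[str, List[str]]:
    """Report files that are missing, extra, or different after a test restore."""
    src, dst = tree_digests(source), tree_digests(restored)
    return {
        "missing": sorted(set(src) - set(dst)),
        "extra": sorted(set(dst) - set(src)),
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }


# Tiny demonstration with throwaway folders and made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    source, restored = Path(tmp, "source"), Path(tmp, "restored")
    source.mkdir()
    restored.mkdir()
    (source / "a.txt").write_text("ledger v2")
    (source / "b.txt").write_text("payroll")
    (restored / "a.txt").write_text("ledger v1")  # a stale copy came back
    report = compare_restore(source, restored)

print(report)
```

On live data, some files will legitimately change between the backup and the test, so treat "changed" as a list to review rather than an automatic failure. An empty "missing" list is the part that should never surprise you.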

I used to receive the printed copy of the quarterly Disaster Recovery Journal. I don't know why I didn't keep the issue that came out after the September 11th, 2001 terrorist attacks. DRJ had stories about companies with good and faulty backup strategies. More than one of those companies was based in the Twin Towers and was back in business with fully functioning computer operations the next day - but no employees.

Even today, that kind of recovery is miraculous and expensive, especially with financial data subject to a mountain of compliance law. But it's doable. At the same time, small businesses today can be back in business the next day for a good amount of money - and within a week for a very small investment. They can, IF they have a well-tested backup.

You - the IT consultant - will always have a job for one simple reason . . .

The Fourth Absolute Truth about Backups: Everything gets old. Technology gets old fast. If you've been in business five years, you might have old backups on hard drives you no longer support - because you can't find them. At ten years, you've got clients whose backups include DVDs or even CDs. Fifteen years ago, you were replacing 250 MB Zip disk backups. And on and on it goes.

The folks who make the hardware, software, and media all want to claim that it's "archive" quality. Tapes are well over 99% reliable. But not if you write over them every few days for fifteen years or leave them in a storage unit unattended for ten years. Stuff happens.

Hard drives are awesome but shockingly fragile. Take your favorite brand of hard drives and Google the MTBF (mean time between failures) and the annual failure rate. Gulp. Some of the most popular brands have annual failure rates greater than 3% or 4%. You spin a physical device long enough, something's going to break.
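MTBF is quoted in hours, while failure rates are usually quoted as a percentage per year, so it helps to convert between the two. The Python sketch below uses the common constant-failure-rate (exponential) model, which is a simplification: real drives fail more often when very new and when old. The function names and the sample figures are assumptions chosen for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours, drive assumed to run nonstop


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annual failure rate implied by an MTBF, under a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """MTBF in hours implied by an annual failure rate (0 < afr < 1)."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


# Illustrative inputs only: a spec-sheet style MTBF and a 4% yearly rate.
print(f"1,000,000-hour MTBF -> {afr_from_mtbf(1_000_000):.2%} per year")
print(f"4% per year -> about {mtbf_from_afr(0.04):,.0f} hours MTBF")
```

Under this model, a million-hour MTBF sounds like more than a century but works out to a bit under 1% of drives failing per year, and a 4% yearly rate corresponds to an MTBF of roughly 215,000 hours. Either way, the number describes a population of drives, not a promise about the one in your server.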

Everything gets old. Everything needs to be replaced. There's always a better, faster alternative. And none of them will ever be permanent or live forever.

-- -- --

Bottom Line. If you believe these four Absolute Truths, you will be able to design, build, sell, and maintain great backups. And if you don't, then you'll make one of the most common errors in our business. You'll try to fix a real problem by buying an off-the-shelf solution so you can set it and forget it. Instead of creating a system that works, you'll create one of those backup systems that falls into the wrong half of all backup systems.

The four Absolute Truths about Backups:

  1. Something's going to fail.
  2. You need multiple restore points.
  3. There is no such thing as a "set it and forget it" backup.
  4. Everything gets old. 

Associated rules to live by:

  • The first medium you try to use will be bad/unusable
  • About half of all backup systems are failing in some way right now
  • Testing backups is the single most important thing you do in your business

-- -- --

Final Notes

A few more things I have come to learn about backups. I am a BIG believer in full (complete) backups every night. Remember: If a restore fails, it's probably the operator's fault. You can have a good backup and a bad restore. That's a big reason for doing monthly test restores: Your people know HOW to do it. 

When you have incremental backups, especially if they're different with every client, you add a layer of complication that can lead to long, expensive, unsuccessful restores. Even if the technician is trained to make the backup media read-only before they start, a failed restore means redoing the entire restore.
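Here is a small Python sketch of why the chain matters. It assumes the simplest incremental scheme, where a restore needs the full backup plus every incremental after it, in order. The day labels, the readable/unreadable flags, and both function names are invented for the example.

```python
from typing import List, Optional, Tuple

Backup = Tuple[str, bool]  # (label, readable?)


def latest_from_chain(chain: List[Backup]) -> Optional[str]:
    """Full backup first, then incrementals: stop at the first unreadable link."""
    last_good = None
    for label, readable in chain:
        if not readable:
            break  # everything after a broken link is out of reach
        last_good = label
    return last_good


def latest_from_fulls(backups: List[Backup]) -> Optional[str]:
    """Nightly full backups stand alone: take the newest readable one."""
    for label, readable in reversed(backups):
        if readable:
            return label
    return None


# Same week, same single bad night (Tuesday), two different strategies.
week = [("sun", True), ("mon", True), ("tue", False), ("wed", True)]

print(latest_from_chain(week))  # mon
print(latest_from_fulls(week))  # wed
```

With the chain, one bad Tuesday costs you Wednesday as well. With nightly fulls, the same bad Tuesday costs you only Tuesday.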

"Automated" incremental backups, which are almost universal in BDR and cloud backups, rely on trusting automated systems to never fail. Never is a long time. And every once in a while, there's a story of when this failed.

Again, the more complicated, the higher the likelihood of failure.

Finally, I have learned that almost all bad backups start with technicians who are operating beyond their current level of knowledge. They are very technical and know that they can "figure it out." Sometimes, that on-the-job learning comes at a very high price.

We'll talk more about that next time. Stay tuned for the next installment:  Technicians Who Refuse to Learn

-----

Small Biz Thoughts Technology Community Members: I have a 12-page white paper on how to create a great backup. It's free inside the Community. Just log in. You will find lots of resources if you search for "backup." Or just follow this link to the backup white paper: https://www.smallbizthoughts.org/member-content/create-a-great-backup-system/

Non-members should consider joining today. https://www.smallbizthoughts.org/join/

:-)

