The Worst Technical Mistake I Ever Made
- Lessons Learned, Episode 24
We've all made mistakes. It's part of learning. But some mistakes are bigger than others. Luckily for me, my biggest technical mistake did not cost a client everything, and it did not cost me a client.
There is a bit of "humble brag" here as well. But that doesn't diminish the absolute stupidity of what I did.
The scenario: The client had a server with a RAID-5 array and we got an alert that a drive had failed. The server worked "fine" with one failed drive, just a bit slower than normal. We did not have a spare because they were about $300 at the time, and the drives were not hot-swap.
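For anyone wondering why a RAID-5 array keeps running with a dead drive: each stripe stores parity alongside the data, so the contents of any one missing drive can be recomputed from the survivors. That recomputation is also why a degraded array is slower. Here is a minimal sketch of the idea, assuming a three-drive stripe and made-up block contents. The function name and data are mine, for illustration only; a real controller works on much larger blocks and rotates the parity across the drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical three-drive array: two data blocks plus parity.
data_a = b"INVOICES"
data_b = b"PAYROLL!"
parity = xor_blocks([data_a, data_b])

# The drive holding data_b fails. Its contents are rebuilt from the survivors.
rebuilt_b = xor_blocks([data_a, parity])
assert rebuilt_b == data_b
```

The same calculation is what a rebuild does for every stripe when the replacement drive goes in. Lose a second drive before that finishes and there is nothing left to compute from, which is why the backup comes first.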
We ordered a drive to be shipped overnight and developed a plan.
Our plan was pretty standard.
1. Verify that the backup is good. A full restore would take a long time and is not optimal, but we can't start a process that might lose data until we have a known good backup.
2. Take the backup out of rotation until we have two new post-replacement backups. Just to be super safe.
3. Power down the server.
4. Install the new drive and verify that an array rebuild is working.
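The point of a plan like this is the order: nothing risky happens until the steps before it are done. If you track work in a script or a ticket tool, that rule can be sketched as a checklist that refuses to skip ahead. This is only an illustration, assuming hypothetical names (`Step`, `ReplacementPlan`) and step labels I made up from the list above. It is not a tool we actually used.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    done: bool = False


class ReplacementPlan:
    """An ordered checklist that blocks any step until all earlier ones are done."""

    def __init__(self, step_names):
        self.steps = [Step(name) for name in step_names]

    def complete(self, name):
        for step in self.steps:
            if step.name == name:
                step.done = True
                return
            if not step.done:
                # An earlier step is still open, so refuse to go further.
                raise RuntimeError(f"Cannot do '{name}' before '{step.name}'")
        raise KeyError(name)


plan = ReplacementPlan([
    "verify backup",
    "pull backup from rotation",
    "power down server",
    "install drive and confirm rebuild",
])

plan.complete("verify backup")
try:
    plan.complete("power down server")  # skips a step on purpose
except RuntimeError as error:
    blocked = str(error)
```

Here `blocked` ends up holding the refusal message, because the backup has not been pulled from rotation yet.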
What happened.
The backup was good. We informed the client of the process and then powered down the server. I removed the old drive.
When I went to install the new drive, it did not slide in easily. I wiggled it a bit and then gave it a good push.
That's when the world went into slow motion as I saw a small "something" fly by. It was metal. I removed the drive and discovered that I had sheared the top off of a tiny capacitor on the drive circuit board.
Freeze. Panic.
Options:
1. Client has at least one more day of living-on-the-edge with degraded performance.
2. Begin the restore from backup, which would also involve an immediate day of unplanned downtime.
3. Fix that drive so they can be up while we wait for the replacement.
In all cases, I just ate the cost of the drive.
Relax. Focus. Succeed. I took a walk around the block. That's when I realized that I'd soldered thousands of capacitors over the years. I can fix this.
I went to an electronics store and bought a replacement capacitor, a soldering iron, and some solder.
Of course, I had to come clean to the client. My contact (John, the operations manager) was a great guy, and fairly technical. He thought I might be insane, but agreed to assist.
John held the light as I un-soldered the remnants of the capacitor and then soldered the new one in place. We made sure everything on the circuit board was nice and flat and installed the drive.
The array started rebuilding immediately and I ordered another replacement.
The aftermath.
First, as I said, I have lots of electronics background and experience. And I did take that drive home and use it in my systems. But I don't have so much skill that I could say with certainty that I hadn't over-heated something with my non-robotic soldering.
I believed it was not right to leave a "fixed" drive in place when the client had paid for a brand new one. So I ate that cost. (The fixed drive worked fine for three years and then got rotated out of existence.)
Second, as soon as the drive array was rebuilt, we took two good backups before we broke the array and replaced the new-but-fixed drive. That caused additional downtime. After the array rebuilt again, and we had a verified backup, I considered the incident closed.
Third, I wanted the billing to be fair. I charged for the one hour I had estimated. All other time was essentially rework or fixing my mistake. I charged the client for one drive, but not for two. I did write lengthy notes in the ticket explaining everything. I'm sure no one but John understood them.
We had a great reputation with the client, and this was a minor annoyance for them. Luckily, they had two isolated networks. One ran the part of the company that provided services to clients and brought in all the money. The one with downtime was only used by five people in the office. They had significant downtime, but no revenue was lost.
I consider this my worst technical mistake ever because it was 100% my fault. I did it. I could have avoided it. I felt completely helpless. I will never forget how I was filled with terror when I saw the top of that capacitor fly off the drive.
It was truly and objectively a stupid thing to do.
And it was completely avoidable. One of my absolutely unbreakable rules of service delivery is Slow Down, Get More Done. If I had taken just thirty more seconds, I would have avoided all of this drama, downtime, and expense. I had a rock solid plan. And my impatience broke the plan.
It's hard to use an incident like this to fix a process. But we did. Now, when we install equipment that slides in, we first verify that everything is in place and that the equipment slides easily with no parts sticking up. I don't think I've ever seen something like this again. But, as they say:
Once bitten, twice shy.
All comments welcome.
-----
This Episode is part of the ongoing Lessons Learned series. For all the information, and an index of Lessons Learned episodes, go to the Lessons Learned Page. https://blog.smallbizthoughts.com/p/lessons-learned-blog-series.html
Leave comments and questions below. And join me next week, right here.
Subscribe to the blog so you don't miss a thing.
:-)