Friday, February 24, 2012

SOP Friday: Server Down Procedures

There's a fun meme going around Facebook about "What I Really Do." That set of juxtapositions is humorous because it's got some grain of truth to it.


Taking care of servers is a little bit like that. What the customer thinks you do, and what other technicians think you do may not be at all what you really do.

Too often (Far FAR too often) techs challenge me during presentations. "Well what happens when the server goes down? You have to work evenings and weekends. You have to stop everything and work on that. You can't schedule your time. You can't plan these things. You're stuck until the server's up."

The day I realized that NONE of that is true, my business became suddenly more successful.


Mini Rant on Server Downtime

First, Servers don't fail.

If your business consists of putting out fires and doing break/fix work on servers YOU sold, built, and maintain, then you're in the wrong business. Either you're selling the wrong server, putting it together wrong, or maintaining it wrong.

Servers don't fail. Servers run and run and run.

What's the wrong server to sell? One that's already out of date, technologically. One that's underpowered when it's new. One that's a cheap piece of crap you've upgraded to "server level." One that you've packed with cheap parts and third-party add-ons instead of manufacturer-approved parts. One you built yourself with individual parts.

We sell HP business class servers. The ML 350 is the workhorse. On average, these servers run three to five years with ZERO issues of any kind.

Maintenance is a no-brainer today. Use Continuum (Zenith RMM), or Level Platforms, or something. Get a tool that monitors everything you do, reports issues, and allows you to patch the system remotely for every single client while you sleep. Do regularly scheduled monthly maintenance. Test your backups. Tune up machines. Love them and they will treat you well.

Many people don't believe me. Mike didn't believe me when he came to work for me. "How is it possible that you never have a server failure?" Now he sells them, builds them, maintains them the right way. And they never fail. Period.

Second, You never have to work evenings and weekends unless YOU choose to.

We have a simple policy: Work between the hours of 5:00 PM and 8:00 AM is not covered by managed service and is billed at twice our regular service rate. All weekend and holiday work is also twice the price.

So . . . fixing a critical issue during business hours is covered by managed services. At 5:01 PM we go to $300 per hour and we burn at that rate until the system is fixed or the client sends us home. In almost every case, the client sends us home.

If the client chooses to have us work all night at $300/hr, we are happy to do that.
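If you like to see a policy written down as logic, here is a minimal Python sketch of the rate rule above. The $150 regular rate is an assumption (half of the $300 after-hours figure), and the holiday list and function names are made up for illustration. Your PSA will have its own way of handling this.

```python
from datetime import datetime, date

REGULAR_RATE = 150.0             # assumption: half of the $300 after-hours rate
AFTER_HOURS_MULTIPLIER = 2.0     # "twice our regular service rate"
HOLIDAYS = {date(2012, 12, 25)}  # hypothetical holiday list


def is_business_hours(moment: datetime) -> bool:
    """True Monday through Friday, 8:00 AM up to (not including) 5:00 PM."""
    if moment.weekday() >= 5 or moment.date() in HOLIDAYS:
        return False
    return 8 <= moment.hour < 17


def hourly_charge(moment: datetime, covered_by_managed_service: bool = True) -> float:
    """Hourly rate billed to the client for work performed at `moment`."""
    if is_business_hours(moment):
        # Critical work during business hours is covered by managed services.
        return 0.0 if covered_by_managed_service else REGULAR_RATE
    # Evenings, weekends, and holidays are never covered.
    return REGULAR_RATE * AFTER_HOURS_MULTIPLIER
```

The point of writing it this way is that the rate depends only on the clock and the calendar, never on how upset anyone is.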

Third, Even critical labor is scheduled and planned.

When your heater goes out during the first freeze of the year, or your air conditioner goes out the first time it hits 100 degrees outside, you have a crisis. And you call the repair place. And what do they say? "We'll get there as soon as we can."

Servers are the same way. You want to calmly come to a stopping point with other projects. You want to put everything in a nice orderly state so you can go work on the server. It might take you an hour to calmly put things in order so you can go address the critical issue and give it your full attention.

Panic serves no one. It doesn't serve the client you're leaving behind and it doesn't serve the client you're going to.

TALK to your client. Don't assume downtime is the end of the world. Let them know that you will be there and give them an idea of when. You can only do what you can do.

Important safety tip for life: You CAN prepare for emergencies. You CAN have a standard operating procedure for when servers go down. You don't have to panic. You don't have to act like this has never happened before. You don't have to make bad decisions because you're in a hurry.

You can have a rational, calm, profitable response to an emergency.


Okay. Having said all that, there are weird instances when something goes wrong on a server.

But it's the exception to the norm. You can't build your business around something that almost never happens. Build your business around the standard processes that happen every single day. Make that profitable.

Then build a process around "emergencies" that is also profitable.


Define a Priority One Incident

I have mentioned the priority system and how you set priorities on several occasions. See Service Ticket Updates and Setting Job Priorities for example.

Priority One issues are never set by a human being. They set themselves. A fire, a flood, a failed hard drive, a motherboard failure. That sort of thing.

I hope you're saying, "Hard drive is a bad example. We have redundancy. No single hard drive failure can bring down a business." Yes. That's true if you sold the right server, built the server, and maintain it properly. If you inherited the server from Cousin Larry's Pretty Good I.T. Shop, then your server is down.

Anyway, Priority One means Critical. A P1 sets itself. That means
• Server down
• Network down
• Email System down
• Server based Line of Business application down
• Fire, flood, earthquake, hazmat spill, etc.

A server down situation is considered any outage of a server or major LOB services such as Exchange or SQL that is not planned. At times, especially over the weekend, a server might reboot due to patch management. As long as the RMM (remote monitoring and management) tool reports that the server is back up in a reasonable amount of time, you may safely ignore the issue during off hours. You'll still want to look and verify that it was "normal" the next business day.
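As a rough sketch, the "is this really a server down?" decision might look like this in Python. The 15-minute grace period is my assumption for "a reasonable amount of time," and the function name and return labels are hypothetical. Nothing here comes from a specific RMM tool.

```python
from datetime import datetime, timedelta
from typing import Optional

REBOOT_GRACE = timedelta(minutes=15)  # assumption: what counts as "reasonable"


def classify_outage(down_at: datetime, now: datetime,
                    back_up_at: Optional[datetime] = None,
                    planned: bool = False) -> str:
    """Decide how to treat a 'server unreachable' alert from the RMM tool."""
    if planned:
        return "ignore"  # planned outages are not server-down events
    if back_up_at is not None and back_up_at - down_at <= REBOOT_GRACE:
        # Probably a patch reboot: leave it alone tonight, confirm tomorrow.
        return "verify_next_business_day"
    if back_up_at is None and now - down_at <= REBOOT_GRACE:
        return "keep_watching"  # still inside the grace period
    return "P1"  # unplanned and down longer than the grace period
```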


Server Down During Normal Business Hours

The normal process for working service requests is from highest priority to lowest priority, and from oldest to newest. So all P1s are more important than all P2s, and older P1s are more important than newer P1s.
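That ordering rule is simple enough to express in a few lines. Here is a minimal Python sketch; the Ticket class and its field names are hypothetical stand-ins for whatever your PSA stores.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Ticket:
    number: int
    priority: int      # 1 = P1 (critical); larger numbers are less urgent
    created: datetime


def work_order(tickets: List[Ticket]) -> List[Ticket]:
    """Highest priority first; within the same priority, oldest first."""
    return sorted(tickets, key=lambda t: (t.priority, t.created))
```

For example, an old P3 never jumps ahead of a brand new P1, and of two P1s the one that came in first gets worked first.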

You should almost never have a P1 in your system. So when it happens, it needs serious attention.

In the normal course of your day, you'll be working on something else when a P1 comes in. Whoever manages the service board needs to acknowledge the client in a timely manner. If it is after 3:00 PM, be sure to let the client know that all work up until 5:00 PM is covered and all work after that is at the after-hours rate.

The service coordinator (I know these all might be the same person) needs to decide who should work the ticket. In most SMB shops this is going to be the owner/tech or the lead tech. That person is doing something else right now. So you need to coordinate having the current job come to an orderly stop or have someone else take over.

You can't leave one paying client to go to another without taking care of the first client. They'll understand that someone has a server down. But you still need to leave their business in an orderly state.

Note: It is critical that everyone on the team constantly check to make sure 1) Tickets are in the right Queue (or service board); 2) Tickets are assigned the correct Priority Level; 3) Tickets have the right service agreement attached to them; and 4) Work type and sub-type are correct (e.g., maintenance vs. add/move/change). If you all check these things constantly for every ticket, then managing workflow around a critical issue will be easier.

If you can connect remotely and work on the issue, you should do this. Remote work provides a faster response and may allow you to solve the issue without a trip. If you have an iLO (Integrated Lights-Out, or equivalent) card installed and activated, you can get to the console level on the server even if the operating system won't load. That allows you to run hardware-level diagnostics and updates as well as boot the operating system into Directory Services Restore Mode. This gets back to selling the right server.

Once you begin working on the P1 ticket, nothing takes higher priority. The only thing that might be more important is an older P1.

Once a P1 is in progress, the Status Update becomes critical. See Service Ticket Statuses to Use and When to Use Them. Your client might never read their monthly reports or invoices. But after a server down situation, they just might want to discuss response time and performance.

So, the ticket moves from New to Acknowledged. Then to Assigned, and probably to Work in Progress (skipping Schedule This and Scheduled).

Once the work is in progress, it will stay there until one of three things happens. First, you fix the problem. This includes a temporary fix that puts the client back in business. You'll close out the P1 and create a P3 ticket to order parts, update software, etc. That properly puts an end to the crisis and moves the remaining work to a scheduled status.

Second, if the client is not willing to pay for after-hours work, then the ticket will move to "Scheduled" for 8:00 AM the next business day. Yes, this does happen.

Third, if in the process of fixing the issue you need to wait for a third-party vendor (for software, hardware, network, etc.), then the status goes to Waiting Materials, Waiting Results, or Waiting Vendor.

There are other weird things that happen, but you don't have to have SOPs for things that you can't foresee. So the client might decide in the middle of all this to buy a new server, put the project on hold until he has more money, or something else. Again, weird stuff you don't expect.
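The status flow for a P1 described above can be sketched as a small lookup table. This is only an illustration in Python. The status names come from this article, "Closed" is implied by closing out the ticket, and the rule that Scheduled and Waiting tickets return to Work in Progress is my assumption. The function name is hypothetical.

```python
WAITING = {"Waiting Materials", "Waiting Results", "Waiting Vendor"}

# Allowed status moves for a P1 ticket (it skips Schedule This and Scheduled
# on the way in, going straight from Assigned to Work in Progress).
P1_TRANSITIONS = {
    "New": {"Acknowledged"},
    "Acknowledged": {"Assigned"},
    "Assigned": {"Work in Progress"},
    "Work in Progress": {"Closed", "Scheduled"} | WAITING,
    # Assumption: work resumes from a scheduled or waiting state.
    "Scheduled": {"Work in Progress"},
    **{status: {"Work in Progress"} for status in WAITING},
}


def move_ticket(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not in the table."""
    if new not in P1_TRANSITIONS.get(current, set()):
        raise ValueError(f"P1 ticket cannot go from {current!r} to {new!r}")
    return new
```

The weird stuff you can't foresee stays out of the table on purpose. Handle it by hand when it happens.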


Server Down During Off Hours Procedure

The procedure for after-hours support has mostly to do with processing the alert so you can execute essentially the same response you would have during business hours.

First, determine what constitutes business hours. For us it's Monday through Friday 8am to 5pm. And we all know that there are "golden hours" of 7-8 AM and 5-6 PM when we might do a little work at standard rates. But that normally happens for scheduled work, not for emergency response.

We assume that you have some kind of system to alert you when a server goes down. Your pager goes off. You get an email. A technician in India calls your cell phone. Something.

Note: If a server goes down outside of working hours, your company will know about it before the client does. If you have more than one technician, your techs need to really use your PSA system (e.g., Autotask or ConnectWise) to communicate with one another. That means accurate status updates and notes.

Second, Tech notes are always critical. Keep track of every single thing you tried, the order in which you did things, what you observed, and the time for everything. If you need to call escalated support (Microsoft, HP, Third Tier, Dove Help Desk, etc.), the more information you have the better.

If for any reason you do not have access to the PSA system, you must take very good notes for when you do have access. You are expected to update the PSA as soon as possible. For now, we assume you do have access to the PSA.

Third, monitor the system and determine whether it's P1. Is the server just rebooting? How do you know? Create the service ticket. You can create a P1 even if it later gets changed to P3. But if you haven't scheduled a reboot, then it's a legitimate P1.

Fourth, Perform the client call down process in the event of a server down.

You should have a note in your PSA about the client call down. Every one of your clients should fill out a simple form that says who should be called first, second, and third. You should have first, second, and (if you can) third phone numbers for each contact.

If you have an alarm system, this is very similar to the form they use. If you do not have access to the required contact phone numbers for whatever reason, you must contact anyone inside your company who can get them for you.

Call Down Procedure:

1) Go to the Company page in your PSA system and look under the section "After Hours Contacts." Start by calling the primary contact and then work your way down the list.

2) If you reach a voice mail box (including the company's general mailbox) the recommended script is as follows:

“Hello this is [insert technician name here] calling from [your company]. We have received a page alerting us that the server [insert server name here] is currently unreachable. This is not a planned outage and we have created a priority one service request. This call is just to inform you of the current status. We will call back with further updates. Our normal hours of operation are Monday through Friday, 8am to 5pm and we will begin working on the issue during those hours.”

3) If you reach a person, convey all of the same information above. In addition, you must ask them if they would like us to begin working on the issue outside of normal business hours. Script:

“As the primary contact, do you wish to authorize off-hours work to be performed for this issue? Off-hours work is not covered under managed services and would be billed out at double rate.” (typically $300 / hour)

IF YES:
Inform the contact that the service request has been assigned a Priority 1, a technician will be assigned, and they will begin working on the issue as soon as possible. That technician will be contacting you and might require access to the site. If so please be prepared to have someone meet our tech at the location.

IF NO:
Inform the contact that the service request has been assigned a Priority 1 and we will begin working on the issue as soon as our office opens. Answer any questions and then conclude the call by reaffirming that we will call again if the status changes.

4) If you can only reach the secondary contact, be aware that the secondary contact may NOT be able to authorize off-hours billable work. Anyone authorized to incur after-hours labor expenses should be noted in the PSA system with an (A) next to their name.
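The call-down steps above can be sketched in Python as follows. This is an illustration only. The Contact class, the dial callback, and its three outcome strings ("person", "voicemail", "no_answer") are hypothetical; in real life the "dial" step is a human being with a phone and the scripts above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Contact:
    name: str
    phones: List[str]          # first, second, and (if available) third number
    authorized: bool = False   # the "(A)" flag: may approve after-hours billing


def call_down(contacts: List[Contact], dial: Callable[[str], str]
              ) -> Tuple[Optional[Contact], bool, List[Tuple[str, str, str]]]:
    """Work down the After Hours Contacts list until a person answers.

    Returns (contact reached or None, whether that contact may authorize
    after-hours work, log of every call attempt for the ticket notes).
    """
    log = []
    for contact in contacts:
        for phone in contact.phones:
            outcome = dial(phone)
            log.append((contact.name, phone, outcome))
            if outcome == "person":
                return contact, contact.authorized, log
            # On "voicemail", leave the scripted message and keep going.
    return None, False, log
```

The log matters as much as the result. Every attempt goes into the ticket notes, whether or not anyone picked up.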


Misc. Notes

We'll have another post on managing your after-hours technician/service coordinator. In many cases, the person hired to catch calls after hours is NOT authorized to begin executing the work. In particular with a server, the person who catches the alert might not be qualified to work the ticket.

Make sure that part of your process for the after-hours tech is to know whether or not they are authorized to begin working a ticket. If not, then you need a call-down for your own technicians to find someone who is authorized.

Any service request that requires third party vendors who are not available should be set to the status of "Waiting on Vendor." Make sure your notes are up to date! In a perfect world, the vendor will arrange a time to start working on the system. Ha ha. I know. Anyway, you might not be the one working the issue when the vendor calls, so perfect notes are required.


Implementation

There are several pieces to this that you should already have in place.

You need a definition of Priorities. See previous article on this.

You need a definition of Statuses. See previous article on this.

You need a policy about hourly rates, when they're applied, etc.

You need to create a call-down form and have each of your clients fill it out. This could be an online form. This information should be stored in your PSA system and available to the after-hours tech/service coordinator.

You absolutely CAN prepare for emergencies. Most of the preparation consists of having processes in place before the crisis hits. Your team needs to know how to massage the service board, how to set priorities and adjust work types, etc. Once everyone on your team gets in the habit of touching all these bases on a regular basis, then a P1 is just another process to execute.


Your Comments Welcome.

- - - - -

About this Series

SOP Friday - or Standard Operating Procedure Friday - is a series dedicated to helping small computer consulting firms develop the right processes and procedures to create a successful and profitable consulting business.

Find out more about the series, and view the complete "table of contents" for SOP Friday at http://www.smallbizthoughts.com/events/SOPFriday.html.

- - - - -

Next week's topic: Schedules and Timelines for Running Your Company


:-)


2 comments:

  1. Anonymous, 5:27 AM

    Karl - what a fantastic blog post. Thanks for sharing!

  2. Anonymous, 8:39 AM

    Excellent post Karl, I really do enjoy your blogs. Very inspiring.
    I have worked for several small business IT service companies who will work at all sorts of hours for next to nothing. It's no wonder they don't grow in size.
    I feel you devalue yourself and your company if you do a "drop everything and work like mad until the client is back and running" job. You wouldn't get a big ICT services company doing that (unless you were paying a good rate).
