Friday, June 22, 2012

SOP Friday: Troubleshooting Guidelines

IT Troubleshooting: Eight Rules and Thirteen Techniques for Success

Troubleshooting is a very interesting piece of our jobs. It is almost never taught. Yet it is central to our success. After all, if you have no troubleshooting skills, you'll spend lots of time NOT solving problems. Eventually, you'll go do something else for a living. Conversely, the better you are at troubleshooting, the faster you solve problems and the more money you make.

So . . . you guessed it . . . you should have some standard training on troubleshooting. There are three keys to success with troubleshooting. Two of them can be taught. First, there are rules or principles of operation. If you can follow the rules, they will help you make good decisions. Second, there are specific techniques to follow. We use some techniques for one occasion and other techniques for other occasions.

The third element can't be taught: Experience. Experience is the magic ingredient. How do you know which rules apply? How do you know which technique to use? In fact, how do you know exactly what to do without troubleshooting at all? Because you've seen it before. You've done the task many times. You've solved similar problems. And if you're good at applying rules and techniques, then experience just makes you faster and better.

This article focuses on the skills you can teach. These can be imparted to your technicians. And many of the rules are designed specifically to save your company money. The most obvious example of this is the rule to call for help after 60 minutes with no progress.

Note: You should REALLY read the article on the Trouble Shooting and Repair Log from last July. We will use this and refer to it as the TSR Log.

Training on troubleshooting is useful because everyone on your team then takes a similar approach, and because they can learn from each other. The only way anyone gets the experience that will make them great at troubleshooting is to spend time troubleshooting!

Sometimes you go into a troubleshooting situation from the start. For example, something is not working and you are going to find out why. At other times an issue becomes a troubleshooting situation. In other words, you went in to do one thing and ended up finding a bigger problem than you thought.

The only real difference between these two situations is your level of awareness going in. Troubleshooting rules and techniques are not different. The biggest problem with a situation that becomes a troubleshooting job is that you might not realize you've slipped into troubleshooting mode for some time. That's why one of the rules is to ask for help if you're not making progress.

Interestingly enough, troubleshooting experience in other areas can be very helpful in troubleshooting computers and networks. If you fix anything (cars, toasters, etc.), you'll see that many rules and techniques apply to other fields. Now let's look at rules and techniques for troubleshooting IT issues.

Eight Rules for Troubleshooting Hardware, Software, and Networks

1. Try the obvious solution first.

There's a famous quote in medicine: "When you hear hoofbeats behind you, don't expect to see a zebra." (Dr. Theodore Woodward, University of Maryland School of Medicine). That means you should not start by looking for weird problems.

2. Document everything. Use the TSR Log!!! A TSR log must be started at the 1 hour point of any single ongoing issue. Documentation also means that you label everything you can. Seriously: this helps. A perfect example is hard drives. Before you start switching hard drives around, make sure you label them so you keep track of where you started and what you did.

3. Only change one thing at a time. Inexperienced technicians (and many clients) make several changes at once. For example, they apply all the waiting updates, switch out the network cable, and change the IP properties. The obvious problem (whether the issue is fixed or not) is that you have no way of knowing which change made a difference.

In addition, any one of these changes alone may have altered behavior in a way you could measure and document. That knowledge is useful if the issue is not solved: made one at a time, two of those changes could have been eliminated as causes, but made together, none of them can be.

This rule is extremely important. In fact, it's the reason we do not allow clients to take on two major projects at the same time, such as a network migration and changing ISPs. If you have network issues and you've changed two things system wide, you can spend a lot of time chasing the rabbit down the wrong hole.

4. Know what you know. See the blog post about this. You should be completely and honestly aware of the margins of your knowledge. In addition, this rule affects documentation. As you progress, you need to be very clear about solutions you have tried and eliminated from consideration. This keeps you from trying the same thing again and again. Also see the article on working with third party tech support and documenting calls.

5. Don't forget the basics! Even experienced technicians forget this one. If they think they know what the problem is about, they jump in and try a few things. A half hour later, they call someone else on the team who asks "What kind of errors are in the event logs?" Uh . . .. If you haven't checked the event logs, you might be embarrassed to see that the problem is obvious.
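
If the event logs are long, even a rough tally tells you where to look first. Here is a minimal Python sketch, assuming you have exported the log to comma-separated lines of the form Level,Source,Message. That layout and the function name summarize_errors are illustrative assumptions, not the export format of any particular tool.

```python
from collections import Counter

def summarize_errors(lines):
    """Count Error and Critical entries per source, most frequent first.

    Assumes each line looks like "Level,Source,Message" (hypothetical layout).
    """
    counts = Counter()
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        level, source = parts[0], parts[1]
        if level.lower() in {"error", "critical"}:
            counts[source] += 1
    return counts.most_common()
```

A source that shows up at the top of this list over and over is usually worth reading before you try anything else.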

6. Slow down, get more done. I've covered this many times in my blog and my books. There are several angles to this rule. If you're in a hurry, you won't be careful, you'll miss things, you'll forget the other rules, and you'll become frustrated faster. Go slow. Ask for help. Splash water on your face. Take your lunch break. Get a fresh pair of eyes.

7. Use your PSA's knowledgebase capability!!! Your professional services automation tool (Autotask or ConnectWise) is a great place to document issues so you can do research within your own knowledgebase. Has your team solved similar problems? Is there a hotfix already downloaded to the cloud drive? Don't duplicate work of any kind, even research.

8. Ask for help! You're not alone. The maximum time anyone should work on any problem before stopping and calling for feedback or support is 60 minutes. It is critical to the company's profit to not waste time by continually trying the same thing over and over or to simply stumble around hoping to find success.

The first time I took a vacation and handed my clients over to an employee, I told him that he is not alone. He can call HP, Microsoft, Dell, Symantec, me, other technicians, or anyone else who might be able to help. Sometimes a vendor knows the answer but the online support system is horrible (e.g., SonicWall) or the vendor has undocumented fixes that they only hand out if you call in with the problem (e.g., Microsoft). At other times, another tech will have worked on something similar. Or they might just suggest a different approach.
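
The 60-minute limit is simple enough to put in a script or a ticket report. A minimal sketch, assuming you record when work on an issue started; the names HELP_THRESHOLD and needs_help are made up for illustration:

```python
from datetime import datetime, timedelta

# The 60-minute limit from Rule 8 (adjust to your own policy).
HELP_THRESHOLD = timedelta(minutes=60)

def needs_help(started_at, now, threshold=HELP_THRESHOLD):
    """Return True once the time spent on one issue reaches the threshold."""
    return now - started_at >= threshold
```

For example, needs_help(ticket_start, datetime.now()) could drive a reminder to stop and call someone, where ticket_start is whatever start time your own system records.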

The more consistently you apply these rules, the smoother your operation will work. One of the great benefits of experience is that you take note of "weird stuff" you come across. Having a good process will help you isolate the weird stuff, document it, and add it to your internal mental database.

These rules are like your muscles of success. You need to practice them so you'll get good enough that they become an automatic part of your troubleshooting process.

Thirteen Techniques for Troubleshooting Hardware, Software, and Networks

Techniques are different from the rules above. Rules are big-picture guidelines. Techniques are specific approaches or actions you use to find problems and isolate solutions. Everything above continues to apply. Here are some techniques you will employ in troubleshooting. If you've been in technology long, you'll recognize all of them, even if you didn't have labels for them.

The wisdom of an experienced troubleshooter is to know which technique to attempt first, second, etc. Some techniques are opposites of each other. Experience will help you decide how to start and how to proceed.

1. Start with the highest probability. This is simply the action that executes Rule #1 above. But remember that things change over time. Here's an incredibly general but accurate example:

Let's say there's a problem with the network. Period. That's all you know. In the world of Windows NT 4.0, I would say to start with the physical connection (cable, NIC, switch). In the days of Windows 2008/2011 I would say to start with DNS. One of the jokes in my office is "All problems are DNS." That's surprisingly helpful!
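
If you start with DNS, the first check can be as small as asking whether a name resolves. The sketch below uses Python's standard socket.getaddrinfo; the function name check_dns is hypothetical. The resolver is passed in as a parameter so the check can be exercised without a live network.

```python
import socket

def check_dns(hostname, resolver=socket.getaddrinfo):
    """Return (ok, detail): the resolved addresses, or the lookup error."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror as exc:
        return False, f"lookup failed: {exc}"
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address. Deduplicate and sort for stable output.
    addresses = sorted({info[4][0] for info in results})
    return True, ", ".join(addresses)
```

Calling check_dns with the name the client is actually trying to reach, and then with a name you know is good, quickly tells you whether DNS deserves more of your time.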

2. Start with the Physical. This technique is as old as networking. But it also applies to software and hardware issues. For example, if a hard drive is old or beginning to die, it might spend an excessive amount of time writing and re-writing data, marking bad sectors, and moving things around.

We tend to forget this technique today because hardware (and networking) has become extremely reliable. The network-specific equivalent of this technique is "Work your way up the stack." Remember the OSI (Open Systems Interconnection) model? The stack looks like this:

7. Application
6. Presentation
5. Session
4. Transport
3. Network
2. Data link
1. Physical
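
Working up the stack amounts to running your checks from the bottom layer upward and stopping at the first one that fails. A rough sketch, assuming each check is a small function you write yourself that returns True when that layer looks healthy (the name first_failing_layer is hypothetical):

```python
# The seven layers listed above, ordered bottom to top.
LAYERS = [
    "Physical", "Data link", "Network", "Transport",
    "Session", "Presentation", "Application",
]

def first_failing_layer(checks):
    """Run the supplied checks bottom-up; return the first layer that fails.

    checks maps a layer name to a zero-argument function returning True
    when healthy. Layers without a check are skipped. Returns None if
    every supplied check passes.
    """
    for layer in LAYERS:
        check = checks.get(layer)
        if check is None:
            continue
        if not check():
            return layer
    return None
```

Stopping at the lowest failure matters: a dead cable will make every check above it fail too, and those upper-layer failures tell you nothing new.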

3. Start with the program. This is the opposite of Technique #2. Assume that cabling, network, and connections are all stable. This technique is actually more common today than #2. You examine the program, then the operating system, and on down.

4. Apply all the patches, fixes, and updates. It is amazing how many problems go away when you apply all the patches to the hardware, operating system, and program. Keep updating until there's nothing left to add. Now test the issue. Very often, the problem will be gone. You might be dissatisfied that you don't know what caused or fixed the problem. See the last technique, below.

5. What has changed? Whether changed by the client, automatic updates, or even one of your staff, changes are a very frequent cause of problems. This is an easy technique to try: Simply reverse the change if you can. Often you cannot. In such cases, you simply have to find the new conflict and figure out how to solve the problem. But at least you're on the right track.
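
When nobody can tell you what changed, comparing a saved copy of the settings with the current ones is one way to find out. A minimal sketch, assuming you have both sets of settings as simple name/value pairs; the function name diff_settings and the "<missing>" marker are illustrative choices:

```python
def diff_settings(before, after):
    """Return {name: (old, new)} for every setting that differs.

    A setting present on only one side is reported with "<missing>"
    on the other side.
    """
    changes = {}
    for key in sorted(set(before) | set(after)):
        old = before.get(key, "<missing>")
        new = after.get(key, "<missing>")
        if old != new:
            changes[key] = (old, new)
    return changes
```

This only works if a "before" copy exists, which is one more argument for documenting a system while it is still working.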

6. Order matters. Sometimes a problem only happens because a series of actions was taken in a specific order. Change the order and you have no problem. This is the most common cause of people saying that a problem is "random." If there's one thing a computer system is NOT, it's not random. This is why it is critical to get clients to report exactly what happened. In what order did they open programs, save files, etc.?
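
When you suspect order is the issue and the list of actions is short, you can test every ordering. The sketch below assumes you supply a reproduces function that performs the actions in the given order and reports whether the problem appeared. Both names are hypothetical, and with more than a handful of actions the number of orderings grows too fast for this to be practical.

```python
from itertools import permutations

def failing_orders(actions, reproduces):
    """Return every ordering of actions for which reproduces(order) is True."""
    return [order for order in permutations(actions) if reproduces(order)]
```

If only some orderings fail, look at what they have in common, such as one program always being opened before another.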

7. Serial substitution. This means that you change one thing and test. If the problem is not solved, change that thing back. Then change the next thing. If the problem is not solved, change that back. And so forth. The technique works best when you can define the possible variables. For example, in TCP/IP you have the IP address, subnet mask, default gateway, hosts file, DNS, and so forth. You can literally make a list and check things one by one. Documentation is critical to this. Use the TSR log!
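
As a sketch of that loop, assuming the settings fit in a simple dictionary and that you have some works test you can run after each change (both are stand-ins for whatever you are really testing):

```python
def serial_substitution(baseline, candidates, works):
    """Try each candidate change alone against the baseline.

    baseline:   dict of current settings (never modified)
    candidates: list of (setting_name, new_value) pairs to try in order
    works:      function taking a settings dict, returning True if fixed

    Returns (name_of_fix_or_None, log), where log records every attempt.
    """
    log = []
    for name, value in candidates:
        trial = dict(baseline)  # start from the untouched baseline each time
        trial[name] = value     # change exactly one thing
        fixed = works(trial)
        log.append((name, value, fixed))
        if fixed:
            return name, log
        # "Change that thing back": the trial copy is discarded here.
    return None, log
```

The log plays the role of the written record: whether or not a fix turns up, you end with a list of what was tried and what each attempt showed.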

8. Troubleshooting Checklists. We get frustrated when we call tech support for a client and the ISP wants us to verify that the router is plugged in, the cables are good, etc. But we have some checklists ourselves. After all, if a client called you and said that her computer doesn't start, you'd go down your own checklist. Is the UPS plugged in? Is the light red or green?

Creating a few simple checklists can help your technicians to quickly solve some of the most basic problems. This helps them learn troubleshooting and saves you money!

9. Reproduce the problem. This is particularly helpful with intermittent issues. Can you (or the client) reproduce the problem every time? If so, that will give you great clues about where to start. If not, then you need to begin investigating when the problem happens, which programs are being used, and so forth.

10. Have users help with documentation. This follows from Technique #9. If you can't re-create an intermittent problem on demand, then you need to engage the client to help you. Give them a form with instructions. When the problem recurs, they need to tell you what they were doing, which programs were open, etc. Anything they can give you at that very moment may be helpful. They can either keep a paper log, enter notes in the service ticket, or email information to you.

11. Do you have multiple problems? This is essentially the opposite of Technique #1. It will certainly not be the first thing you consider. But at some point, consider the possibility that two things changed, or two things failed at the same time. It does happen. Finding one issue may or may not help to find the other issue. But fixing one issue may make the other issue easier to find!

12. Product-specific gotchas. This technique relies heavily on experience. Sometimes you just "know" that a program behaves a certain way, or comes set up wrong out of the box, or you have to connect a certain cable first. This is knowledge that doesn't really help much with figuring out problems generally.

But whenever you see one specific product, you know what you need to do. This is also one of the reasons that you will be more profitable if you have a small set of products you sell. You get to know those products and can troubleshoot them very quickly.

13. Do you care why? This is a completely different approach. Do you care why the problem happened? Can you simply reboot and fix it? Can you roll back the system to the last restore point and have it work? As left-brained techno-goobers, we want to understand what happened. But as business owners and managers, that may not be the most important question. If you can fix something quickly, who cares why it broke?

Note: If the problem happens again, then you need to care. If the problem recurs, then you need to pick the right technique and troubleshoot the problem until you really fix it.

- - - - -

Remember: Computers are not random. That makes troubleshooting them a very systematic job. Unless you've engaged a random function, computers always do exactly the same thing under the same circumstances. When that does not appear to be true, it's because you haven't explored enough to determine when it does one thing and when it does another.

What remains true under all circumstances is that the rules always apply, and one of these techniques will reveal the solution. There's very little you can do to speed up experience except to tackle as many problems as possible.

Your Comments Welcome.

- - - - -

About this Series
SOP Friday - or Standard Operating Procedure Friday - is a series dedicated to helping small computer consulting firms develop the right processes and procedures to create a successful and profitable consulting business.

Find out more about the series, and view the complete "table of contents" for SOP Friday at

- - - - -

Next week's topic: Front Office Roles and Responsibilities

