Problem Finding

I’ve noticed that a number of my colleagues don’t really have the knack for identifying problems and their solutions. I guess that’s not really something that is often taught, so I thought why not write up my method for identifying problems. Once a problem is fully identified and understood, the solution is typically trivial. This is written from the point of view of identifying issues in computer software, but as an expert in one area, I’m sure that also means I’m an expert in all other fields, and the method is directly applicable everywhere. Or at least the idea can provide inspiration for other areas.

Is it really a problem?

The first step is to understand what is wrong. This is particularly important if the problem is reported by a third party; for this, we consider three questions:

  • what did you do?
  • what happened?
  • what did you expect to happen?

The third question is especially important, so we know whether the expectation is in line with the application. It can be useful to consider these questions even if the report doesn’t come from a third party.

Some reporters provide answers to these questions of their own volition; sometimes, we need to drag the answers out of them. A customer wrote me the other day, showing a screenshot with unexpected billing information. He explained what he expected, showed what he actually saw, and explained how the customers were set up. It was an excellent error report, and we only had to confirm that his expectation did not match how the application works, and provide suggestions for how to accomplish what he wanted within the application.

Auxiliary information can be useful or useless depending on the reporter; if a reporter provides their own theory about what is wrong without having the technical knowledge to back it up, it can more often than not safely be ignored. I had an associate repeatedly report that a web host was slow because WordPress wasn’t working. After getting him to answer the three questions above, I was able to establish that it was a case of a caching plugin interfering with a theme.

Is it important?

Many problems are resolved with a restart. Turning it off and on again fixes 90% of problems (and turning it off and leaving it like that fixes 100%). Never start a big investigation of a problem that occurs only once unless the failure is critical or expensive. Restart and see if it happens again.

If a problem happens more than once, it is worth considering whether it actually needs a fix: is it a matter of wrong expectations, or is it easily avoided with a work-around? Sometimes, that’s sufficient, especially if the fix is complex or expensive.

Note, though, that if users’ expectations are often or even consistently incorrect, maybe the application is wrong. Perhaps it should communicate the correct behavior better, or perhaps the operation should be changed to be in line with expectations. This is a different problem from the one originally reported, and it is worth starting the process over with this new problem.

Reproduce the problem

If the problem is really a problem and it is important, it is time to start working towards a fix. For this, we will aim to use the scientific method. Normally, the scientific method comprises the steps: question, hypothesis, prediction, test, and analysis. The question is a formulation of the problem we are investigating and is given by the three questions above (“why doesn’t the software perform as expected when doing these steps?”). Sometimes, we can form a hypothesis directly, especially with more experience, but no matter what, we later need to be able to demonstrate that we have fixed the issue, so we need to reproduce it. Reproducing the issue can also provide valuable input for forming a meaningful hypothesis.

Sometimes, reproducing the issue is as simple as executing the steps from “what did you do?” above. Otherwise, it is worth gathering information about the problem: how often does it happen? Does it happen at specific times or randomly? Does system logging or resource monitoring tell us anything? At least get a relatively precise time stamp for one occurrence and check the logs/monitoring around that time. If the problem is reported by a customer, ask them to try keeping track of this, or at least one concrete time stamp. If you’re encountering the issue yourself, also pay attention to when the problem happens. Some problems are resource or data bound and will only happen after running for a certain amount of time, with a specific amount or configuration of data, with certain resource constraints, or with a certain load. Take these into account when trying to reproduce issues (reduce memory limits, use a clone of production data or make production-like test data if this is not possible, leave the application running overnight, use artificial load to simulate – or exaggerate – production, …).
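As a rough illustration of the data-bound case, a reproduction does not have to wait for a clone of production data: the sketch below (the class name, row count, and row shape are all made up) generates a roughly production-sized dataset, and running it with a reduced heap, e.g. java -Xmx64m, should trigger the failure much sooner than production would.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reproduction of a data-bound failure: instead of waiting for a
// clone of production data, generate input of roughly the same size and shape.
// Run with a reduced heap, e.g. java -Xmx64m ReproduceReport, to hit the
// OutOfMemoryError quickly.
class ReproduceReport {

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        // Production reportedly processes a few million rows; synthetic rows of
        // similar width stand in for the real data.
        for (int i = 0; i < 5_000_000; i++) {
            rows.add("row-" + i + ";some;fake;columns;to;match;production;width");
        }
        System.out.println("Loaded " + rows.size() + " rows into memory");
    }
}
```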

Distill the problem

Once you have a reliable way to reproduce the problem, try reducing it to the minimal components. Can you reduce the number of steps? The data needed? The external constraints? Any factor you remove gives you insight into the problem and leaves fewer variables to test later. Simplification also makes reproduction for testing purposes faster.

Don’t spend too much time on this, though. The goal is to get to a hypothesis, and it’s better to quickly form a (good) hypothesis and test it than to spend days coming up with a great one. Even if you have to reject a hypothesis, that can provide insight for further distilling the problem and forming a better hypothesis. The goal is to find and fix the issue as fast as possible, so time spent distilling the reproduction should be gained back in shorter time to diagnose and test the issue. Be aware of your optimism bias, though: you are going to need more iterations to diagnose and fix this than you initially think. While a complex 10-step process requiring 5 applications is annoying, and it may seem more efficient to just fix it and go thru the process once rather than writing a proper test, I have seen way too many people with a testing loop of several minutes for a simple issue (changing the application, running the application – maybe even deploying the application to some external environment – manually setting up a complex scenario) before realizing they have now done this time-consuming loop 20 times.

Optimally, you can distill the problem down to a single failing unit test. That’s the dream: you can easily retest and you know you are done when the test no longer fails. Adding the test to your test suite also allows you to avoid reintroducing the issue later.
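As a rough sketch of what such a distilled test could look like, here is the billing example from earlier reduced to one failing JUnit test; the class, method, and expected value are invented for illustration, not taken from any real code base.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Hypothetical sketch: the reported billing problem distilled to one failing test.
class BillingTotalTest {

    // Stand-in for the suspected production code: summing line amounts as doubles.
    static double total(double... lineAmounts) {
        double sum = 0;
        for (double amount : lineAmounts) {
            sum += amount;
        }
        return sum;
    }

    @Test
    void tenCentsPlusTwentyCentsIsThirtyCents() {
        // Fails today (the actual value is 0.30000000000000004) and passes once
        // the calculation is moved to BigDecimal or whole cents.
        assertEquals(0.30, total(0.10, 0.20));
    }
}
```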

Form a hypothesis

With a simple way to reproduce the problem, you should be able to form a hypothesis as to the cause. Maybe you can do it straight away, maybe you need more information. Depending on the kind of problem, stepping thru parts of the application may be helpful. Or checking the logs or monitoring. Or attaching a CPU/memory profiler.

The hypothesis must be testable. That is, you must be able to make a prediction. It can be a prediction of a new way to reproduce the issue, a minor change you can make to the reproduction steps to avoid the issue, or a prediction that memory fills up with a particular kind of object. It must be possible for the test to fail, rejecting the hypothesis.

The hypothesis must be useful. It should help you isolate and fix the problem. It is fine to start broadly and narrow down, but each hypothesis should rule out potential explanations, and the final hypothesis should essentially point to the line of code causing the issue (with the test being “if I change this code, the problem goes away”).

Make and test a prediction about your hypothesis

We have already discussed what it means for your hypothesis to be testable. So, make a prediction based on your hypothesis and test it. This should be a simple step, but it is essential. While it is possible to skip many of the previous steps, or only do them implicitly, this one can never be skipped.

Fix the issue

This step has also already been covered; your last hypothesis should always be “if I change this, the problem is fixed,” and in testing that hypothesis you will end up with a fix.

Skip this step

Be aware of information bias, the tendency to seek information that doesn’t affect the outcome. It is a waste of time and can serve as a red herring for your investigation. This seems obvious, but it is worth being explicitly aware of. Does a step need to be executed? Otherwise, skip it. Is a given hypothesis useful? Otherwise, ignore it. It can be easy to fall into the trap of performing tests while stuck just to feel productive, but that is an illusion: if a hypothesis doesn’t aid in discarding possible explanations, it is not productive to test it, even if it feels so. The same goes for retesting already discarded hypotheses. Make sure you test them well the first time, so you can ignore them entirely in the future. It can be good to note down hypotheses, tests, and outcomes for long problem-finding sessions, both for your own good and to communicate to others what you have already tried.

Avoid confirmation bias

Try to be objective when forming hypotheses. Confirmation bias is the tendency to seek information confirming what we already believe and to ignore (or assign less importance to) information that might dispute it. Don’t start with an explanation for the problem and make hypotheses designed to confirm you are right. If you are wrong, you will spend a lot of time testing irrelevant things. It’s fine to make a quick test to see if you have guessed the cause, but if not, take a step back and form hypotheses based on data only.

Treat causes, not effects

Or, “don’t catch exceptions or make null checks.” If code throws an exception, there’s typically a good reason for it. If code does not check that a value is null before dereferencing it, there’s often a reason for that too. Don’t blindly fix such issues with a simple check without understanding why the exception is thrown or the pointer is null. Masking errors is worse than the errors themselves, even if it may seem like a fix in the short term. Masked errors can spread as data corruption over time, and such problems are MUCH harder to diagnose and have a much more severe impact. Try to understand why an apparent assumption fails, and consider whether the assumption was actually wrong or whether the present issue is just a symptom of the real problem.

If the assumption was wrong, and the right fix is to catch an exception or make a null check, make the assumption clear either thru code or comments. Make it easy for the person who comes after you to understand what you are doing. If the assumption was correct, document that too (e.g., by marking variables as @NotNull or switching from plain references to Optionals).
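A minimal sketch of the difference, with invented domain types, might look like this: the assumption is stated in the signature and enforced with a clear message, instead of being hidden behind a silent null check.

```java
import java.util.Objects;
import java.util.Optional;

// Sketch of making an assumption explicit instead of adding a silent null check.
// BillingService, Customer, and Address are hypothetical names for illustration.
class BillingService {

    // Assumption documented in the signature: a customer may legitimately have no
    // billing address, so callers must decide what "absent" means for them.
    Optional<Address> billingAddress(Customer customer) {
        // Assumption documented in code: a null customer here is a caller bug,
        // so fail fast with a clear message instead of masking it.
        Objects.requireNonNull(customer, "customer must not be null");
        return Optional.ofNullable(customer.address());
    }

    // Minimal hypothetical types so the sketch compiles on its own.
    record Customer(Address address) {}
    record Address(String street) {}
}
```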

If the assumption is correct, you have to discard your hypothesis and look for the real cause. It will prolong the problem finding but will make your life much easier later on.

Consider related issues

Having spent a lot of time diagnosing a problem, take an extra couple minutes to consider whether the same problem may occur in other places. We often reuse ideas or even code in different places, and if one of them has a problem, it is likely others do too. While you have everything fresh in mind, it should be quick to check if the same problem appears in other places.

Form good hypotheses

This is the end of the recipe; with the above you should be well on your way to diagnosing and fixing problems. As you grow more experienced, you can skip more steps, but at the end of the day, the final step, forming the hypothesis about what to fix and acting on it, will always be crucial, as will the series of hypotheses leading up to it. With experience, you may have a couple of ideas that allow you to fix the issue in a single step, and that’s fine as long as you are sure you actually fix the issue and not a symptom. But for beginners, or for particularly complex issues, it is important to form a good sequence of hypotheses to avoid wasting time testing things that don’t matter.

It is good to first identify the component that is causing issues. This seems obvious, but it can be easy to assume that the failing component is the one you know best or worked on the most/last, which has led colleagues to spend half or even entire days diagnosing a problem that was really in an entirely different component. I like to follow the flow of data/messages and rule out components along the way. If a system comprises multiple applications, confirm that data passes successfully thru each of them. It’s going to be a long debugging session in a web service if the message is actually thrown out by the reverse proxy or a TLS check. Note that declarative frameworks may discard data so it never reaches your code (like data consistency or security checks performed by code injection or aspects using frameworks like Spring Web or Spring Security).

Then start from the top of the offending piece of code. Use a debugger to single-step thru the code at the top level. Validate data before and after each step. Avoid stepping into methods until you have identified which method call causes the problem. Each statement you step over tests the hypothesis “this step causes the problem,” and rejecting that hypothesis by validating the results means you don’t need to consider any details inside the method. When a line does yield unexpected results, step into that method call and repeat until the line in question is so simple that it is obviously either the cause or it uses other data with unexpected values. If the line is the culprit, it is time to hypothesize that it is the issue and fix it; if not, start from scratch, tracing the origin of the data value that causes the issue.

These steps are of course mostly useful for logic errors. Other issues, especially non-functional ones, are likely caused by external systems or bad algorithms. Make sure you have sufficient monitoring of external systems, including systems not under your control, to identify whether they respond too slowly or incorrectly. Try to gather the information needed to answer our three questions when reporting the issue to the third party, so they have an easier time diagnosing it.
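A minimal sketch of that kind of monitoring, assuming a plain HTTP dependency and a placeholder URL, is simply to record a time stamp, the status, and the latency for each call to the external system:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

// Sketch of monitoring an external dependency we don't control: log latency and
// status per call so slow or wrong answers show up in our own records.
// The URL is a placeholder, not a real endpoint.
class ExternalServiceProbe {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/api/health"))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();

        long start = System.nanoTime();
        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

        // A time stamp, the status, and the latency are exactly the kind of
        // answers the third party will ask for when we report the issue.
        System.out.printf("%s status=%d latency=%dms%n",
                Instant.now(), response.statusCode(), elapsedMillis);
    }
}
```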

If the issue is algorithmic, it is likely not too complex. Business applications typically have one of three problems:

  1. not using streaming,
  2. accidentally turning what should have been linear quadratic, or
  3. accidentally trying to solve an NP-hard problem.

1) happens if we have to process large amounts of data and load it all into memory, which is fine for testing but may fail in production. Often, it is possible to stream the data, processing it in smaller batches and building up the result incrementally. Just always use streams – it prevents problems and makes code more consistent.

2) happens when we accidentally search thru a list for each element or have nested for loops. Consider whether the list is the appropriate data structure, even if you are more familiar with it. Familiarize yourself with abstract data structures and the complexity of their operations.

3) happens if somebody promises to solve problems we cannot solve. NP-hard problems have no known efficient solution and have a hard limit on how much data can be processed, no matter how much hardware is thrown at the problem. Recognizing NP-hard problems takes a computer science degree, but the rule of thumb is that if it involves trying different options, it may just be NP-hard. If you have one of these, there is no simple solution: regroup and consider how much data you really need to process or whether it is acceptable to solve a simpler problem.
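Returning to problem 2 for a concrete sketch (the names and data are made up): the classic fix is to replace a repeated list scan with a set lookup, turning an accidentally quadratic filter into a linear one.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the accidental-quadratic pattern and its fix; the method names and
// data are hypothetical.
class ActiveCustomers {

    // O(n * m): List.contains walks the whole list for every customer id.
    static List<String> slow(List<String> customerIds, List<String> activeIds) {
        return customerIds.stream()
                .filter(activeIds::contains)
                .collect(Collectors.toList());
    }

    // O(n + m): build the lookup structure once, then each check is constant time.
    static List<String> fast(List<String> customerIds, List<String> activeIds) {
        Set<String> active = new HashSet<>(activeIds);
        return customerIds.stream()
                .filter(active::contains)
                .collect(Collectors.toList());
    }
}
```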

Conclusion

So, that’s my method: starting from a proper error description and iteratively trying to reject causes of the issue using testable hypotheses rather than trying to identify the problem in a single step. At the same time, it is important to be aware of the various cognitive biases that keep us from being efficient, most importantly information bias, optimism bias, and confirmation bias, which may cause us to perform unnecessary hypothesis testing, do unnecessary manual labor, or outright test the wrong things.
