Plan to Fail

I hate achievements. Or, rather, I hate how most achievements are implemented. For example, Apple has an ongoing heart month achievement, which means I should do 30 minutes of exercise on each of the 7 days leading up to February 14th. That’s not a huge problem, except I study on Saturdays, so my activity level is closer to that of a sloth on those days, and, indeed, that means I blew this achievement on the first day, so it is entirely meaningless. The “longest move streak” achievement is even worse: it counts the number of days in a row I have reached my goal of doing 670 kcal worth of exercise. My current highest is 97 days. I’m now on day 5 of 97 to just match that, and then there is just one day of hangovers/SARS 2.0/laziness/study between me and losing it, putting me back to needing almost 100 days just to match the previous achievement. Sure, when I’m on day 90 it will be motivating, but until then it is pie in the sky. The achievement is at pie-in-the-sky level for 97 days out of 100 right now. Heck, even on day 98 I’ll probably break it on purpose just to avoid the stress of failing entirely by failing for just one day.

Some achievements are structured better. For example, Ali Express has a Skinner box that allows you to check in daily to get coins. The coins are useless, but you get 16/day if you check in every day over a period. If you fail to check in one day, you start from scratch, getting only 1 coin the first day, then 5, etc. until getting 16 from day 8 onwards. The improvement is that while there is a cost for failing, you quickly get back on your feet. After you get up to 16 coins/day, the achievement becomes boring, though: you have already “won,” so there is no longer anything to aim for.

I have written a small app I use myself to set and track goals. It is based on continuously setting goals and tracking general progress instead of streaks and achievements. I can set a goal, such as flossing regularly (don’t forget, 3 out of 4 dentists say you die if you don’t floss!). It shows a counter keeping track of how long ago I last did the target action and how long until I should do it again to reach my goal. If I pass that time, the otherwise green counter turns red. At all times, I see two counters: the current and the previous one. If I miss a goal, I only need to do the action twice, and then the reminder is gone. Failing is built in and while I am encouraged to meet my goal each time by the counter, failure disappears quickly. To make failure have some cost, it also keeps track of overall adherence for the past 10, 100 and 1000 times. If I break the goal once by a large margin, it will significantly affect my 10 times goal, but be gone after achieving it 10 more times. It will to a lesser degree affect my 100 and 1000 times averages but remain for longer. Failure to achieve the goal is planned for. I also get various charts and heat maps to satisfy the data nerd in me.
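
The windowed adherence idea can be sketched in a few lines of Python. This is not the code from my app, just a minimal illustration under assumptions: the function names (`adherence`, `completions`) and the scoring rule (a gap within the target scores 1.0, a longer gap scores target divided by gap) are made up for the example.

```python
from datetime import datetime, timedelta


def adherence(timestamps, target, window):
    """Mean score over the last `window` gaps between completions.

    Hypothetical scoring rule: a gap within the target interval scores 1.0,
    a longer gap scores target / gap, so a big miss costs more than a small one.
    """
    # Gaps between consecutive completions, oldest first.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    recent = gaps[-window:]
    if not recent:
        return None  # fewer than two completions: nothing to score yet
    scores = [1.0 if gap <= target else target / gap for gap in recent]
    return sum(scores) / len(scores)


def completions(start, gaps_in_days):
    """Build a list of completion times from a start time and a list of gaps."""
    times = [start]
    for days in gaps_in_days:
        times.append(times[-1] + timedelta(days=days))
    return times


daily = timedelta(days=1)
# One bad miss (a 5-day gap) followed by ten on-time completions.
history = completions(datetime(2024, 1, 1), [5] + [1] * 10)
for window in (10, 100, 1000):
    print(window, round(adherence(history, daily, window), 3))
```

With this made-up history, the 10-times window is back at 1.0 because the miss has scrolled out of it, while the 100 and 1000 windows still carry it at roughly 0.93.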

This difference also applies in IT. I used to work at a company that was implementing a new strategy of “first time right.” Too many customers were getting annoyed that too many deliveries had issues. Of course, this great “strategy” came with absolutely no backing. How was this to be achieved? Where was the money going to come from? How would we measure improvements?

First time right is a bad strategy. Planning to make absolutely no errors is expensive. If I am to guarantee that a delivery has no issues, I need to test it much more thoroughly. Sure, we should all do tests, both unit tests and regression tests, but often deliveries include an action that has to be performed exactly once and never again. Such a delivery will often have a dry-run, but the dry-run is rarely 100% identical to the actual implementation (e.g., it deals with less data, has less time pressure, is run during business hours instead of a maintenance window, etc.). You can strive to make the dry-run closer to reality, but what happens if the dry-run fails? Try again? Recover and proceed? If you recover, it is again not identical to the actual implementation, but starting from scratch takes time.

Even if a dry-run is perfect, ensuring first time right in the strictest sense means you need contingency plans for any issue that may pop up. What if a backup is running and slowing things down? What if a third party does something outside your control that has an impact (e.g., the internet breaks or a service you rely on is down)? When do we stop making contingency plans and start devising workarounds instead?

Increased testing adds cost, and so do detailed contingency plans for increasingly unlikely failure scenarios. Planning to get things right the first time is expensive. There is still a case for this, mind you: if I am sending a rocket to the moon, failures are very expensive and may not be possible to work around. It is good to have contingency plans in those cases. But in cases where it is perfectly possible to work around a failure or fix it on the fly? It is probably much cheaper to just skip all the preparation that is likely to be unnecessary and instead make a plan for failure.

Planning for failure is not failure to plan. I often do a dry run and weed out the obvious failures. I make ad-hoc checklists of any manual actions I need to take. But I do not make detailed plans for all possible failure scenarios. Nor do I rehearse again and again to get everything right. I do reasonable testing: if I have changed one feature, I test that and the features it may have an impact on. I have reasonable unit tests for high-risk code (either important or complex), but not for everything.

I plan for failure. I make sure that a deployment window has a reservation for “things I didn’t need to do during rehearsal but for some reason have to do now.” I make sure to have normal backups, so in case of complete failure, I can back out and revert to the previous situation. I fix issues on the fly and don’t resort to a rollback at the first smell of problems. If there are issues, I make sure to note them down and follow up afterwards if necessary: was this something we could/should have foreseen? Do we need to do things differently in the future? Most of the time, it doesn’t matter, though. It was a one-of-a-kind action, and future situations will be different and not trigger the same issue. I also most likely will not need to be able to reproduce the action in the future, because, again, it was a one-of-a-kind one.

After delivery, I also make sure to monitor the delivered system if applicable. I make sure not to do deployments on a Friday afternoon, but instead early in the week or Thursday afternoon/evening at the latest. I make sure I don’t have my days filled with meetings in the days afterwards, and that I can quickly switch to bug-fixing mode if an issue arises in production. I also make sure that a fix can be quickly rolled out in that case.

If you plan to do everything right, there may be no margin for failure, so if a failure manages to sneak past all your preparation, it may have a much more disastrous impact. If you don’t plan for failure, you may get surprised if something pops up anyway. And it will. Which, curiously, can be a surprise for some even the 5th and 10th time it happens. And every time you get surprised by failure, you have to go into disaster recovery mode, and fixing an issue will take much longer and be much more expensive.

Sure, I may have more issues during roll-out (I don’t actually think I do), but I can resolve them much faster. In my experience, the issues also never escalate to disasters, as they are just acknowledged and fixed within hours or even minutes instead of days or weeks. Add to that the greatly reduced cost from not planning for things that are not going to happen, and the relief of everybody being able to complete a rollout after 30 minutes instead of the planned 90 minutes 4 times out of 5, and planning for failure wins over naively not believing failures will happen.
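
As a back-of-the-envelope sketch of that last point: the 30 minutes, the 90-minute window and the 4 out of 5 are from above, but I am assuming here that the remaining 1 in 5 rollouts uses the entire planned window. The helper name `mean_minutes` is made up for the example.

```python
def mean_minutes(outcomes):
    """Average duration from (number_of_rollouts, minutes) pairs."""
    rollouts = sum(count for count, _ in outcomes)
    total = sum(count * minutes for count, minutes in outcomes)
    return total / rollouts


PLANNED_WINDOW = 90  # minutes reserved for the rollout

# 4 out of 5 rollouts are done after 30 minutes.
# Assumption: the fifth one needs the full planned window.
average = mean_minutes([(4, 30), (1, PLANNED_WINDOW)])
print(f"average rollout: {average:.0f} of {PLANNED_WINDOW} planned minutes")
```

Under that assumption, the average rollout takes 42 minutes, less than half of the window that was reserved, and the reserve is still there for the one time it is needed.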
