Agile: Not Falling Enough

Thursday 23rd November, 2017

“You’re not falling over enough”: that was a piece of unsolicited advice a colleague of mine once gave me. Not in any work context; we were on one of the long skiing weekends that a few of us used to go on each year. Apparently the fact that we were two days into the trip and I was the only one of the group who hadn’t yet gone face-first into the snow was, in his eyes, evidence that I wasn’t trying hard enough. Sure, I wasn’t the fastest skier there, but I was keeping up OK, and I didn’t really feel like being one of those people who push themselves to the limit on the slopes, risking a painful tumble for the sake of getting down a few seconds quicker. This turned into something of a philosophical debate that carried on in the bar that night: to what extent do you limit your ability to learn and improve if you don’t allow yourself to make mistakes?

Delivery in IT departments: are you falling over enough?

I’m often reminded of this line when looking at the delivery of an IT department. Are they falling over enough? In many large IT departments, a live system falling over becomes a minor catastrophe accompanied by the scurried panicking of multiple teams all hastily analysing the problem, rolling back the deployment, applying a fix, handling the fallout and eventually – days, weeks, maybe months later – re-deploying the release back into production. A whole industry employing armies of people in large organisations has built up around trying to ensure that nothing ever falls over.

The trouble is, the more measures you put in place in pursuit of the dream that you’ll never fall over, the more painful it becomes when you inevitably do. Imagine a sole trader putting their own website live: oh look, there’s a bug on the front page. Never mind, quick amendment and it’s fixed; job done in 15 minutes. Contrast this with a live defect in most major companies: oh look, there’s a bug on our home page. OK – well, the change we put live this morning that introduced the error was part of a release of £1m worth of software and hardware changes that were tested and deployed en masse, and it will therefore have to be rolled back in its entirety. Months of planning, testing, scheduling, preparation and customer comms, all fatally undermined by having to roll back the release. A big post-mortem exercise into how it didn’t get tested or deployed correctly, how some human error meant the wrong thing got copied across. New controls put in place to ensure it can never happen again, meaning each release now costs more and takes longer.

Breaking the vicious circle of large, critical releases

As companies grow their IT departments from small to large, the tendency towards control and governance means releases become more of a big deal, which usually means they become less frequent, which in turn means that failures are compounded. Each release is so huge, and a failure so disruptive across multiple parts of the business, that it becomes unthinkable it should fail in any way – yet no amount of governance, checks and balances and post-mortems can guarantee that. From there it becomes a vicious circle: the more critical each release becomes, the more work is required to ensure it can’t fail; the more work required, the less regularly releases can happen; the less regular they are, the more change has to be packed into each one to keep the business current, and each becomes more critical still. Eventually every release is almost make-or-break for your business: if you only release quarterly and one release fails, you might go half a year without putting any system changes live, while your more Agile competitors deploy 20, 50, even 100+ releases in that same period. No business being out-paced like that will survive long.
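To put rough numbers on the compounding effect (purely illustrative: assume each change in a release independently has a 2% chance of causing a live defect – a simplification, but it makes the point):

```python
# Illustrative arithmetic only: assumes each change fails independently
# with probability p, which real release risk won't match exactly.

def release_failure_odds(changes_per_release: int, p: float = 0.02) -> float:
    """Probability that a release contains at least one failing change."""
    return 1 - (1 - p) ** changes_per_release

# A small release of 2 changes vs. a quarterly batch of 100 changes.
small = release_failure_odds(2)    # ~4% chance something in it fails
big = release_failure_odds(100)    # ~87% chance something in it fails
```

Under these assumptions the big-bang release is almost certain to contain at least one failure – and because everything ships together, that one failure can drag the other 99 changes back out of production with it.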

De-escalate the situation: make releases smaller

Companies must get away from this cycle and move back to small, iterative, regular releases. Think of it like de-escalating the situation: make the releases smaller, and they are less critical. As they become less critical, you can allow your levels of control and governance to be loosened. As the complex choreography of control and governance is reduced, releases become faster to deploy (and yes, roll back/re-deploy if required), not to mention cheaper and less disruptive. Make the releases smaller still and you can often deploy code without even necessarily having to tell other people you’re doing so. Automate your testing so that at the push of a button you can regression-test downstream feeds and know that they’re OK, and then you can loosen those controls even further. Suddenly, you’re releasing regularly and each release is a minor event, with minor levels of risk.
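As a sketch of what that push-button regression gate might look like (the check names and bodies here are hypothetical, not from any particular toolchain), each downstream feed gets a small automated check, and the release only proceeds if every check passes:

```python
# Hypothetical push-button regression gate: run every registered check
# and only green-light the release if all of them pass.

from typing import Callable

CHECKS: dict[str, Callable[[], bool]] = {}

def check(name: str):
    """Decorator that registers a regression check under a name."""
    def register(fn: Callable[[], bool]) -> Callable[[], bool]:
        CHECKS[name] = fn
        return fn
    return register

# Example checks; real ones would compare live downstream feeds
# against known-good snapshots rather than returning constants.
@check("orders-feed")
def orders_feed_ok() -> bool:
    return True

@check("pricing-feed")
def pricing_feed_ok() -> bool:
    return True

def regression_gate() -> bool:
    """Return True only if every registered check passes."""
    results = {name: fn() for name, fn in CHECKS.items()}
    for name, ok in results.items():
        print(f"{name}: {'PASS' if ok else 'FAIL'}")
    return all(results.values())
```

The point isn’t this particular structure – it’s that once the whole suite runs from one entry point, “are the downstream feeds OK?” becomes a one-button question rather than a week of manual checking.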

This is where the ‘fail fast’ mantra of Agile is recognisable. Don’t be afraid to fall over: to err is human, and as long as you can be reasonably confident you’re not doing irreversible damage to your IT estate or hugely offending your customer base, then putting something live that isn’t 100% perfect is not the disaster it might seem. Lots of companies will even test their new features in live, making clear to their customers that this is a ‘beta’ version and asking that they give feedback. Code deployment is like any point of contact with your customers: of course it should be managed and considered, but it should not become something that paralyses you with fear, or that, in itself, makes the situation more fearful. Don’t be afraid to show customers your product before it’s 100% perfect – many customers will actually feel more attuned to your business if they feel that they’re consulted as part of the development of new products and services.

Allow yourself to take risks

Naturally, there are points at which this isn’t as straightforward as it sounds. Let’s say you work for a bank: you can’t just chuck new code into live without some concern for what kind of mess you might make; if you put live code that wipes out thousands of bank balances, of course you’re not going to be able to laugh that off. I’m not advocating that you sweep aside all governance and control. However, that doesn’t mean you can’t shrink the size of your deployments. There’s always the temptation to pile more into a release; resist it. Instead, shorten the route to live, making sure that small changes can find their way to production quickly, and that small teams can control their own routes to live without requiring the huge wheels of major organisational departments to creak into life first.
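What makes that shortened route safe is that a small release is as cheap to undo as it was to ship. A minimal sketch (version strings and the smoke check are illustrative assumptions, not a real deployment API):

```python
# Hypothetical short route to live with a cheap rollback: ship a small
# change, run a smoke check, and automatically step back one release
# if the check fails.

from typing import Callable

def deploy(version: str, releases: list[str],
           smoke_ok: Callable[[str], bool]) -> str:
    """Deploy `version`; roll back to the previous release if the
    smoke check fails. Returns the version left running."""
    releases.append(version)
    if not smoke_ok(version):
        # A small release means rollback is one step, not months of unpicking.
        releases.pop()
    return releases[-1]

history = ["v1.0"]
deploy("v1.1", history, lambda v: True)    # ships cleanly; v1.1 now live
deploy("v1.2", history, lambda v: False)   # smoke check fails; back on v1.1
```

Contrast this with the quarterly big-bang earlier: there, rolling back one bad change meant rolling back a million pounds’ worth of everything else.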

So, I went skiing the next day, pushed myself a bit harder and sure enough took a big face-plant into a snow-bank. I got a good cheer from my mates, we all had a good laugh about it, and nothing bad came of it. It didn’t hurt, not really. Allow yourself to take risks so that you fall over now and again: it’s not going to be as bad as you think, you’ll improve and go faster, and you’ll probably enjoy yourself a bit more as well.