Dancing with the flames

Developing large, long-lived software products in a state of continuous change comes with an unavoidable set of surprises, one category of which is a dense fauna of bugs and incidents. Depending on the nature of the application (e.g. a banking app vs. a cat classifier, B2B vs. B2C, project vs. product), the team topology and leadership (technical vs. non-technical), and ultimately where the product sits in its market (trying to find a spot vs. lying comfortably across four chairs), teams can choose to be more or less bug-averse and reliability-focused, which in turn dictates how they handle the unexpected.

There are a couple of strategies software teams can follow to make a platform more reliable:

  • Become great at avoiding incidents (high coverage unit testing, integration tests, QA phase)
  • Become great at dealing with incidents (observability, fast incident response, feature flagging, canary releases)

While both approaches matter and are best taken in combination to produce thorough reliability, I tend to see the former (avoiding issues) absorb most of the team's and business's energy, especially as teams grow and in the product space. Intuitively this makes sense: bugs and incidents can significantly hurt consumers' trust in the product's reliability, and painful experiences with a product are extremely hard to win back from, so most product teams want to avoid them at all costs.

In practice, the core trade-off of focusing on the first approach is a slower pace of development: seeking 100% test coverage can easily inflate pre-production time, and QA handoffs are notoriously slow even when well oiled, producing overly fortified corners (the ones we already know about).

What is probably most important, and least intuitive, is that our lack of almighty omniscience leaves teams with a lingering distrust of how prepared they can ever be for the unexpected. This disregards the nature of the beast: the unexpected can never be completely avoided. In fear of it, teams pile on more QA and more gated deployments and releases to cover more bases, all of which eat precious time in a competitive environment.

Accepting change as the status quo

Many moons ago I had the chance to see a talk in Melbourne on the concept of “anti-fragility” in the context of software engineering, which was an eye opener on the different ways you can choose to operate systems (particularly in a web setting, notorious for change).

The book this concept originated from, written by the economist Nassim Nicholas Taleb, is introduced as follows:

Some things benefit from shocks; they thrive and grow when exposed to volatility, randomness, disorder, and stressors and love adventure, risk, and uncertainty. Yet, in spite of the ubiquity of the phenomenon, there is no word for the exact opposite of fragile. Let us call it antifragile. Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better

The idea behind the concept is relatively simple: what doesn’t kill you makes you stronger. In the context of software:

  • Robust systems favour protection from incidents and prioritise reliability
  • Resilient systems adapt and recover quickly from incidents
  • Antifragile systems learn from incidents and get better because of them

[Image: AntifragileSoftware.png]

This was around the same time the concept of chaos engineering came along, spearheaded and exemplified by Netflix’s Chaos Monkey. It was a groundbreaking approach for the field, and a natural fit for a company like Netflix, where reliability matters a lot more than finding and developing market fit.
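To make the idea concrete, here is a minimal sketch of chaos-style fault injection in Python. It is only an illustration, not Netflix’s tooling: the "chaos" decorator, its parameters and the "CHAOS_ENABLED" environment variable are hypothetical names.

```python
import functools
import os
import random
import time

def chaos(failure_rate=0.05, max_delay_s=2.0):
    """Hypothetical chaos-style decorator: randomly injects latency or
    failures into a call, but only when explicitly enabled (never by default)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if os.getenv("CHAOS_ENABLED") == "1":
                if random.random() < failure_rate:
                    raise RuntimeError(f"chaos: injected failure in {fn.__name__}")
                # Otherwise inject some latency to surface timeout handling.
                time.sleep(random.uniform(0, max_delay_s))
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaos(failure_rate=0.1)
def fetch_recommendations(user_id: str) -> list[str]:
    # Placeholder for a real downstream call.
    return ["title-1", "title-2"]
```

The point is less the mechanism than the constraint: faults are injected deliberately, in a limited scope and only where explicitly enabled, so the team can observe how the system and its operators react.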

The concept of self-induced incidents caused by antifragile methods was so unique that it overshadowed the rest of what resilience and antifragility are all about. Many startups have attempted to replicate this methodology (most likely leading to costly AWS bills from self-induced DDoS attacks). However, the underlying principles of resilient and antifragile development can be applied much more broadly: by adapting to and learning from real-world incidents, software can become increasingly reliable in the face of what is actually likely to happen.

The art of dancing with the flames

Induced or not, incidents impacting customers will hurt the brand and the business. If we are going to learn from them, and let some defects reach production in order to improve reliability, teams must focus on creating a space and a framework where the impact of production incidents is minimised. By providing an arsenal of mitigation and resolution tools, and by fostering a culture where failure is part of the process of getting better, teams can minimise the impact of issues and gradually improve software quality, all while moving fast and efficiently thanks to lighter pre-production steps. Feature flags are a typical example of such a mitigation tool, as sketched below.
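Here is a minimal sketch, assuming a hypothetical in-memory flag store rather than any particular flagging product: the risky code path is wrapped behind a flag so it can be switched off in seconds during an incident, without a redeploy.

```python
class FeatureFlags:
    """Hypothetical in-memory flag store; a real setup would back this with a
    flagging service or a config store shared across instances."""

    def __init__(self, defaults: dict[str, bool]):
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        # The "kill switch": flip this during an incident instead of redeploying.
        self._flags[name] = False


flags = FeatureFlags({"new_checkout_flow": True})

def legacy_checkout(cart):
    return {"status": "ok", "path": "legacy"}

def new_checkout(cart):
    return {"status": "ok", "path": "new"}

def checkout(cart):
    if flags.is_enabled("new_checkout_flow"):
        return new_checkout(cart)   # risky new path, behind a flag
    return legacy_checkout(cart)    # known-good fallback
```

During an incident, calling flags.disable("new_checkout_flow") restores the known-good path immediately; canary releases apply the same idea at the deployment level by sending only a slice of traffic to the new version.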

For a bit of history, the philosophy of embedding quality assurance directly into the production process was popularised by Toyota with “Jidoka” (part of the Lean philosophy, which I believe also comes from Toyota), which aims to produce quality products while maintaining the pace (or, in the assembly game, the “takt time”).

The foundation of an effective incident response is a blame-free environment. Although it can be difficult to create or foster, as culture changes slowly, it is absolutely necessary to ensure that incidents are actually reported, properly resolved, and learned from as a group.

In the same vein, firefighting is a team effort: I would actually advise against dedicated “incident responders”, or making it a sport reserved for senior staff, and would instead advocate for incident response being a shared matter across the development team, regardless of seniority. The idea is to normalise change as part of the process of evolution, not an exception handled only by specialists. After a while, an incident-ready team will have developed a tight set of processes for mitigation, resolution and learning that feels as normal as any other process.

Observability ensures that the team can quickly and accurately diagnose and respond to incidents; it is essential to have visibility into what is happening in the system (there is nothing worse than watching an incident unfold with no tools to investigate it). This involves having the right logging and metrics tooling, as well as monitoring and alerting to detect incidents quickly. Fortunately there is plenty of tooling on this front (the topic is large enough that I hope to delve further into it in a separate post); a minimal sketch of the kind of instrumentation involved follows.
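The snippet below instruments a hypothetical request handler with a structured log event and a latency measurement using only Python’s standard library; the handler and field names are made up for illustration, and a real setup would ship these events to dedicated logging, metrics and alerting tools.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("payments")

def handle_payment(order_id: str, amount_cents: int) -> dict:
    """Hypothetical request handler instrumented with a structured log and timing."""
    request_id = str(uuid.uuid4())
    started = time.perf_counter()
    try:
        result = {"order_id": order_id, "charged": amount_cents}  # placeholder for real work
        status = "ok"
        return result
    except Exception:
        status = "error"
        raise
    finally:
        latency_ms = (time.perf_counter() - started) * 1000
        # One structured event per request: easy to search, graph and alert on.
        log.info(json.dumps({
            "event": "payment_handled",
            "request_id": request_id,
            "order_id": order_id,
            "status": status,
            "latency_ms": round(latency_ms, 2),
        }))
```

From events like these, dashboards and alerts (error rates, latency percentiles) are what actually shorten the distance between “something is wrong” and “we know what and where”.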

Finally, to actually benefit from this reliability strategy, rather than simply spending elsewhere the time saved on pre-production defences, teams must take the time to learn from these incidents and produce actionables from them. They should retrospect on "what went wrong" and "how can such issues be prevented or mitigated in the future". This is pivotal to the gradual progression of reliability; one lightweight way to capture it is sketched below.
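As an illustration of what “producing actionables” can look like, here is a minimal sketch of a post-incident record kept as data; the fields and the example incident are entirely hypothetical, and many teams keep the same information in a document template or an issue tracker instead.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False

@dataclass
class Postmortem:
    incident_id: str
    summary: str
    impact: str                      # who/what was affected, and for how long
    what_went_wrong: str
    what_went_well: str
    action_items: list[ActionItem] = field(default_factory=list)

# Hypothetical example entry.
pm = Postmortem(
    incident_id="checkout-outage-example",
    summary="New checkout flow returned errors for a short window.",
    impact="A fraction of checkout attempts failed during the window.",
    what_went_wrong="Flag rollout went to 100% without a canary step.",
    what_went_well="Kill switch restored the legacy flow within minutes.",
    action_items=[
        ActionItem("Add a small canary stage to flag rollouts",
                   owner="checkout-team", due=date(2024, 1, 15)),
    ],
)
```

Whatever the format, the value comes from the action items having owners and due dates, and from the record being revisited rather than filed away.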

Pick the right reliability strategy

As with pretty much everything, choosing a reliability strategy is highly contextual and should be thought through based on the elements I have mentioned, such as product/project type, market positioning, leadership and team topology, etc.

Automated tests, for example, are fundamental to effective software development (and not only as a safeguard against bugs!). What I advocate here is to carefully weigh the cost of incidents against the cost of the time and energy put into these processes, and to think a little counter-intuitively about how reliability could be produced:

  • Can the product/project afford the occasional in-production issue if the team is able to deliver more value faster? To what extent?
  • How much time do we spend preventing issues from happening in production? How often do incidents actually happen in production?
  • How confident is the team in releasing bug-free code? How confident is the team in its ability to mitigate and resolve an incident quickly? How long would it take?

Answering these questions will help define what type of strategy is best adapted. As a small insight from seeing and implementing different strategies and incentives to keep software reliable, I believe there is a sweet spot for the majority of software teams where, by being prepared for and adapting to real-world incidents, reliability improves gradually and continuously while development velocity stays at its peak.

Getting to efficient in-production reliability takes time and is most definitely a gradual process, but the confidence and safety a team builds while developing these processes is extremely powerful; the by-product of feeling safe and prepared for the worst that can happen is, ultimately, performance and team unity. Sounds pretty dreamy, right?


Thank you for reading this far! It's a first, so pardon any repetition. I can already tell this is getting lengthy, so I'd like to delve deeper into the practical side of implementing a good reliability strategy in a follow-up post, where I will write more about the arsenal of incident-ready teams and guiding questions to find the best processes in the context of a product team, and dive further into observability, tooling like feature flagging and canary releases, and processes like incident checklists.
