Peter Willert
Break Free from Friday: A Guide to Reliable Releases

Introduction

Software releases are challenging, and their timing can be crucial. A recurring question among software engineers is, "Do I want to release this now? On a Friday?" Some have developed a certain angst about deploying on the last day of the working week. Because if things go wrong, they need fixing. It might be a quick code change, but it could also be an all-hands-on-deck situation. On a Friday at 5 PM, when half the team has already checked out. You don’t want that. Your team doesn’t want that. Not to mention the customers who might rely on your systems Friday evening to get their job done.

But why start with Friday? If Fridays are scary, where do we draw the line? Thursday? Things can go wrong Thursday evening as well. Wednesday morning then? If you cut out all odd moments you will release every other Wednesday when Mercury is in retrograde. You get my point. Friday should be as safe to deploy and release as any other day.

Drawing from exceptional incidents like CrowdStrike's ginormous release that took down a ton of Windows machines worldwide, I'm going to dive into strategies and thoughts around release stability, the nuances of release versus deployment, and the indispensable role of feature flags, canary rollouts, and fast rollback mechanisms. These recommendations have served me well in the past, and Friday releases have not been scary.

I'm going to look at some building blocks to have at hand and go through some best practices. The fifth entry of best practices will surprise you! - I hope not, but let's see.

The building blocks

Clarification: The Release vs. Deploy Dichotomy

First things first, let's distinguish between "release" and "deploy." People, including myself sometimes, tend to use these terms interchangeably. Deployment is the act of moving code to a production environment, while a release is when these changes become available to users. It's a subtle but crucial difference.

As an engineer, you might move code to production multiple times without users noticing anything new — this is deployment. The moment you activate those changes for users, that's the release. Decoupling these processes can drastically reduce risks and give you control over the user experience.

The Case for Feature Flags

Feature flags are a game-changer in managing releases and the risks involved. They allow you to toggle features on and off, often without even redeploying code, which gives you fine-grained control over what users see. If something goes wrong, you can switch the feature off with little effort.

There are tons of solutions for feature flags out there. You can find third-party solutions or libraries for any stack. Firebase, for example, has native feature flags, or you can implement them on your own.
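To make the "implement them on your own" option concrete, here is a minimal sketch of a homegrown flag check. Everything in it is made up for illustration: the FlagStore class, the new_checkout flag name, and the JSON payload standing in for whatever config source you would actually poll at runtime.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory flag store; a real one would read from a config service or database."""
    flags: dict = field(default_factory=dict)

    def load(self, raw_json: str) -> None:
        # Replace all flags at once, e.g. after fetching fresh config at runtime.
        self.flags = dict(json.loads(raw_json))

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a typo or missing entry never releases a feature.
        return bool(self.flags.get(name, False))


def checkout(store: FlagStore) -> str:
    # Both code paths are deployed; the flag decides which one is released.
    if store.is_enabled("new_checkout"):
        return "new checkout flow"
    return "old checkout flow"


store = FlagStore()
store.load('{"new_checkout": false}')
print(checkout(store))  # the new code is deployed but not released
store.load('{"new_checkout": true}')
print(checkout(store))  # released by a config change, no redeploy
```

Defaulting unknown flags to "off" is a deliberate choice in this sketch: the safe path is the one users already know.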

Benefits of Feature Flags:

One caveat should be mentioned: Implementing these switches, aka flags, for your features might require you to have multiple versions of your features in your codebase at the same time. Depending on the change, this could be smooth sailing or a maintenance challenge. Alongside this, you want to have a strategy to remove the flags when they become obsolete.
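One possible removal strategy, sketched under my own assumptions: give every flag an owner and a "remove by" date, and let a small check (in CI, for example) list the ones that are overdue. The FlagInfo fields, the registry entries, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagInfo:
    name: str
    owner: str
    remove_by: date  # agreed date by which the flag and the old code path should be gone


def overdue_flags(registry: list, today: date) -> list:
    # Return the names of flags whose removal date has passed, oldest first.
    late = [f for f in registry if f.remove_by < today]
    return [f.name for f in sorted(late, key=lambda f: f.remove_by)]


# Hypothetical registry; in practice this could live next to the flag definitions.
REGISTRY = [
    FlagInfo("dark_mode", "web", date(2025, 1, 15)),
    FlagInfo("new_checkout", "payments", date(2024, 9, 1)),
]
```

Calling overdue_flags(REGISTRY, date.today()) in a scheduled job and posting the result to the owning team is one way to keep obsolete flags from piling up.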

Environments for Intervention: Canary Rollouts and Fast Rollbacks

Limiting exposure to potential problems by starting with a small audience reduces the risk of faulty code having a huge impact. Yes, there can be errors, but only 0.1% of the users will witness them.
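A common way to pick that small audience is deterministic bucketing: hash a stable user identifier and enable the change only for users whose hash falls below the rollout percentage. The sketch below assumes string user IDs and a hypothetical in_canary helper; it illustrates the idea and is not taken from any specific rollout tool.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls into the first `percent` percent of 10,000 buckets."""
    # Hashing feature and user together gives each feature its own canary group.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable value in 0..9999
    return bucket < percent * 100  # e.g. 0.1 percent covers roughly 10 of 10,000 buckets


# Start tiny, then widen the audience once the metrics look healthy.
for percent in (0.1, 1, 10, 100):
    exposed = sum(in_canary(str(uid), "new_checkout", percent) for uid in range(10_000))
    print(f"{percent}% rollout exposes {exposed} of 10000 users")
```

Because the bucket depends only on the user and the feature, a user who was in the 0.1% group stays in the group as the percentage grows, so nobody flips back and forth between old and new behavior.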

Canary Rollouts

Fast Rollbacks

Alright, we’ve done all the stuff. Code was reviewed, automated tests passed, code is on staging, and we're good. The canary rollout didn’t yield any issues either. But boy oh boy, we’re live and things are wild! What now?

These strategies allow you to act quickly, ensuring that even if something does go wrong, you can mitigate the impact efficiently.
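One ingredient of a fast rollback is simply knowing what the last good version was and being able to return to it in one step. Here is a toy sketch of that bookkeeping; ReleaseHistory and the version strings are hypothetical, and a real setup would trigger an actual redeploy of the previous artifact instead of returning a string.

```python
from typing import Optional


class ReleaseHistory:
    """Tracks which version is live so that stepping back is a single call."""

    def __init__(self) -> None:
        self._versions = []

    def deploy(self, version: str) -> None:
        # The most recently deployed version is the live one.
        self._versions.append(version)

    @property
    def live(self) -> Optional[str]:
        return self._versions[-1] if self._versions else None

    def rollback(self) -> str:
        # Refuse to roll back past the first deployment: there is nothing to fall back to.
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()
        return self._versions[-1]


history = ReleaseHistory()
history.deploy("2024.07.18-1")
history.deploy("2024.07.19-1")
print("rolled back to", history.rollback())
```

The point of the sketch is that the rollback target is decided before the incident, not during it.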

Best Practices for Safe Releases

And now something handy to take action. To mitigate risks with Friday releases, and releases on any other day, explore these strategies. Each of them should put you in a better position to trust your releases and sleep calmly at night.

  1. Automated Testing: Implement comprehensive automated tests to catch issues early
  2. Staging Environments: Conduct thorough testing in staging environments that closely mimic production
  3. CI/CD: Implementing a Continuous Integration/Delivery pipeline is essential for efficient and reliable releases. By automating the build, test, and deployment processes you can reduce human error, accelerate delivery times, and increase release frequency.
  4. Monitoring and Alerts: Set up monitoring systems to detect issues immediately. The metrics mentioned in the feedback loop part are a good starting point here.
  5. Documentation and Playbooks: Maintain detailed documentation and predefined playbooks for handling common issues. I can’t stress playbooks enough. They are the simplest form of asynchronous guidance a team can have.
  6. Communication and visibility: Talk and share what you are doing. That doesn't mean an @channel is in order for every deployment. But if your peers know what's going on, be it a new release or maybe a rollback, they can act accordingly.
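To illustrate the monitoring and alerts item, here is a small sketch of an error-rate check over a sliding window of recent requests. The ErrorRateMonitor class, the window size, and the 5% threshold are assumptions picked for the example; in practice your monitoring system would own this logic.

```python
from collections import deque


class ErrorRateMonitor:
    """Keeps the outcome of the last `window` requests and flags a high error rate."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self._results = deque(maxlen=window)  # old entries fall out automatically
        self._threshold = threshold

    def record(self, ok: bool) -> None:
        self._results.append(ok)

    def error_rate(self) -> float:
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noise from the first few requests.
        window_full = len(self._results) == self._results.maxlen
        return window_full and self.error_rate() > self._threshold
```

Hooked up to the canary group, a check like this can tell you within a handful of requests whether to widen the rollout or reach for the playbook.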

Conclusion

Releasing on a Friday can be risky, but it is manageable with the right strategies. Every step you take after your code is written can give you a better idea of whether your code is good or not. Automated testing, decoupling deployment from releasing, using feature flags, implementing canary rollouts, and having fast rollback mechanisms in place are all building blocks for minimizing potential impact.

I want to provide you with ideas, not a fully comprehensive guide. Some teams don’t have a staging environment, and that’s on purpose. They develop and test locally and push to a production environment; flags, canaries, and metrics will help them find issues.

I’m curious. What’s your experience? Got any good stories where things went wrong? Tell me and let's laugh together!

On a personal note:
I once nearly invalidated all memcached keys of Germany's biggest social network. There was no cloud and no way to scale; we would have been under heavy load, and the page would have been unusable for weeks. This was caught by a colleague reviewing my code. Phew! You are reviewing code, right?