Published August 17, 2019
by Doug Klugh

Software Screening

Manage the risk of releasing new software by first releasing it to a small user population, then gradually rolling it out to the entire user community.  If problems are detected, users can be quickly rerouted to the old version.  In addition to functional testing, canary releases can facilitate capacity testing in a production environment, backed by a dependable rollback strategy.  When the quality of the new version is not in question, canaries can also be used for A/B testing of hypotheses about user engagement.  Avoid using canary releases on critical systems that cannot tolerate failure.
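The rollout-and-rollback decision described above can be sketched in a few lines of Python.  This is only an illustration: the ramp steps, the error-rate threshold, and the function name are assumptions made for the example, not values taken from any particular platform.

```python
# Hypothetical rollout schedule: percent of users on the new version per stage.
RAMP_STEPS = [5, 25, 50, 100]


def next_canary_percent(current: int, error_rate: float,
                        max_error_rate: float = 0.01) -> int:
    """Return the traffic percentage to use for the next rollout stage.

    Returns 0 (full rollback to the old version) when the observed error
    rate exceeds the threshold; otherwise returns the next larger step in
    RAMP_STEPS, or 100 once the rollout is complete.
    """
    if error_rate > max_error_rate:
        return 0  # reroute everyone back to the old version
    for step in RAMP_STEPS:
        if step > current:
            return step
    return 100


print(next_canary_percent(0, error_rate=0.0))    # first stage
print(next_canary_percent(5, error_rate=0.002))  # healthy, so widen the rollout
print(next_canary_percent(25, error_rate=0.08))  # unhealthy, so roll back
```

In practice the error rate would come from your monitoring system, and the threshold would depend on how much failure the system can tolerate.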

What's With the Name?

Canary releases are named after the practice of coal miners who once carried canaries into the mines as an early warning of toxic gases.  Dangerous gas levels would kill the canary before affecting the miners, signaling them to evacuate the mine.  Likewise, canary releases are intended to signal us to pull a release out of production when things start to go wrong.

Canaries with Microsoft Azure

Canary releases can be achieved on a variety of cloud platforms and services.  The example (below) illustrates how this website (DevLead.io) uses Deployment Slots within Azure App Services to direct user sessions to one of two deployment slots, named devlead and devlead-Stage1, based on the Traffic %.  In this example, Stage1 serves as the canary, with 5% of the users being routed to that deployment.  From there, I can gradually increase the Traffic % for Stage1, or click the Swap button to swap Stage1 into the Production slot and immediately move all remaining users to the new version.

If you are using canaries for A/B testing, you can add slots to accommodate testing multiple configurations at once (perhaps A/B/C/D testing).  The number of deployment slots available depends on your App Service plan.  The lowest tier that supports deployment slots provides 5 staging slots (in addition to the Production slot), and the number goes up from there.
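To make percentage-based routing across slots concrete, here is a minimal Python sketch that assigns sessions to slots in proportion to their weights.  It does not show how Azure implements Traffic %; the hashing scheme and the function name are assumptions, and the devlead-Stage2 and devlead-Stage3 slot names are hypothetical.

```python
import hashlib
from typing import Dict


def pick_slot(session_id: str, weights: Dict[str, int]) -> str:
    """Map a session to a slot in proportion to the integer weights.

    The same session_id always lands on the same slot while the weights
    are unchanged, so a user does not flip between versions mid-session.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one slot needs a positive weight")
    # Stable hash of the session id, reduced to a bucket in [0, total).
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for slot, weight in weights.items():
        upper += weight
        if bucket < upper:
            return slot
    raise AssertionError("unreachable: bucket is always below total")


# The 95/5 canary split described above.
two_way = {"devlead": 95, "devlead-Stage1": 5}

# A hypothetical A/B/C/D configuration using two extra slots.
four_way = {"devlead": 70, "devlead-Stage1": 10,
            "devlead-Stage2": 10, "devlead-Stage3": 10}

counts: Dict[str, int] = {}
for i in range(10_000):
    slot = pick_slot(f"session-{i}", two_way)
    counts[slot] = counts.get(slot, 0) + 1
print(counts)  # roughly 95% devlead, 5% devlead-Stage1
```

Hashing the session identifier, rather than drawing a random number per request, is what keeps each user pinned to one version for the duration of the test.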

Canaries with Amazon Web Services

Route 53 Traffic Flow can be used for canary releases by routing traffic on a weighted round-robin basis to various AWS services, including EC2 instances, Elastic Load Balancing (ELB) load balancers, CloudFront (CF) distributions, Elastic Beanstalk (EB) environments, API Gateways, VPC endpoints, or S3 buckets.  It can also be used to route traffic outside of AWS.  However, since Route 53 is a Domain Name System (DNS) service, resolvers cache its answers for the configured Time-to-live (TTL).  A long TTL would therefore inhibit the ability to quickly redirect traffic back to the old environment if issues were discovered, or even to quickly direct traffic to the new environment.
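The two ideas in this paragraph, weighted shares and TTL-bound caching, can be illustrated with a little arithmetic.  The Python sketch below is a simplified model, not a Route 53 API call: the function names and the record labels "old" and "new" are assumptions, and it assumes resolvers honor the TTL exactly.

```python
from typing import Dict


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of DNS answers per record: weight / sum of weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one record needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


def seconds_until_refresh(cached_at: float, ttl: int, now: float) -> float:
    """Seconds a resolver may keep serving an answer it has already cached."""
    return max(0.0, float(cached_at + ttl - now))


# Hypothetical weights: 95 for the old environment, 5 for the canary.
share = traffic_share({"old": 95, "new": 5})
print(share["new"])

# A resolver cached the canary's address at t=0 and we roll back at t=10.
# With a 300-second TTL it may keep sending users to the canary far longer
# than with a 60-second TTL.
print(seconds_until_refresh(cached_at=0, ttl=300, now=10))
print(seconds_until_refresh(cached_at=0, ttl=60, now=10))
```

This is why a short TTL on the weighted records matters if you intend to rely on DNS for rollback.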

If you don't need to route traffic on a strict weighted round-robin basis, you can avoid the DNS issues by using Auto Scaling Groups (ASGs) with an Elastic Load Balancing (ELB) load balancer to split traffic between two or more ASGs (one being the canary).  Spin up an ASG with an Amazon Machine Image (AMI) containing the new software version and register the new ASG with the load balancer.  Once the canary has proven itself, drain the connections from the old instances, terminate them, and you're done.  You can approximate a weighted round-robin approach through the number and ratio of old to new instances you have running.
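As a rough illustration of approximating weights with instance counts, the Python sketch below splits a fleet into old and new instances for a target canary fraction.  It assumes the load balancer spreads requests evenly across all registered instances, which is a simplification; the function name is hypothetical.

```python
from typing import Tuple


def canary_instances(total_instances: int,
                     canary_fraction: float) -> Tuple[int, int]:
    """Split a fleet into (old, new) instance counts.

    The result approximates canary_fraction while always keeping at least
    one old and one new instance in service.
    """
    if total_instances < 2:
        raise ValueError("need at least two instances to run a canary")
    if not 0 < canary_fraction < 1:
        raise ValueError("canary_fraction must be between 0 and 1")
    new = max(1, round(total_instances * canary_fraction))
    new = min(new, total_instances - 1)
    return total_instances - new, new


# With 20 instances, a 5% canary is exactly one instance.
print(canary_instances(20, 0.05))

# With only 10 instances, the smallest possible canary is one instance,
# so the effective share is 10% rather than the requested 5%.
old, new = canary_instances(10, 0.05)
print(old, new, new / (old + new))
```

The second call shows the main limitation of this approach: the granularity of the split is bounded by the size of the fleet.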