Come, paint my World … in Blue/Green colours: introducing Blue/Green deployment in my current project
Blue/Green is not a new technique but is now becoming the industry standard for deploying software. In this post I am going to describe how it is implemented on my current project.
Agenda:
- Quick intro and benefits
- Chosen method
- Implementation
- Testing and migration
- The End
Quick intro and benefits
The Blue/Green deployment approach is a release process that is well worth having. It involves switching traffic between two identical environments running different versions of the application. The blue environment represents the current application version serving production traffic. In parallel, the green environment is staged running a different version of your application. Both environments have their own set of resources and do not affect each other. After the green environment is ready and tested, production traffic is redirected from blue to green. If any issues are encountered, you can roll back by reverting traffic to the blue environment.
Benefits of Blue/Green:
- eliminates or minimizes downtime
- reduces deployment risk via simple rollback
- in most cases, gives an opportunity to validate what was deployed: if you discover that the green environment is not operational, there is no impact on the blue environment, so you can decide not to switch traffic to green. If the switch has already been made, you can easily route traffic back, minimizing impaired operation or downtime and limiting the blast radius of the impact
- in AWS, blue/green deployments also provide cost optimization benefits: the unused environment is stopped, so you do not have to run an overprovisioned architecture for an extended period of time
- fits well with continuous integration and continuous deployment (CI/CD) workflows
- generally, less risk and less stress
Chosen method
There are a number of ways to implement Blue/Green deployment in a project. What fits one project may not suit another. After some brainstorming with the Solution and Technical Architects, we came to the conclusion that the “DNS Routing” method fitted our use case best. It is a very common technique that uses DNS to switch network traffic from the blue to the green environment, and back again when a rollback is necessary. If a problem is encountered while releasing, you can either decide not to switch traffic to the green branch or roll back by updating the DNS entry to point at the blue environment again.
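The idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the DNS zone is a plain dictionary, and the hostnames and the `switch_traffic` helper are hypothetical names invented for illustration, not anything from our project.

```python
from typing import Dict, Optional

# Hypothetical hostnames, for illustration only.
LIVE_NAME = "app.example.com"
ENVIRONMENTS = {
    "blue": "blue.internal.example.com",
    "green": "green.internal.example.com",
}


def switch_traffic(records: Dict[str, str], colour: str) -> Optional[str]:
    """Point the live DNS name at `colour`.

    Returns the colour that was live before the switch (or None if the
    previous target was not one of the known environments), so the caller
    can use it for a rollback.
    """
    if colour not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {colour!r}")
    previous_target = records.get(LIVE_NAME)
    records[LIVE_NAME] = ENVIRONMENTS[colour]
    for name, target in ENVIRONMENTS.items():
        if target == previous_target:
            return name
    return None


if __name__ == "__main__":
    zone = {LIVE_NAME: ENVIRONMENTS["blue"]}
    previous = switch_traffic(zone, "green")  # release: blue -> green
    print("live target after release:", zone[LIVE_NAME])
    if previous is not None:
        switch_traffic(zone, previous)  # rollback: green -> blue
    print("live target after rollback:", zone[LIVE_NAME])
```

The point of the sketch is that release and rollback are the same operation with different arguments, which is what makes the rollback so cheap.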
Implementation
The DNS routing method is simple to implement for Blue/Green deployment; however, DNS TTLs can add painful complexity. DNS servers may take so long to refresh their caches that the switchover between environments is delayed. To mitigate this, we have another layer that remains unchanged: DNS still points to the same CNAME of an ELB, but proxy servers redirect network traffic to the appropriate environment once the new application version is deployed. This is transparent to customers.
Nginx on the proxy instances uses the Amazon Route 53 (R53) DNS servers as its resolver and refreshes the upstream address every 60 seconds. This is achieved by setting the resolver directive in the http section and defining a variable with the set directive in the server section. Because proxy_pass is given a variable, nginx resolves the name at request time through the configured resolver instead of only once at startup. You can force a refresh of the upstream address by reloading the nginx configuration. You can read more at https://tenzer.dk/nginx-with-dynamic-upstreams/. Below is the nginx configuration snippet that resides on our proxy instances:
Nginx.conf
Http section
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    access_log /var/log/nginx/access.log;
    # Resolve names via the instance's first DNS server and cache answers for 60 seconds
    resolver {{ ansible_dns.nameservers[0] }} valid=60s;
    …….
    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
Server section
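# Passing a variable to proxy_pass makes nginx look the name up at request time using the resolver above
set $upstream_endpoint https://{{ wfeservices_upstream }};
location / {
    proxy_pass $upstream_endpoint;
    …..
}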
The address behind $upstream_endpoint is refreshed dynamically: a DNS change can take up to 60 seconds to be picked up, or it is picked up on an nginx reload. With this configuration, when the DNS record is updated, nginx notices the change and redirects traffic to the newly deployed application.
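To illustrate why the switch can lag by up to 60 seconds, here is a simplified Python model of a resolver that caches answers for a fixed validity period. It is a sketch of the caching behaviour only, not of how nginx is implemented; the `CachingResolver` class, the hostname and the IP addresses are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Callable, Dict, Tuple


class CachingResolver:
    """Caches each answer for `valid` seconds, loosely like `resolver ... valid=60s`."""

    def __init__(self, lookup: Callable[[str], str], valid: float = 60.0) -> None:
        self._lookup = lookup  # function that asks the "real" DNS
        self._valid = valid
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry)

    def resolve(self, name: str, now: float) -> str:
        """Return the cached address if still valid, otherwise look it up again."""
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]
        address = self._lookup(name)
        self._cache[name] = (address, now + self._valid)
        return address

    def reload(self) -> None:
        """Drop all cached answers, analogous to the effect of an nginx reload."""
        self._cache.clear()


if __name__ == "__main__":
    name = "upstream.example.internal"  # hypothetical upstream record
    dns = {name: "10.0.1.10"}  # currently the blue environment
    resolver = CachingResolver(dns.__getitem__, valid=60)

    print(resolver.resolve(name, now=0))   # looked up: blue address
    dns[name] = "10.0.2.10"                # DNS record switched to green
    print(resolver.resolve(name, now=30))  # still the cached blue address
    print(resolver.resolve(name, now=60))  # cache expired: green address
```

Calling `reload()` right after the DNS change would make the very next `resolve` return the green address, which mirrors forcing the refresh with an nginx reload.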
Switching DNS records is handled by a separate Terraform module (the envswitching module). It is a very simple module, responsible only for managing DNS entries, and looks like:
This module also manages live-proving URLs, which the client uses for validation. The client can validate and confirm that new features work as expected in production before the switch is completed.
The envswitching module makes its decision based on the “active” variable, which is passed to Terraform as one of the command-line arguments. Based on that value, it activates the chosen deployment branch and points the live-proving URLs at the unused branch in preparation for the next deployment. Below is a snippet of the shell script that passes the variables to Terraform and runs it:
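As a rough illustration of that decision logic (in Python, and not the actual script or module from the project), the sketch below derives which branch gets live traffic and which gets the live-proving URLs, and builds a Terraform command line. Only the `active` variable name comes from the description above; the function names, the returned dictionary keys and the use of `-auto-approve` are assumptions made for the example.

```python
from typing import Dict, List

BRANCHES = ("blue", "green")


def plan_switch(active: str) -> Dict[str, str]:
    """Return which branch serves live traffic and which serves live-proving URLs."""
    if active not in BRANCHES:
        raise ValueError(f"active must be one of {BRANCHES}, got {active!r}")
    unused = "green" if active == "blue" else "blue"
    return {"live": active, "live_proving": unused}


def terraform_apply_command(active: str) -> List[str]:
    """Build (but do not run) a terraform command that passes `active` as a variable."""
    plan_switch(active)  # validates the value before anything is executed
    # -auto-approve is an assumption: a pipeline step cannot answer the prompt.
    return ["terraform", "apply", "-auto-approve", f"-var=active={active}"]


if __name__ == "__main__":
    print(plan_switch("green"))
    print(" ".join(terraform_apply_command("green")))
```

In a real pipeline step, the returned list could be handed to `subprocess.run(command, check=True)` from the directory containing the Terraform configuration.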
As I mentioned earlier, running infrastructure in the cloud has one great benefit: it helps manage cost by auto scaling instances to match actual demand. If you do not need a resource, you just delete it. AWS on-demand resources allow you to stop paying for failed or unused green environment resources by simply releasing them. The shell script snippet below is a great example: it sets the auto scaling groups on the unused branch to 0 to keep costs to a minimum. Because of the complexity involved, it was easier to do this outside Terraform, in a shell script:
After putting everything together, I had to update the old pipelines (production, development, master, sandbox, performance) to reflect the new Blue/Green technique. The production pipeline has changed from:
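The scale-down step can be sketched as follows. This is a Python illustration rather than the project's shell script: it only builds the AWS CLI commands for the auto scaling groups of every branch except the active one, and does not execute them. The group names and the `scale_down_commands` helper are hypothetical, and setting minimum, maximum and desired capacity all to 0 is one plausible reading of "sets the groups to 0".

```python
from typing import Dict, List


def scale_down_commands(
    asg_names_by_branch: Dict[str, List[str]], active: str
) -> List[List[str]]:
    """Build AWS CLI commands that scale every non-active branch's groups to zero."""
    if active not in asg_names_by_branch:
        raise ValueError(f"unknown active branch: {active!r}")
    commands: List[List[str]] = []
    for branch, names in sorted(asg_names_by_branch.items()):
        if branch == active:
            continue  # never touch the branch that serves production traffic
        for name in names:
            commands.append([
                "aws", "autoscaling", "update-auto-scaling-group",
                "--auto-scaling-group-name", name,
                "--min-size", "0",
                "--max-size", "0",
                "--desired-capacity", "0",
            ])
    return commands


if __name__ == "__main__":
    # Hypothetical group names, for illustration only.
    groups = {"blue": ["wfe-blue"], "green": ["wfe-green", "api-green"]}
    for command in scale_down_commands(groups, active="blue"):
        print(" ".join(command))
```

Building the commands separately from running them makes it easy to print them in a dry run before anything is scaled down.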
To:
Introducing Blue/Green allows zero-downtime deployment for non-breaking changes. Of course, if a DB schema change or another critical (breaking) change is required, we have to deploy the old-fashioned way, with a maintenance page.
Testing and migration
Before introducing Blue/Green in production, I had to test it thoroughly in development. It had not previously been possible to fully test the production pipeline, because it consists of a slightly different set of steps and additional scripts compared with the development pipeline. It was clear that I had to adapt the production pipeline, and the scripts it uses, to work on the development account. That was the most time-consuming part. Once it was done, I could thoroughly test both the production pipeline and the migration to the Blue/Green approach on the development account. I would say that the migration was the easiest part of the work, because most of it was done by Terraform.
The End
Introducing Blue/Green gave us a way to release many new features without downtime and more reliably. It extends our continuous integration approach and moves us closer to being able to put software into production rapidly. As we work with the Kanban methodology, it means we are now able to deploy most changes to production in a very short time and with less risk of failure. Finally, I would like to thank technical architect (Dawid S.) and solution architect (Kamil P.) for their supervision and attention to detail in this venture, and Daryl P. for help in writing this post. Thanks guys!
Author information
- LinkedIn profile
- GitHub account
- Email szewczyk.christopher[at]gmail.com