Deployment and Rollback Strategies for Cloud Applications
Release Guidelines:
- Define mandatory rules for production releases by setting up rollback alarms. A rollback is triggered during deployment if any one of the alarms fires.
- Implement the rules as a set of lambda functions that roll back services at each stage of the rollback process.
- Adopt the rules across all pipelines over a defined period of time.
- Quarantine pipelines that have not adopted the rules.
- Use a break-glass approach so that pipelines can still support critical releases.
- Classify pipelines, and let pipelines that do not directly impact customers opt out.
- Run pre-production testing before the canary deployment.
- Promote deployments with canary releases and traffic shifting to target a limited number of customers and test the impact of the release.
- Run soak tests to validate the canary deployment.
- Use synthetic traffic to mimic customer experiences and to validate the metrics and rollback alarms.
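The canary and alarm guidelines above can be sketched as a small loop. This is a minimal illustration, not a real deployment API: the function name run_canary, the traffic percentages, and the shift_traffic and alarm callables are all hypothetical stand-ins for whatever the pipeline tooling provides.

```python
from typing import Callable, Sequence


def run_canary(
    steps: Sequence[int],
    alarms: Sequence[Callable[[], bool]],
    shift_traffic: Callable[[int], None],
) -> str:
    """Shift traffic step by step and roll back as soon as any alarm fires.

    steps         -- hypothetical traffic percentages, e.g. [10, 50, 100]
    alarms        -- callables that return True when their alarm is firing
    shift_traffic -- callable that routes the given percentage to the new version
    """
    for percent in steps:
        shift_traffic(percent)
        # Any single firing alarm is enough to trigger the rollback.
        if any(alarm() for alarm in alarms):
            shift_traffic(0)  # send all traffic back to the previous version
            return "rolled_back"
    return "promoted"
```

In practice a soak period would sit between shifting traffic and checking the alarms; it is left out here to keep the sketch short.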
Classifying Pipelines:
- Customer-impacting pipelines
- Non-customer-impacting pipelines
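One way to combine this classification with the opt-out, break-glass, and quarantine rules from the release guidelines is a simple decision function. The sketch below is an assumption about how the rules might interact; the Pipeline fields, the order of the checks, and the returned labels are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    CUSTOMER = "customer-impacting"
    NON_CUSTOMER = "non-customer-impacting"


@dataclass
class Pipeline:
    name: str
    impact: Impact
    has_rollback_alarms: bool = False  # has the pipeline adopted the rules?
    break_glass: bool = False          # approved override for a critical release


def release_decision(pipeline: Pipeline) -> str:
    """Return a hypothetical release decision for a pipeline."""
    if pipeline.impact is Impact.NON_CUSTOMER:
        return "allow (opted out)"
    if pipeline.has_rollback_alarms:
        return "allow"
    if pipeline.break_glass:
        return "allow (break glass)"
    return "quarantine"
```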
Setting Standard Metrics for the Rollback Approach:
Anomaly detection with standard service metrics:
- Anomaly detection using HTTP 5XX fault-code metrics
- Anomaly detection using HTTP 4XX error-code metrics
- Anomaly detection using traffic latency
Anomaly detection with standard instance metrics:
- Anomaly detection using CPU utilization
- Anomaly detection using disk utilization
- Anomaly detection using memory utilization
Anomaly detection with standard runtime metrics:
- Anomaly detection using heap utilization
- Anomaly detection using garbage collection metrics
- Anomaly detection using thread metrics
Training Anomaly Detectors Prior to Deployment:
Prior to deployment, check the metrics generated over a 3-hour window and train the anomaly detector on their average and standard deviation so that it behaves as expected. Roll back the deployment when an alarm threshold is breached.
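A minimal sketch of such a detector, assuming the pre-deployment samples are already available as a list of numbers. The function names and the band width k (the number of standard deviations) are illustrative assumptions, not values taken from the guidelines.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def train_band(samples: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Build an expected band of mean +/- k standard deviations.

    samples -- metric values collected before deployment (e.g. a 3-hour window)
    k       -- assumed band width in standard deviations
    """
    mu = mean(samples)
    sigma = pstdev(samples)  # population standard deviation of the window
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value: float, band: Tuple[float, float]) -> bool:
    """Return True when a post-deployment value falls outside the trained band."""
    low, high = band
    return value < low or value > high
```

A rollback alarm could then fire whenever is_anomalous returns True for a metric observed during the deployment.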
Rollback Using Traffic Drop:
Use anomaly detection to catch a sustained drop in traffic after a release, and trigger a rollback based on the impact of the event.
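"Sustained" can be made concrete by requiring several consecutive low readings before acting, so that a single dip does not trigger a rollback. The following is a sketch under assumed parameters: the drop_ratio and min_consecutive defaults are hypothetical and would need tuning per service.

```python
from typing import Sequence


def sustained_drop(
    traffic: Sequence[float],
    baseline: float,
    drop_ratio: float = 0.5,
    min_consecutive: int = 3,
) -> bool:
    """Return True if traffic stays below the threshold for enough readings in a row.

    baseline        -- expected request volume before the release
    drop_ratio      -- assumed fraction of baseline that counts as a drop
    min_consecutive -- assumed number of consecutive low readings required
    """
    threshold = baseline * (1 - drop_ratio)
    run = 0
    for value in traffic:
        # Extend the run on a low reading; reset it on a normal one.
        run = run + 1 if value < threshold else 0
        if run >= min_consecutive:
            return True
    return False
```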
Rollback Using a Combination of Multiple Anomaly Detectors:
A rollback can be triggered based on a combination of anomaly detectors instead of a single anomaly detector.
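One simple way to combine detectors is to require a minimum number of them to fire at the same time. This is only one possible policy, shown as a sketch; the detector names and the min_firing default are hypothetical.

```python
from typing import Mapping


def should_roll_back(signals: Mapping[str, bool], min_firing: int = 2) -> bool:
    """Return True when at least min_firing detectors report an anomaly.

    signals -- detector name mapped to whether it is currently firing
    """
    firing = sum(1 for is_firing in signals.values() if is_firing)
    return firing >= min_firing
```

For example, a latency anomaly alone would not trigger a rollback under this policy, but a latency anomaly together with a 5XX anomaly would.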
Rollback Using Fault Spike:
Trigger a rollback on a spike in HTTP 5XX and 4XX errors.
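A spike can be defined relative to the error rate seen before the release. The sketch below assumes request and error counts are available for a post-deployment window; the function name and the factor default are hypothetical.

```python
def fault_spike(
    count_5xx: int,
    count_4xx: int,
    total_requests: int,
    baseline_rate: float,
    factor: float = 3.0,
) -> bool:
    """Return True when the combined 5XX and 4XX error rate exceeds the baseline.

    baseline_rate -- error rate (0.0 to 1.0) observed before the release
    factor        -- assumed multiple of the baseline that counts as a spike
    """
    if total_requests == 0:
        return False  # no traffic means no error rate to compare
    error_rate = (count_5xx + count_4xx) / total_requests
    return error_rate > baseline_rate * factor
```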