A nice little nugget of a problem was handed to me today: identify ways to help an operations team reduce their system maintenance / deployment window [for production system updates] that has somehow grown to require a xx-hour window, and achieve zero downtime (or as close as possible).
The environment is complicated in the extreme: highly regulated industry, compliance requirements, clustered servers, high availability, PCI security zones, third-party software/service providers, cloud service providers/integrations (SaaS and PaaS), frequent commercial software upgrades/patches, vendor constraints on database schema changes, disaster recovery dependencies, and a legion of upstream and downstream data integration dependencies.
For the last year I've been carefully planting seeds of certain ideas in conversations with key stakeholders within an organization, to begin the gradual introduction of concepts and practices such as DevOps, Continuous Deployment, and Continuous Operations. Now that a sufficient level of pain has been experienced, there is broad consensus that change is needed.
"He was not in a hurry, 'hurry' being one human concept he had failed to grok at all. He was sensitively aware of the key importance of correct timing in all acts — but with the Martian approach: correct timing was accomplished by waiting."
- Stranger in a Strange Land, by Robert A. Heinlein
I have some ideas, but as a good researcher, my first order of business is to review current directions, trends, and peer articles. This posting will be a place for me to share some of the information that may be of interest to others:
Zero Downtime, Instant Deployment and Rollback
http://www.ebaytechblog.com/2013/11/21/zero-downtime-instant-deployment-and-rollback/
Jevgeni Kabanov (ZeroTurnaround)
Pragmatic Continuous Delivery, at W-JAX 2012
http://vimeo.com/79959315
Continuous Operations for Zero Downtime Deployments
http://www.virtualizationpractice.com/continuous-operations-for-zero-downtime-deployments-22680/
The Virtualization Practice
http://www.virtualizationpractice.com/
Deploying the Netflix API
http://techblog.netflix.com/2013/08/deploying-netflix-api.html
Cloud Architecture Tutorial
Constructing Cloud Architecture the Netflix Way
Gluecon May 23rd, 2012, by Adrian Cockcroft
http://www.slideshare.net/adrianco/netflix-architecture-tutorial-at-gluecon
Cassandra in the Netflix Architecture, Denis Sheahan
CassandraEU London March 28th, 2012
http://www.slideshare.net/acunu/cassandra-eu-2012-netflixs-cassandra-architecture-and-open-source-efforts
Patterns for Continuous Delivery, Reactive, High Availability, DevOps and Cloud Native Open Source with Netflix OSS
Adrian Cockcroft + Ben Christensen, YOW! Workshop, Dec 2013
https://speakerdeck.com/adrianco/patterns-for-continuous-delivery-reactive-high-availability-devops-and-cloud-native-open-source-with-netflixoss
Best Practices for Zero Risk, Zero Downtime Database Maintenance
http://www.oracle.com/us/products/database/311390-133499.pdf
VMware vSphere High Availability 5.0 Deployment Best Practices
http://www.vmware.com/files/pdf/techpaper/vmw-vsphere-high-availability.pdf