A nice little nugget of a problem was handed to me today: identify ways to help an operations team reduce their system maintenance / deployment window [for production system updates] that has somehow grown to require a xx-hour window, and achieve zero downtime (or as close as possible).
The environment is complicated in the extreme: a highly regulated industry, compliance requirements, clustered servers, high availability, PCI security zones, third-party software/service providers, cloud service providers/integrations (SaaS and PaaS), frequent commercial software upgrades/patches, vendor constraints on database schema changes, disaster recovery dependencies, and a legion of upstream and downstream data integration dependencies.
For the last year I've been carefully planting seeds of certain ideas in various conversations with key stakeholders within an organization - to begin the gradual introduction of concepts and practices such as DevOps, Continuous Deployment, and Continuous Operations. Now that a sufficient level of pain has been experienced, there is a broad consensus and acceptance that there needs to be change.
"He was not in a hurry, 'hurry' being one human concept he had failed to grok at all. He was sensitively aware of the key importance of correct timing in all acts - but with the Martian approach: correct timing was accomplished by waiting."
- Stranger in a Strange Land, by Robert A. Heinlein
I have some ideas, but as a good researcher, my first order of business is to review current directions, trends, and peer articles. This post will be a place for me to share some of the information that may be of interest to others:
Zero Downtime, Instant Deployment and Rollback
http://www.ebaytechblog.com/2013/11/21/zero-downtime-instant-deployment-and-rollback/
Jevgeni Kabanov (ZeroTurnaround)
Pragmatic Continuous Delivery, at W-JAX 2012
http://vimeo.com/79959315
Continuous Operations for Zero Downtime Deployments
http://www.virtualizationpractice.com/continuous-operations-for-zero-downtime-deployments-22680/
The Virtualization Practice
http://www.virtualizationpractice.com/
Deploying the Netflix API
http://techblog.netflix.com/2013/08/deploying-netflix-api.html
Cloud Architecture Tutorial
Constructing Cloud Architecture the Netflix Way
Gluecon May 23rd, 2012, by Adrian Cockcroft
http://www.slideshare.net/adrianco/netflix-architecture-tutorial-at-gluecon
Cassandra in the Netflix Architecture, Denis Sheahan
CassandraEU London March 28th, 2012
http://www.slideshare.net/acunu/cassandra-eu-2012-netflixs-cassandra-architecture-and-open-source-efforts
Patterns for Continuous Delivery, Reactive, High Availability, DevOps and Cloud Native Open Source with Netflix OSS
Adrian Cockcroft + Ben Christensen, YOW! Workshop Dec'2013
https://speakerdeck.com/adrianco/patterns-for-continuous-delivery-reactive-high-availability-devops-and-cloud-native-open-source-with-netflixoss
Best Practices for Zero Risk, Zero Downtime Database Maintenance
http://www.oracle.com/us/products/database/311390-133499.pdf
VMware vSphere High Availability 5.0 Deployment Best Practices
http://www.vmware.com/files/pdf/techpaper/vmw-vsphere-high-availability.pdf
Free Ebook: Continuous Delivery — What It Is and How to Get Started
http://info.puppetlabs.com/download-free-continuous-delivery-ebook.html
The Phoenix Project, A Novel About IT, DevOps & Helping Your Business Win
http://www.amazon.com/Phoenix-Project-DevOps-Helping-Business/dp/0988262592/
How Draw Something Scaled to 50 million New Users, in 50 Days, with Zero Downtime
http://www.infoq.com/presentations/games-scalability-omgpop
I Ain't Afraid of No Downtime: Scaling Continuous Deployment, by Cody Powell
http://www.codypowell.com/taods/2012/04/i-aint-afraid-of-no-downtime-scaling-continuous-deployment.html
Mandi Walls free ebook, Building a DevOps Culture [Kindle]
http://www.amazon.com/Building-DevOps-Culture-Mandi-Walls-ebook/dp/B00CBM1WFC
Daily Dose of DevOps: 27 People to Follow on Twitter
http://puppetlabs.com/blog/daily-dose-devops-27-people-follow
Selected QCON 2013 San Francisco presentations:
Adopting Continuous Delivery, Adjusting your Architecture
Rachel Laycock, ThoughtWorks
http://qconsf.com/system/files/presentation-slides/Adopting%20Continuous.pdf
Build Your Own PaaS the Netflix Way
Sudhir Tonse, Manager, Cloud Platform Infrastructure, Netflix
http://qconsf.com/system/files/presentation-slides/BuildYourOwnPaaSTheNetflixWay-QConSF.pdf
Facebook Infrastructure
Pedro Canahuati, Director, Infrastructure Operations
http://qconsf.com/system/files/presentation-slides/ScalingtheOperationsOrganizationatFacebook.pdf
Tools:
Liquibase:
- Improved checksum performance
- CORE-1509: Significantly decreased memory usage, especially with large sql files
- CORE-1533: Performance improvements in dropAll
ZeroTurnaround's LiveRebel:
log4j2:
- "Log4j 2 can automatically reload its configuration upon modification"
- "Log4j 2 contains next-generation Asynchronous Loggers based on the LMAX Disruptor library. In multi-threaded scenarios Asynchronous Loggers have 10 times higher throughput and orders of magnitude lower latency than Log4j 1.x"
- Note the performance benchmark results recently posted on takipiblog.com
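For the configuration-reload point, here is a minimal sketch of what a log4j2.xml might look like, assuming the `monitorInterval` attribute described in the Log4j 2 configuration manual. The 30-second interval, the appender name, and the pattern are my own illustrative choices, not recommendations:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- monitorInterval: how often (in seconds) Log4j 2 checks this file for changes.
     The value 30 is an arbitrary example. -->
<Configuration status="warn" monitorInterval="30">
  <Appenders>
    <!-- "Console" is just an illustrative appender name -->
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{ISO8601} %-5level %logger{36} - %msg%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- Changing this level in the file should take effect without a restart -->
    <Root level="info">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```

The appeal for a zero-downtime effort is that a log level can be raised or lowered on a running system by editing the file, rather than scheduling a restart inside the maintenance window.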
Puppet Labs:
- PuppetConf 2014, September 23-24 – San Francisco