A nice little nugget of a problem was handed to me today: identify ways to help an operations team reduce its system maintenance/deployment window for production system updates, which has somehow grown to xx hours, and achieve zero downtime (or as close to it as possible).
The environment is complicated in the extreme: a highly regulated industry, compliance requirements, clustered servers, high availability, PCI security zones, third-party software and service providers, cloud service providers and integrations (SaaS and PaaS), frequent commercial software upgrades and patches, vendor constraints on database schema changes, disaster recovery dependencies, and a legion of upstream and downstream data integration dependencies.
For the last year I've been carefully planting seeds of certain ideas in conversations with key stakeholders within the organization, to begin gradually introducing concepts and practices such as DevOps, Continuous Deployment, and Continuous Operations. Now that a sufficient level of pain has been experienced, there is broad consensus that things need to change.
"He was not in a hurry, 'hurry' being one human concept he had failed to grok at all. He was sensitively aware of the key importance of correct timing in all acts — but with the Martian approach: correct timing was accomplished by waiting."I have some ideas, but as a good researcher, first order of business is to review current directions, trends, peer articles. This posting will be a place for me to share some of the information that may be of interest to others:
- Stranger in a Strange Land, by Robert E. Heinlein
Zero Downtime, Instant Deployment and Rollback
Jevgeni Kabanov (ZeroTurnaround)
Pragmatic Continuous Delivery, at W-JAX 2012
Continuous Operations for Zero Downtime Deployments
The Virtualization Practice
Deploying the Netflix API
Cloud Architecture Tutorial
Constructing Cloud Architecture the Netflix Way
Gluecon May 23rd, 2012, by Adrian Cockcroft
Cassandra in the Netflix Architecture, Denis Sheahan
CassandraEU London March 28th, 2012
Patterns for Continuous Delivery, Reactive, High Availability, DevOps and Cloud Native Open Source with Netflix OSS
Adrian Cockcroft + Ben Christensen, YOW! Workshop, Dec 2013
Best Practices for Zero Risk, Zero Downtime Database Maintenance
VMware vSphere High Availability 5.0 Deployment Best Practices
Free Ebook: Continuous Delivery — What It Is and How to Get Started
The Phoenix Project, A Novel About IT, DevOps & Helping Your Business Win
How Draw Something Scaled to 50 million New Users, in 50 Days, with Zero Downtime
I Ain't Afraid of No Downtime: Scaling Continuous Deployment, by Cody Powell
Mandi Walls' free ebook, Building a DevOps Culture [Kindle]
Daily Dose of DevOps: 27 People to Follow on Twitter
Selected QCON 2013 San Francisco presentations:
Adopting Continuous Delivery, Adjusting your Architecture
Rachel Laycock, ThoughtWorks
Build Your Own PaaS the Netflix Way
Sudhir Tonse, Manager, Cloud Platform Infrastructure, Netflix
Pedro Canahuati, Director, Infrastructure Operations
- Improved checksum performance
- CORE-1509: Significantly decreased memory usage, especially with large sql files
- CORE-1533: Performance improvements in dropAll
- "Log4j 2 can automatically reload its configuration upon modification"
- "Log4j 2 contains next-generation Asynchronous Loggers based on the LMAX Disruptor library. In multi-threaded scenarios Asynchronous Loggers have 10 times higher throughput and orders of magnitude lower latency than Log4j 1.x"
- Note the performance benchmark results recently posted on takipiblog.com