This week, I attended a couple of great conferences and talked with some of the brightest minds in DevOps this side of the world, first at Velocity Conference in Santa Clara and then at DevOps Days.
Game Days – Coming Soon to a Webmaker Near You
A few of my friends gave talks at those conferences, and the one that stood out to me covered something I lived through. Dylan Richard, our boss as Director of Engineering at Obama for America 2012, gave a talk about building resilience in applications, infrastructure, and teams. That is precisely the subject we’ve talked about a lot at Mozilla Foundation lately.
In the last two weeks, the Mozilla Foundation Webmaker team and community have really thrown down some amazing stuff. If you haven’t seen it, you should check it out! The Webmaker team pushed over 130 times in about three weeks, between staging and production, and along the way there were, predictably and inevitably, bumps. A couple of apps went down here or there, and imperfections plagued the initial hours of the launch.
That’s no moon
But the organization made a great decision to go for a soft launch, so while there was definitely imperfection, we had a leg to stand on when setting expectations and getting the community to buy into a little bump in the road. By the way, thank you, community!
What allowed us to push so incredibly fast, to move so quickly, was the acceptance of some failure along the way. Don’t get me wrong, there is a time and a place for testing. I want problems caught in staging. But I am a realist, and I have been watching IT Operations in infrastructure and applications for fifteen years.
Failures are coming, friends. We will always prevent them when we can. But at light speed, a lot can go wrong, and that is more than OK. That experience will make you strong in the force. Buckle in.
I did not berate anyone for the mistakes they made, and others did not berate me for the mistakes I made along the way. (Jon “Dev and Ops” Buckley [@jbuckca], sorry for the loaded footgun, and amazing recovery of popcorn in the outage I caused.)
Instead, we learned how we failed, and we learned what that looked and felt like. We have monitoring up, and we see logs. We see logs of changes, so we know when we did what, and we can see what each change is doing, all the way from the app to the infrastructure.
We have color charts to show us, really quickly, how badly we messed anything up.
Developers have an incredibly tight feedback loop and, as some of my devs often argue, too much power. That power, however, also comes with all the links and tools needed for situational awareness.
I’ve shared every process and trick with the Webmaker team, in the form of screencasts, live videoconference/screenshares of production pushes, and backout plans documented in shared etherpads. That has come in handy, as multiple times we’ve had to roll back or rapidly repatch and refix. The thing is, though… we’ve KNOWN about it quickly, and we’ve responded to it quickly. We’ve tried a ton of ways of doing it along the way.
While we built the new fancy bits you see at https://webmaker.org/, we also built our ability to move fast, and our resilience as a team. We learned to move fast, and to break things gracefully and safely, as an entire team. We gathered together in a foxhole for a sprint, and we did it visibly and transparently.