Sue Lueder, Google
Outages and incidents happen. Sh*t breaks, fibers get cut, bugs get pushed to production, teams fail to communicate, and all hell breaks loose. But those who don't learn from mistakes are doomed to repeat them... over and over and over again, with increasing frustration for those on the frontlines fixing the problems and for the users who suffer the impacts.
In an effort to learn more from what happened across all products and services, Google launched an initiative in 2014 to gather data from all outages and incidents that occurred on production systems for trend analysis of system and user impacts, incident timelines, and root causes. The data is then used to drive improvements across systems, processes, and tools to strike a better balance between system stability and development velocity. This talk aims to share Google's approach to setting up and running such an analysis program, some preliminary results, and lessons learned.
View the full SREcon15 Program at https://www.usenix.org/conference/srecon15/program