At Sensu Summit 2018 I gave a talk on the subject, which was well received. I was asked to write a blog post expanding on it, since you can only get into so much detail in 30 minutes. When I finished my initial draft it was 17 pages long, and I knew it would not fit in a single post. I am breaking it up into a series and will update this page as each part is published on the Sensu blog. I believe the cadence will be one post every week or two until we reach the end of the scoped content.
While much of the content is specific to Sensu, there are a lot of valuable concepts about designing a monitoring and alerting system even if you are not using or evaluating Sensu. For a primer, watching the talk and going through the slides is a great place to start. I would then read part 1, as it helps set the stage. Beyond that you can skip around, although some of the specifics may be confusing because later parts assume some prior reading.
Here is the link to the original talk and slides:
- Alert fatigue, part 1: avoidance and course correction
- Alert fatigue, part 2: alert reduction with Sensu filters and token substitution
- Alert fatigue, part 3: automating triage & remediation with check hooks & handlers
- Alert fatigue, part 4: alert consolidation
- Alert fatigue, part 5: fine-tuning & silencing