Building a Failure Detection System in Splunk

In this blog series, we have been exploring how finding blind spots in your data and security visibility helps in four areas: time, coverage, scheduling and failure detection. Previously, we looked at how to detect anomalies in time, coverage and scheduling that can undermine our ability to deliver consistent monitoring results. Now let’s wrap up the series by discussing failure detection.

Some would argue that it is best to use an external system to detect failures in a monitoring system. That is probably the best approach, but it is not always feasible. In the end, it is up to the owner to decide what works best for them. The bottom line is that we need a way to make sure all of our processes are working, and a way to be alerted when they are not.

When should you build a system for failure detection?

Normally, when you finish a project with identified outcomes
In the case of monitoring blind spots, once those monitors are in place, you build detection to find out when the monitors themselves stop working

How often do we check for failure detection?

It depends on the severity of the detection and the schedule of the underlying detection
It is also contingent on the bottom line, i.e. how a failure will impact the business and operations

Where do you check for failure detection?

Within Splunk itself, which is the approach used here
The best way to do it is to create a dashboard, alerts, or both
- For example, include all of the searches used to monitor for failures, or save the dashboard searches as reports (so you can easily grab their status)

Read on as I show how you can create your own "failure detection" dashboard.

How to create your failure detection dashboard

I borrowed a search from the Monitoring Console: Runtime Statistics panel to create my dashboard.
The primary difference is that I created a lookup (saved_searches.csv) with all my search names related to my detections. This way, I have one dashboard to look at only my items. I could also use the same search to create an alert on any field of concern.
NOTE: The CSV contains one column named “savedsearch_name”. When the lookup is used in the search, it adds a “savedsearch_name = <my saved search name>” condition for every saved search listed, which filters out all the other searches I am not interested in.
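To make the NOTE above concrete, here is a minimal Python sketch of what the lookup contributes to the search: one “savedsearch_name” condition per row, OR-ed together. This is only an illustration of the filtering idea, not the search from my dashboard. The saved search names are made up, the function names are hypothetical, and the exact text Splunk generates may be formatted differently.

```python
import csv
import io


def load_search_names(csv_text):
    """Read the savedsearch_name column from lookup-style CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["savedsearch_name"] for row in reader if row["savedsearch_name"]]


def expand_filter(names):
    """Build one OR-ed condition per saved search name (illustrative format)."""
    terms = ['savedsearch_name="{}"'.format(n.replace('"', '\\"')) for n in names]
    return "(" + " OR ".join(terms) + ")"


# Hypothetical contents of saved_searches.csv; these names are assumptions.
LOOKUP_CSV = "savedsearch_name\nDetect - Time Skew\nDetect - Missing Hosts\n"

names = load_search_names(LOOKUP_CSV)
search_filter = expand_filter(names)
print(search_filter)
```

Any saved search that is not listed in the CSV never matches the filter, which is why the dashboard only shows my own detections.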

reference #1

reference #2

reference #3

Don’t forget to include the new saved search in your lookup.
Modify the search to create an alert:

reference #4

reference #5

You may want to add a trigger action, such as an email to notify someone to investigate.
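As a rough sketch of the kind of condition such an alert could check, the Python below flags any saved search from the lookup that never ran, never succeeded, or had skipped runs in the time window. It assumes each scheduler event carries a “savedsearch_name” and a “status” field with values such as "success" and "skipped"; the names, sample events and thresholds are hypothetical, so adjust them to the fields you actually see in your environment.

```python
from collections import Counter


def find_failures(expected_names, scheduler_events):
    """Return {search name: reason} for searches that need attention."""
    # Start every expected search with an empty tally so that searches
    # with no events at all are still reported.
    runs = {name: Counter() for name in expected_names}
    for event in scheduler_events:
        name = event["savedsearch_name"]
        if name in runs:
            runs[name][event["status"]] += 1

    problems = {}
    for name, statuses in runs.items():
        if not statuses:
            problems[name] = "no runs in window"
        elif statuses["success"] == 0:
            problems[name] = "no successful runs"
        elif statuses["skipped"] > 0:
            problems[name] = "skipped runs: {}".format(statuses["skipped"])
    return problems


# Hypothetical lookup contents and scheduler events for illustration only.
expected = ["Detect - Time Skew", "Detect - Missing Hosts", "Detect - Late Events"]
events = [
    {"savedsearch_name": "Detect - Time Skew", "status": "success"},
    {"savedsearch_name": "Detect - Time Skew", "status": "skipped"},
    {"savedsearch_name": "Detect - Missing Hosts", "status": "skipped"},
    {"savedsearch_name": "Some Other Search", "status": "success"},
]

failures = find_failures(expected, events)
for name, reason in failures.items():
    print(name, "->", reason)
```

The "no runs in window" case is the one that is easiest to miss: a search that silently stops running produces no events, so it only shows up if you compare the results against the list of searches you expect to see.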

YOUR RETURN ON INVESTMENT

To recap, failure detection is important from security, operations, management, and compliance perspectives because it offers the following:

Peace of mind knowing expected outcomes can be delivered
Confidence in systems
Improved adoption of products and services through delivering consistent results

We will be providing more tips and expertise in future blogs!

Finding Blind Spots in your Data & Security Visibility | Blog Series - Part 4 - "Failure Detection"