One key element of production assurance is the timely detection of faults. Detection doesn’t just mean that there was some indication of a fault; it means the fault was accepted as something that needs to be fixed, as opposed to a mere notification that something is amiss. We stress this because false positives are the bane of the industry and cause users to ignore alerts. False positives usually arise when a detection system finds an anomaly that doesn’t really matter and can be ignored. Because the system doesn’t understand the context of its alerts and focuses on symptoms, it can’t tell the wheat from the chaff and issues too many alerts, causing “alert blindness”.

Another way to think about this is that if a detection system can’t relate “detect” to “resolve”, the value of its alerts is greatly diminished. One framework for relating detect to resolve is IDEA (Identify, Decompose, Estimate, Act):

  1. Identify – This is where systems usually excel. By using anomalies to identify problems, they in fact flag far too many potential problems.
  2. Decompose – Break the problem into its components: a tree leading from the observed symptoms down to the root cause.
  3. Estimate – Decide which branches of the tree should be investigated.
  4. Act – Fix the problem.
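
The Decompose and Estimate steps can be illustrated with a minimal sketch. It assumes a symptom tree whose leaves are candidate root causes and whose branches each carry a likelihood score. The `Node` class, the `rank_root_causes` function, the example fault names, and the scores are all hypothetical and chosen only for illustration; the framework itself does not prescribe how branches are scored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One node in the tree: an observed symptom or a candidate cause."""
    name: str
    likelihood: float = 1.0  # hypothetical score in [0, 1], relative to the parent
    children: List["Node"] = field(default_factory=list)


def rank_root_causes(root: Node) -> List[Tuple[str, float]]:
    """Estimate step: score each leaf by the product of likelihoods on its path."""
    ranked: List[Tuple[str, float]] = []

    def walk(node: Node, score: float) -> None:
        score *= node.likelihood
        if not node.children:  # a leaf is a candidate root cause
            ranked.append((node.name, score))
        for child in node.children:
            walk(child, score)

    walk(root, 1.0)
    # Highest-scoring branches are the ones to investigate first.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Decompose step: the observed symptom at the top, candidate causes below it.
tree = Node("checkout latency alert", children=[
    Node("database slow", 0.6, [
        Node("missing index", 0.7),
        Node("disk saturated", 0.3),
    ]),
    Node("network congestion", 0.4),
])

for name, score in rank_root_causes(tree):
    print(f"{name}: {score:.2f}")
```

With these made-up scores, "missing index" ranks first, so that branch would be investigated before the others.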

Most people are surprised that most resources, e.g. tiger teams and war rooms, are spent on steps 1-3, and that “act” is usually the easiest step to accomplish once the preceding steps are done correctly.

The topology graph of production assurance minimizes the time needed for the first three steps. By indicating the exact processes involved in the problem, it reduces the time spent on IDEA to a minimum and ensures that the root cause is fixed, not the symptoms.
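
As a rough sketch of how a topology graph can separate root causes from symptoms, the example below assumes the graph is available as a simple mapping from each process to the processes it depends on. It treats an alerting process as a root-cause candidate only if none of its own dependencies are alerting. The `root_cause_candidates` function, the process names, and this heuristic are assumptions made for illustration, not a description of any particular product's method.

```python
from typing import Dict, List, Set


def root_cause_candidates(depends_on: Dict[str, List[str]],
                          alerting: Set[str]) -> List[str]:
    """Return alerting processes none of whose dependencies are alerting.

    An alerting process with an alerting dependency is treated as a symptom
    of a problem further down the graph.
    """
    return sorted(
        proc for proc in alerting
        if not any(dep in alerting for dep in depends_on.get(proc, ()))
    )


# Hypothetical topology: each process maps to the processes it depends on.
depends_on = {
    "web": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["db"],
    "inventory": ["db"],
    "db": [],
}

# Four alerts fire, but three of them are downstream symptoms.
alerting = {"web", "checkout", "payments", "db"}

print(root_cause_candidates(depends_on, alerting))  # prints ['db']
```

Here four alerts collapse to a single process to fix, which is the reduction in Identify, Decompose, and Estimate effort that the topology graph is meant to provide.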