Delivery QA

Anyone out there ever hear of delivery QA? Neither have we.

When delivery was a manual process, there was a release plan (or checklist) that was executed once in a blue moon for the release. Months were required to set up the release plan, and the release itself was a rather traumatic event for all involved. In a world like this you didn’t need QA, you needed some sort of audit mechanism (like a checklist).

In today’s world of DevOps and delivery automation, the release plan is actually code. The problem is that, as we all know, code needs QA – but the code associated with a release doesn’t really go through any formal QA process (i.e. test plan, test harness, formal acceptance criteria).

In most cases the application sanity tests are assumed to check the deployment as well. Some organizations actually differentiate between types of sanity tests – the ones where you call Ops when they break vs. the ones where you call Dev. The problem with this approach is that as deployment architectures become more complex and deployment code grows, that code becomes more bug prone, and the lack of QA becomes a bigger and bigger issue.

If staging environments were completely equivalent to production, you could claim that the application testing done during staging should be sufficient: if staging successfully completes the application test plan, then the application has been deployed correctly. The problem is that, because of economic considerations, the production deployment is an environment similar to staging, but not a clone. Whenever things are different, you can be sure code will break, especially when there is no way to test.

Since there really isn’t a way to test delivery in production, the answer has to be Production Assurance: a way to compare and understand the differences between staging and production to ensure correct delivery without QA.


Don’t be fooled by the Hype of Release Automation

Executives are excited by the prospect that Application Release Automation will eliminate the problems associated with application delivery to production environments.

This excitement is anticipated by Gartner’s Hype Cycle, where DevOps and Application Release Automation are located close to the Peak of Inflated Expectations. This implies that the attributed utility is significantly higher than what will be delivered at the “Plateau of Productivity”. The belief that release automation will be the magic wand that will remove all gaps in delivery to production environments is unfounded. Two obstacles prevent this from happening:

  • Changes in the application require modifications to the automated release manifest. These will become much more frequent as a result of Agile and DevOps; and
  • Changes in the production environment similarly require modifications. This is the result of Virtualization.

Production Assurance is therefore a vital tool for verifying the correctness of delivery to the production environment.

Containers – Not Your Father’s Components

The title is taken from an 80’s commercial describing the new Oldsmobile line of cars.

Containers (and microservices) are the hot new keywords in the world of agile development and DevOps. For developers they bring up connotations of reusable components – sort of like object-oriented programming and design patterns. That is true – but not the real driver for containers. Containers solve delivery problems more than development problems. In fact, in the short term moving to containers can be quite uncomfortable for developers, and may require a complete re-architecting of the system.

Component-enabled delivery is similar to the platform and module approach to designing products, like cars and airplanes.

This requires much more upfront design to ensure that components are architected to be independent and reusable, like seats that can be used both in a Volkswagen and an Audi. On the other hand this type of modularity, if not designed correctly, can create inter-module dependencies that increase complexity, time-to-detect and time-to-resolve. It can also wreak havoc with non-functional aspects of the system.

So before embarking on a component strategy, decide if you are willing to pay the upfront costs associated with redesigning your code. The tradeoff is a better overall system that is more flexible and easier to deploy.


Here’s an IDEA – Identify, Decompose, Estimate, Act

One key element in production assurance is the timely detection of faults. Detection doesn’t just mean that there was some indication of a fault, but rather that the fault was accepted as something that needs to be fixed, as opposed to a mere notification that something is amiss. The reason we stress this is that false positives are the bane of the industry, causing users to ignore alerts. False positives are usually caused when a detection system finds an anomaly – but one that doesn’t really matter and can be ignored. Since the system doesn’t understand the context of its alerts and is focused on symptoms, it can’t tell the wheat from the chaff and issues too many alerts – causing “alert blindness”.

Another way to think about this is that if a detection system can’t relate “detect” to “resolve”, then the value of alerts is greatly diminished. One framework for relating detect to resolve is IDEA – Identify, Decompose, Estimate, Act:

  1. Identify – This is where systems usually excel – by using anomalies to identify problems, they in fact identify way too many potential problems.
  2. Decompose – This is breaking the problem into its components – a tree leading from the observed symptoms down to the root cause.
  3. Estimate – decide which branches of the tree should be investigated.
  4. Act – Fix the problem.
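As a rough illustration of the Decompose and Estimate steps, here is a minimal sketch that models a symptom tree and picks the branches worth investigating first. Everything in it is an assumption made for the example: the `Node` class, the `likelihood` scores and the symptoms themselves are hypothetical, not the output of any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node in a symptom tree: the root is the observed symptom,
    the leaves are candidate root causes."""
    name: str
    likelihood: float = 1.0  # assumed estimate (0..1) that this branch explains its parent
    children: list = field(default_factory=list)


def candidate_causes(node, path_score=1.0, path=()):
    """Decompose: walk the tree and yield (score, path) for every leaf."""
    score = path_score * node.likelihood
    path = path + (node.name,)
    if not node.children:
        yield score, path
    for child in node.children:
        yield from candidate_causes(child, score, path)


def estimate(root, top=2):
    """Estimate: return the `top` highest-scoring root-cause paths, best first."""
    return sorted(candidate_causes(root), reverse=True)[:top]


# Hypothetical symptom tree for a single alert.
tree = Node("checkout latency alert", children=[
    Node("app server slow", 0.3, [Node("GC pauses", 0.5),
                                  Node("thread pool exhausted", 0.5)]),
    Node("database slow", 0.6, [Node("missing index after deploy", 0.75),
                                Node("disk full", 0.25)]),
    Node("network", 0.1),
])

for score, path in estimate(tree):
    print(f"{score:.2f}  " + " -> ".join(path))
```

Multiplying likelihoods along a path is only one possible scoring rule; the point is that a single alert becomes a short, ranked list of branches rather than a pile of symptoms.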

Most people are surprised that most of the resources, e.g. tiger teams and war rooms, are spent on steps 1-3, and that “act” is usually the easiest step to accomplish once the preceding steps are done correctly.

The topology graph of production assurance minimizes the time needed for the first 3 steps. By indicating the exact processes involved in the problem, it reduces the time spent on IDEA to a minimum and assures that the root cause is fixed, not the symptoms.


DevOps, Production Assurance and Application Risk Governance

Deployment failures happen to everyone. There was an article a couple of years back that stated that 30% of app deployments fail – and things haven’t changed much since then. We can divide deployment failure into two main buckets – technical failure and business failure. Technical failure means the deployment broke the app and users can’t use the app as planned. Business failure means that the app works as expected, but it has a negative business impact.

DevOps is above the line in the chart, focused on techniques and technologies lowering deployment risk through mechanisms like shifting left of topologies as we described in a previous post – in the chart above that is hedge. DevOps also lowers deployment risk exposure by enabling small incremental deployments – in the chart above that is cover.

Production assurance focuses below the line on detect and resolve – making it easy to find and fix the inevitable application problems through topology and process anomaly tracking. Production assurance provides a baseline of normal topology behavior that enables rapid detection of application failure – in the chart above that is detect.

Production assurance also provides a feedback loop that pinpoints the processes or nodes that take part in the anomaly, making it easy to mitigate and fix the problem – in the chart above that is resolve.
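To make detect and resolve a little more concrete, here is a minimal sketch that compares observed connections against a baseline of normal topology behavior and lists the processes that take part in the anomaly. The function name, the edge representation and the process names are hypothetical, chosen only for illustration.

```python
def topology_anomalies(baseline, observed):
    """Compare observed process-to-process connections against a baseline.

    Both arguments are sets of (source, destination) pairs.
    """
    unexpected = observed - baseline  # connections never seen during normal operation
    missing = baseline - observed     # normal connections that have disappeared
    # Every process that takes part in an anomalous connection is a suspect.
    suspects = sorted({proc for edge in unexpected | missing for proc in edge})
    return unexpected, missing, suspects


# Hypothetical baseline and a hypothetical observation after a deployment.
baseline = {("web", "app"), ("app", "db"), ("app", "cache")}
observed = {("web", "app"), ("app", "db"), ("app", "legacy-db")}

unexpected, missing, suspects = topology_anomalies(baseline, observed)
print("unexpected:", unexpected)
print("missing:", missing)
print("suspects:", suspects)
```

A real baseline would presumably be learned over time and tolerate some noise; plain set differences are simply the smallest thing that shows the idea.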

Both DevOps and production assurance give organizations the confidence to use modern deployment techniques like A-B deployment to test different business scenarios and ensure that features are validated from a user perspective – lowering risk of business failure.

The Next Generation of Shift Left

We read an amazing article the other day that states that “Over 30% of Official Images in Docker Hub Contain High Priority Security Vulnerabilities”. If you think about it, that isn’t really surprising – it is probably true for all software, just much harder to actually quantify.

Addressing these types of issues is one part of using production assurance to minimize production risk.

The reason that containers are a good way to measure risk is that they have well-defined boundaries and each container implements a specific application service, making their separation of concerns easy to understand.

The same separation of concerns is what enables us to quantify the types of risk we address through production assurance.

  • The first risk for production assurance is the “container risk” for each container used in the application. This entails 2 types – functional risk (bugs in the code) and security risk (vulnerabilities in the code). But boundaries aren’t enough; you also need to understand the connections between services. Containers also provide what is needed to understand the service topology, as described in the post “Service discovery with Dockers“.
  • The second measure of risk for production assurance is “deployment risk” – the risk associated with incorrect configuration or artifact deployment that generates an incorrect service topology.
  • The third measure is “COPO risk” (the risk associated with corrective and preventative operations in production). If you have truly immutable infrastructure this risk is negligible, but deployment risk goes up because there is no “hotfix”, only redeployment.
  • The fourth risk is “cyber risk” – the risk that an external entity is exploiting vulnerabilities in your containers and/or services.

These four measures are a framework that provides a process for enabling production assurance feedback loops from Ops to Dev (i.e. the next generation of shift left).

Production Assurance: The Environment Chasm

As we wrote in an earlier post, DevOps tries to minimize the pain points associated with hand-offs from Dev to Ops. But there is another hand-off that causes a lot of pain – the promotion from environment to environment (e.g. from staging to production).

The reason is the “difference” between environments, or in other words the need to promote across the environment chasm. Most of the pain is caused by the differences in deployment topologies between the environments – so by minimizing that distance we lower both the cost of promotion and the probability of failure. Production assurance also lowers the cost of failure but we’ll talk about that in the next post.

  • Lowering promotion cost:
    Moving to a new environment requires multiple iterations until you get it right. Even if you use an automated release manager, human errors from incorrect policies and manifests still cause problems.
  • Reduced probability of failure:
    If the environments were clones – then promotion would be trivial.  Problems arise anytime there are differences. If the differences are simple – like additional nodes in a cluster, things are easy. On the other hand, if you have an application server and database server on the same node in one environment, but on different nodes in the second environment you are setting yourself up for promotion problems.
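The co-location case above can be checked mechanically. The sketch below is a minimal illustration, assuming each environment is described by a hypothetical mapping from component name to node name; the component and node names are made up.

```python
from itertools import combinations


def colocated_pairs(placement):
    """Return the set of component pairs that share a node.

    `placement` maps component name -> node name.
    """
    return {
        frozenset((a, b))
        for a, b in combinations(placement, 2)
        if placement[a] == placement[b]
    }


def colocation_differences(env_a, env_b):
    """Pairs of components that are co-located in exactly one of the two environments."""
    return colocated_pairs(env_a) ^ colocated_pairs(env_b)


# Hypothetical placements: same node in staging, separate nodes in production.
staging = {"app-server": "node1", "db-server": "node1", "cache": "node2"}
production = {"app-server": "node1", "db-server": "node2", "cache": "node3"}

for pair in colocation_differences(staging, production):
    print("co-location differs for:", sorted(pair))
```

Any pair reported here is a candidate for promotion problems, for example a connection string that assumes `localhost` in one environment only.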

Bottom line: By shifting left topologies you can make promotion less painful and more successful.


Application Delivery, Six Sigma and Production Assurance


We like to think about application delivery as if it is a pipeline from development to production. Each step in the pipe is usually a separate environment – e.g. a different topology, different configuration and a process for promotion from environment to environment. Just as many process failures are a result of failed hand-offs, many delivery problems are the result of misconfigurations in the promotion process.

Deployment topology is the mapping of components to systems and the dependencies between them. Deployment topology and configuration are the basis for how we measure difference between environments (with topology being much more important than configuration). Difference is influenced much more by the linkages between different element types than by repetition of equivalent elements, e.g. the relationship between application server, db server and their deployment topology is much more important than the number of nodes in an application cluster.
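One way to picture this kind of difference is to collapse each environment into links between element types, so that replica counts drop out and only new or missing kinds of links remain. The sketch below does just that; the data layout and all names are assumptions for illustration, not a description of an actual product.

```python
def type_level_edges(component_types, dependencies):
    """Collapse instance-level dependencies into dependencies between element types.

    component_types: instance name -> element type (e.g. "app-1" -> "app")
    dependencies:    iterable of (from_instance, to_instance) pairs
    """
    return {(component_types[src], component_types[dst]) for src, dst in dependencies}


def topology_difference(env_a, env_b):
    """Type-level links that are present in one environment but not the other."""
    return type_level_edges(*env_a) ^ type_level_edges(*env_b)


# Hypothetical environments, each a (component_types, dependencies) pair.
staging = (
    {"app-1": "app", "db-1": "db"},
    [("app-1", "db-1")],
)
production = (
    {"app-1": "app", "app-2": "app", "app-3": "app", "db-1": "db"},
    [("app-1", "db-1"), ("app-2", "db-1"), ("app-3", "db-1")],
)
production_with_cache = (
    {"app-1": "app", "db-1": "db", "cache-1": "cache"},
    [("app-1", "db-1"), ("app-1", "cache-1")],
)

# More app nodes of the same type: no type-level difference.
print(topology_difference(staging, production))
# A new kind of link (app -> cache): a real difference.
print(topology_difference(staging, production_with_cache))
```

Under this measure a larger application cluster counts as equivalent, while a single new link between element types is flagged, which mirrors the weighting described above.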

Applying the principles of six sigma calls for minimizing the variance between environments along the pipeline in order to minimize problems. So in a perfect world it would make sense for all environments to be identical to production – i.e. shift left. In most cases that just isn’t possible, so the trick is to make them as similar as possible – but in all cases the application topology must be identical.

Because the environments are different we can’t be certain of the integrity of the promotion – that is where assurance comes in – by providing the capability of ensuring the promotion process “did the right thing”. Assurance uses topological equivalence and configuration analysis to verify deployment topology across two consecutive environments.


Assume a Spherical Cow…

One school of thought is that DevOps exists to solve the problem of developer time wasted on creating environments. Since developers are such a scarce and expensive resource, it makes sense to subordinate operations to development.

Wrong! If you do that you recreate the problems inherent in the machine metaphor – each role with its own specific function.  An alternative is the “organic metaphor” where responsibility for the final deliverable is shared across the organization.

A great illustration of the problem with the machine metaphor is in Jez Humble’s post on “Elisabeth Hendrickson Discusses Agile Testing” where she discusses working at a product company which was suffering a series of quality problems. As a result, they hired a VP of QA who set up a QA division. The net result of this, counterintuitively, was to increase the number of bugs. One of the major causes of this was that developers felt that they were no longer responsible for quality, and instead focussed on getting their features into “test” as quickly as they could. Thus they paid less attention to making sure the system was of high quality in the first place, which in turn put more stress on the testers. This created a death spiral of increasingly poor quality, which led to increasing stress on the testers, and so on.

Similarly, by allowing development to ignore the needs of production environments, you are setting yourself up for development creating products that aren’t deliverable.

That is why we named this post “spherical cow” – creating a solution that is beautiful in theory – but doesn’t work in practice.

OpsLess Agile

The basic tenet of Scrum is that each team has available (as part of the team) all the resources needed to get the job done. The team also defines the work items, and how they will work. So far so good.

In the real world agile teams have developers, customers and product people, but lack a key component – delivery. It doesn’t matter why, but it ensures that whatever the team develops won’t be optimised for delivery – or even worse, it won’t be deliverable.

Delivery must be a part of every agile team – the problem is that there just aren’t enough delivery people to get them involved in every sprint for every team. Of course, you can always hire a lot more delivery folks – but that just won’t happen.

The solution is to make delivery a pig in every sprint 0 (planning sprint). They will plan and validate the tools and scripts to be built in support of the team. They will also provide important input for time boxing and the order of feature development. They should also be part of every sprint review demo – just like customers. In a way delivery is a customer – they are the customer for the team’s artifacts and their job is to deliver them to the real customers.

Otherwise – the DevOps wall of confusion will ensure that agile will only cause more delivery problems, not solve them.