How drift causes database deployments to go off-track

TL;DR

Drift occurs when the source control baseline diverges from the state of production, e.g. when someone makes a change directly to the database instead of using the delivery pipeline. If left unchecked, this can ultimately lead to release failure.

One of the most important aspects of adopting automated database delivery is keeping test environments (and, by extension, source control) in sync with the Production environment.

Ensuring that our dev and pipeline assets remain a faithful representation of Production is critical: if a deployment worked during development, we want to be confident that it will also work when released to Production.

Enterprise factors

But, over the years, I've noticed a number of aspects of the way enterprise IT environments work that can cause database deployments to go off track:

  • Last-minute changes: it's not uncommon for features to be abandoned entirely, or at least modified to some degree, at the last minute prior to a Production release. Without proper cleanup, this can leave a significant amount of schema and data drift behind in lifecycle environments (e.g. column additions that are never subsequently removed), particularly in Staging environments that are not refreshed often
  • Unplanned maintenance: Unforeseen circumstances can lead to the occasional need for emergency fixes to be applied to the live database (e.g. a performance index added by the DBA, or a file group/partition configuration change made to the Production database only). Sometimes database changes caused by maintenance activities can even be by design.
  • Legacy database factors: Database schemas tend to be long-lived assets, so battles with unresolved tech debt can cause an endless supply of headaches. A common trap that developers fall into is discovering upon Production release that an identity column's current value doesn't match that of the development environment. Or, when altering a column, finding out that a system-named constraint has a different (automatically assigned) name in every environment (a query to spot these follows this list).
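
For that last trap, a heads-up doesn't need anything fancy. Here's a minimal query sketch, assuming SQL Server (whose catalog views expose an is_system_named flag), that lists the constraints whose auto-generated names will differ from one environment to the next:

    -- Constraints with names assigned by the engine rather than by a developer:
    SELECT OBJECT_SCHEMA_NAME(parent_object_id) AS [schema],
           OBJECT_NAME(parent_object_id)        AS [table],
           name                                 AS [constraint],
           type_desc
    FROM sys.default_constraints WHERE is_system_named = 1
    UNION ALL
    SELECT OBJECT_SCHEMA_NAME(parent_object_id), OBJECT_NAME(parent_object_id), name, type_desc
    FROM sys.check_constraints WHERE is_system_named = 1
    UNION ALL
    SELECT OBJECT_SCHEMA_NAME(parent_object_id), OBJECT_NAME(parent_object_id), name, type_desc
    FROM sys.key_constraints WHERE is_system_named = 1;

Anything that comes back is a candidate for being given an explicit name in a migration before it trips up a deployment.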

Over time, slowly but surely, each of these everyday events within a system's lifecycle can cause our "source of truth" to drift from reality: as our source control baseline starts to differ from the actual state of Production, the danger of release failure grows with each passing day.

What can go wrong will go wrong

Here are a few archetypal examples of what can happen when the target database and the source control baseline drift apart (and yes, they've all happened to me):

  • The ALTER COLUMN failure: A developer writes a migration to change the precision of an existing column. Since the source control baseline was created, said column has had a performance index added to it, outside of the pipeline. While the migration succeeds during dev testing, the script fails when deployed to Production because the SQL engine does not allow a column to be changed while other objects are bound to it (a repro sketch follows this list).
  • The inadvertent permission drop: A developer writes a change to a natively-compiled stored procedure, which requires that the procedure be re-created. Little do they know, a permission was recently granted to a service managed by another team. As the permission is not in source control, the deployment causes the service to immediately lose its access to the procedure.
  • The quiet overwrite: A developer writes an UPDATE statement to change the name of a product record as part of a rebranding release. What they don't know is that, when the data was originally inserted, a different ID was assigned to the record in Production. Their update therefore ends up overwriting the name of an unrelated product record. Worst of all, because the update succeeds silently, nobody notices the problem until the day after the change goes live.
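
To make the first of those examples concrete, here is a repro sketch (SQL Server; the table, column and index names are purely illustrative) showing how an out-of-pipeline index turns a previously tested migration into a Production failure:

    -- Baseline, as captured in source control and deployed to dev:
    CREATE TABLE dbo.Orders
    (
        OrderId INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_Orders PRIMARY KEY,
        Amount  DECIMAL(9, 2)     NOT NULL
    );

    -- Drift: a DBA adds a performance index directly to Production, outside the pipeline.
    CREATE NONCLUSTERED INDEX IX_Orders_Amount ON dbo.Orders (Amount);

    -- The migration below succeeded in dev (where no such index exists), but fails
    -- in Production because the engine will not alter a column with an index bound to it.
    ALTER TABLE dbo.Orders ALTER COLUMN Amount DECIMAL(18, 4) NOT NULL;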

Prevention vs Cure

It is easy to point out that each of the above cases of schema and data drift could have been prevented from happening in the first place:

  • To avoid causing drift in Staging/Test environments, developers should always "fix forward" by adding a new migration, rather than deleting/modifying the original one. This ensures that abandoned features are always cleaned up
  • To avoid causing drift in Production environments, DBAs should commit changes to source control and follow the same process for propagating hotfixes to Production as developers
  • To avoid causing drift in source control itself, developers should never rely on non-deterministic values within their migrations, and should instead use explicit IDs and name all database objects (see the sketch after this list)
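
As an illustration of the last point, a drift-resistant migration names every object it creates and pins down the keys it touches. Here's a minimal sketch, assuming SQL Server and a hypothetical dbo.Product table (with ProductId as an identity column); the names and values are purely illustrative:

    -- Name constraints explicitly, so the name is identical in every environment:
    ALTER TABLE dbo.Product
        ADD CONSTRAINT DF_Product_IsActive DEFAULT (1) FOR IsActive;

    -- Insert reference data with explicit IDs, rather than relying on IDENTITY
    -- to hand out the same value everywhere:
    SET IDENTITY_INSERT dbo.Product ON;
    INSERT INTO dbo.Product (ProductId, ProductName, IsActive)
    VALUES (1001, N'Widget Classic', 1);
    SET IDENTITY_INSERT dbo.Product OFF;

    -- Later datafixes can then safely key on ProductId, rather than on a value
    -- that might differ between environments:
    UPDATE dbo.Product SET ProductName = N'Widget Original' WHERE ProductId = 1001;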

While conscientious teams will always strive to "do the right thing" and treat source control as the single source of truth, there will inevitably be those pesky edge cases that will thwart even the most determined efforts to keep the deployment pipeline ticking along:

  • A datafix may only be discovered to cause destructive changes at the acceptance test stage. In the majority of such cases, following a fix-forward strategy just isn't going to be worth the potential impact to customer data
  • The performance index may be needed to put out a Production fire, and for various reasons it may not be practical to utilise the pipeline when the DBA gets a 3AM alert

It may even be that your team always follows the process, but that you've simply inherited the (rotten) fruits of previous transgressions. Or perhaps your team suffers from drift for completely different reasons, such as being in the unenviable position of having another team (e.g. business support applying datafixes) or a COTS vendor making changes directly to your database.

So I have drift. Now what?

Regardless of how drift happens, you're going to want to get your pipeline back on track. But before you can do that, you first need to establish visibility of the drift occurring within your environments. Getting a timely heads-up on drift, whether prior to deployment or as it happens, puts you in a good position to assess its potential impact and to make or change release plans accordingly.

Without that heads-up, the picture of the schema and data in source control will progressively degrade. Ignoring the slow but tireless march of entropy will inevitably lead to an erosion of confidence in the database pipeline, one release failure at a time. Keeping an eye out for drift while delivering value at speed is the best way to keep entropy at bay.
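
If you want that heads-up before reaching for tooling, even a simple catalog query can surface schema changes made since the last release. Here's a rough sketch, assuming SQL Server and that your pipeline records the timestamp of its last deployment somewhere (hard-coded below purely for illustration):

    -- Hypothetical last-deployment timestamp; in practice, read it from wherever
    -- your pipeline records its deployment history.
    DECLARE @LastDeployment DATETIME2 = '2018-11-01T00:00:00';

    -- Any user object modified since then has potentially been changed outside the pipeline.
    SELECT SCHEMA_NAME(schema_id) AS [schema],
           name,
           type_desc,
           modify_date
    FROM sys.objects
    WHERE is_ms_shipped = 0
      AND modify_date > @LastDeployment
    ORDER BY modify_date DESC;

It won't tell you exactly what changed, and some kinds of drift won't show up in modify_date at all, but it's often enough of a prompt to run a proper diff before a release.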

Tool support

In terms of support for drift detection, there's plenty of off-the-shelf tooling available, both free and commercial (Disclosure: I previously created a tool with built-in drift detection called ReadyRoll, which is now part of a commercial solution offered by Redgate). Somewhat unfortunately, though, none of the purpose-built tools are able to detect drift in static or reference data, so rolling your own solution seems to be the only option for that part of the database right now.
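
For what it's worth, a home-grown reference-data check doesn't have to be elaborate. One rough approach (SQL Server; the table name is illustrative) is to compare a checksum of each reference table against the value recorded at deployment time; CHECKSUM can collide, so treat a mismatch as a prompt to run a proper data diff rather than as proof of drift:

    -- Order-insensitive checksum of a reference table; compare the result across
    -- environments, or against the value captured when the data was last deployed.
    SELECT CHECKSUM_AGG(CHECKSUM(*)) AS ReferenceDataChecksum
    FROM dbo.ProductCategory;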

Image credit to Shy Sol