Rollbacks with Automated Database Deployments

Rolling back a database deployment is one the most complicated topics with Automated Database Deployments. The code is rather easy to rollback, just deploy the previous version of the code package and call it good. It is just as easy to rollback database schema changes using Redgate's Database Lifecycle Management (DLM) tooling. Unlike code, the database stores data (hence the name!), and for some reason users don't like when their data gets messed up. That is why database rollbacks are so complex.

The most important factor to remember is how will the rollback affect the users of your application? My team is responsible for one of the loan origination systems at Farm Credit Services of America. It can take a user anywhere from 5 minutes to a half hour to enter a loan into the system (it depends on the type of loan, which influences the amount of data the user is required to input). The last thing we want to do is tell a user, "remember that loan your spent 30 minutes of your life entering in? Did you like entering in all that data? I hope so because you get to do it again!"

Why is rolling back so difficult? It should be a matter of redeploying the previous version of the database. At Farm Credit Services of America, we use Octopus Deploy to invoke Redgate's DLM tools to handle all our deployments. We have set it up so we have the current version of the database and previous version of the database in production. It should make that a matter of pushing a button.

But it is not that easy. There are multiple factors that come into play with making the decision to rollback.

Time Since Deployment

The amount of time between the deployment and the decision to rollback will greatly influence when if changes can be rolled back. Also to consider are the number of users who have used the application since the deployment. Is the decision to rollback being made shortly after an off hours deployment where the only users have been testers? Or, has it been several hours and many users have been in the system?

The rough timeframes to take into account:

  1. Verification - can be anywhere from one minute to one hour after the deployment, potential to upset the users with a rollback is very low during this time. Especially if done off hours.
  2. Zero Day - a critical bug is discovered within an hour or two of users getting into the application, potential to upset the users with a rollback is low to medium depending on how many hours have passed and how much data is entered.
  3. After Day One - A critical bug is not discovered until one or more days have passed since deployment, potential to upset users with a rollback is high to very high to "I'm going to meet you in the parking lot after work."

Complexity of deployment

In a perfect world, all of our applications would be using continuous deployments and push to production 20 times a day. That is the world I want to get to because it makes this discussion rather moot. For a large majority of developers, that is not the case. It might be several months before a release is pushed to production. That release could be lucky and only have minor nondestructive changes to the database. Or, more likely, it could be rather complex with a combination of destructive and nondestructive changes. The more complex the database changes the less likely there will be a chance to rollback.

Schema Changes

The type of change being made to the database will greatly influence if the database should be rolled back. If all that was added was a new column, then it might make sense to leave the column. Unless the column is the problem, then you might want to roll it back.

Most database changes can fall into two buckets, nondestructive and destructive.

Nondestructive Schema Changes

Here is a brief list of what can be considered as nondestructive database changes.

  1. New Table or New Columns
  2. Existing Stored Procedure/Function/View where the changes were new column or table added to procedure
  3. New Stored Procedure/Function/View

The biggest question to ask when considering a rollback with nondestructive changes, when will the code be rolled forward to fix the bug and what kind of problem is it? Would it hurt to keep these changes in place? Will the code be rolled back, the critical bug is fixed and then rolled forward within a few hours? Or, is the change to the database the cause of these problems?

Destructive Schema Changes

Here is a brief list of what can be considered destructive database changes

  1. Dropping Table or Columns
  2. Removing Stored Procedures/Functions/Views
  3. Renaming Tables/Columns/Stored Procedures/Functions/Views
  4. Migrating columns from one table to another

Rolling back destructive schema changes is very complex. You will need a backup of the database taken prior to the deployment. If the critical bug is discovered within a few minutes of the deployment then the backup can be restored. However, if the critical bug is discovered days after the deployment then that backup is pretty much useless.

Data Migration

There are some deployments where it takes 30 minutes or longer to deploy the database changes. Often this happens because a migration script needs to run where a large amount of data is moved around. Rolling back that migration will most likely cause even more problems. Chances are something will be missed or incorrectly rolled back.

The Rollback Decision

As much as we like to talk about rolling back database changes, the reality is, very few times does a rollback actually occur. The Rubicon is crossed rather quickly. Much quicker than most people expect. The second the code and database are deployed the clock starts ticking down.

For the loan origination system, my team is responsible the time before Rubicon is during verification of an off-hours deployment. Our DBAs have setup a point in time backup which automatically runs every 15 minutes. With Octopus Deploy handling our deployments we can simply redeploy the previous version of the database with nondestructive changes. For destructive changes, we would rollback to the point in time prior to the deployment. After the verification process is complete we would only rollback if the bug was so critical is stopped our users from doing their job. It has happened before. Heck, we brought down an entire SQL Server with one line of code. The majority of the time we fix the bug and roll forward.

Yes, Redgate's DLM tooling makes it possible to rollback. But, with all of those factors discussed above, chances are rolling back is a lot harder and more costly that fixing the bug. So we haven't even tried to automate it. How could you? It is easier and faster to fix the bug. And that is the benefit to automating database deployments. It makes it easier to fix bugs and roll forward quickly. If necessary, we can get a bug fix to production in 30 minutes and have the bug fix flow through all four of our environments with our automated tests helping to ensure the fix doesn't breaking anything else.