Automated Deployments Are Mission Critical
In my late twenties, I was hired to work on a project for a new client. The project was to add a lot of new features to a SaaS application. The client was the largest player in its particular industry and wanted a lot of customizations. After several months of development work, it was time for the QA process to start. The company, at this particular point, followed the waterfall software development methodology, almost to the point of being extreme. The first step in the process was getting the code installed on the QA servers. I asked a couple of the long-timers to help me out. The first thing they did was have me open a Microsoft Word template. Each deployment, be it to QA, pre-production, or production, had to be done manually by a system administrator and DBAs. The system administrator and DBAs had no idea how the application worked; they were only involved in the installation, so they had to be given a script. That script had to follow a template.
The initial template was close to 10 pages long.
Ten pages with no actual content.
Each install required a separate script because the server names changed. The idea was that the developer would spend time creating a release script template from the initial template. That release script template would then be reused for each subsequent build to each environment until the release was finally deployed to production. The first version of a release script template took anywhere from two hours to two days to write. This being my first time, it took me almost two days. I got better as time went on and was soon approaching the two-hour mark.
Installing the code on the QA servers followed the same process as installing on pre-production and production. The core idea was that all the kinks would be worked out by the time the install to production occurred.
Side Note: I 100% agree with that idea. Except it should apply across all environments, including development. And it should be automated. The idea was sound; the execution was flawed.
As stated before, all the installations were done manually: the system administrators installed the code and the DBAs deployed the databases. QA would verify each release. The developers would pre-verify changes, such as database changes, to make sure the release was going smoothly. A phone conference was set up for each install for all the people involved with the release. Below is a typical transcript.
- System Admin: Taking Server A out of the load balancer.
- QA: Verified Server A is out of the load balancer.
- System Admin: Shutting down the website on Server A.
- QA: Verified the website is shut down on Server A.
- DBA: Am I good to run the scripts?
- Developer: Yes, go ahead
- DBA: Finished running script #1
- Developer: Wait one, verifying the database changes. Okay, the changes look good
- DBA: There was an error running script #2
- Developer: Please send the message to me, I will take a look
- DBA: On its way
- Developer: Ah I see the problem, I made a small change to the script, please re-run
- DBA: Script ran without error
- Developer: Verified. I think we should be good to proceed to the code install steps.
- System Admin: Finished installing on Server A
- QA: Starting verification on Server A
- 10-45 minutes later QA: Server A is verified, let's repeat for server B
The first time I got on one of those phone calls I thought we were launching rockets from Cape Canaveral.
Here is the kicker. The SLA stated production installs could only occur between 2 AM and 7 AM on Saturday morning. Doing that process for QA and Pre-Production was no picnic during working hours. The day before the production deployment the same debate would rage: is it better to go to sleep at 8:30 PM and wake up at 1:45 AM, or just stay up? After watching the sunrise on a couple of installs I knew the right answer: go to sleep and wake up.
Unless there was a total system meltdown, the next deployment could not occur until the next time window.
The hidden cost
It took a village to deploy code to production. And for all intents and purposes, we only had one shot to get it right. If something was discovered the Monday after the deployment, we would have to wait at least five days to release a fix to production.
The QA team was under a lot of pressure because of that. Their response was natural; they wanted to do a full regression test before approving a release to production. But they also had to make sure deadlines were met. The system was over ten years old at that point. There were some automated tests, but 95% of all the regression testing was done manually. Because of their time crunch, they didn't have time to write more automated tests. On average, a QA cycle would last anywhere from four to twelve weeks, depending on the functionality changed.
The development team despised creating those deployment scripts. It was a major time suck and prevented the developer from doing the thing they love to do the most, coding. At the start of the QA cycle, all the developers who worked on the project would be assigned to help QA because that was when most of the bugs were found. As time passed the number of bugs diminished. Not as many developers were needed, and several of them would be assigned to other projects. But there were always one or two still stuck on the project.
The product managers would see the large testing effort and do their best to cram in as many changes as possible if a developer was changing a page. For example, a major client wants a new field added to a web page. On the backlog, there are several other feature requests for that web page. What turned into a simple field addition becomes a mini-project of its own.
In addition to that, no one is a fan of staying up all Friday night for work. Especially system administrators and DBAs who have to support dozens of applications. Each deployment took at least an hour, if not more. They wanted to keep their sanity and only allowed one or two releases a night.
In other words:
The harder you make your deployments, the longer the software development lifecycle will be.
The longer the software development lifecycle is, the more waste occurs and the more money a software change will cost.
When I started out cycling I told people there was no way I was going to wear those goofy shorts and shoes and put on a garish biking jersey. Within two years I was doing all of that. The shorts were the first thing I broke on. Although they look goofy, they have a nice amount of padding and no seams. They made the bike ride much more comfortable. Then came the shoes; I was tired of my foot falling off the pedal. Adding straps made it worse, because anytime my foot moved my shoe would slip into an awkward angle. Having biking shoes and pedals eliminated that problem. The jersey came next. I was used to wearing a backpack to carry my ride snacks. Riding a road bike with a backpack is very uncomfortable. The straps dig in weird spots. The bag itself rides up the back and catches the wind. All biking jerseys have pockets in the back for snacks and other items. Another problem solved.
The point is, there was an evolution to me arriving at wearing the full bike kit. Just like there was an evolution to the deployment process I described at the start of this article. No group sits in a room and comes up with something like that in the first iteration.
Deploying between 2 AM and 7 AM Saturday morning most likely came about because changes were being made in production in the middle of the day compromising the stability of the system. Logs were examined and it was determined that was the time when the traffic was minimal.
Having a lengthy script happened because there was no consistency between development teams. The script template covered virtually every possible scenario one could think of for the company. It forced consistency as well as an audit trail.
Having system administrators and DBAs do the deployments is obvious. Developers are neither. One too many misconfigurations and deployments brought the system down one too many times. People were brought in to bring order to the chaos.
QA doing the verification is another obvious one. They are the ones ensuring the health of the system. They don't let things slide like developers would. That is why they are good in that role. And doing the same deployment over and over helped QA (and the rest of the organization) feel confident on the night of the release.
All of those things are good. I was told since that process was put into place the number of emergency deployments dropped to once a year versus once a month. I can't argue with that result. The problem was the lack of automation and trust. A website deployment should be measured in minutes not hours.
Breaking the cycle
That success rate had a negative side effect. No one had a desire to improve it. Any type of change could increase the risk. Why change something when it is working fine? It is only working fine if you are looking at it from a singular perspective, which is measuring how often things break when a release is deployed to production.
However, the failure rate is never the only metric being looked at to determine if the software development lifecycle is successful. Responding to user feedback, making a better user experience, innovation, time to market, throughput...the list goes on.
Which brings me to my second major point.
The longer the software development lifecycle, the slower you are able to adopt and embrace change.
The deployment process is often an afterthought. As explained earlier, a slow deployment process hinders the ability to deliver value to the user, and in turn, help the company turn a profit. It is here I present my final argument.
Making the deployment process as streamlined and as easy as possible should be a critical priority. Resources should be allocated to solving this problem.
And by resource allocation, I am referring to money and people devoting the necessary time to solving this problem. It won't happen overnight and it will take several iterations until it is right.
Points and counterpoints
This is not something that can be solved easily. Greenfield development is easy(ish), the system can be architected from the ground up to handle multiple deployments a day. Existing systems are a little more difficult. There are a lot of conversations which need to occur.
In my past experiences, I've been party to many of these types of conversations. Mentally prepare yourself for the following points to be raised.
Our SLA demands 99.9% uptime, and maintenance is permitted only during scheduled windows
A deployment doesn't mean there will be downtime for an application. This has already been solved for. The tools and techniques have been around for many, many years. A new technique gaining traction is the concept of Blue/Green deployment.
Back in 2009, Flickr famously delivered a presentation indicating they deployed to production 10 times a day. Flickr is a global website. I'm more than positive if we looked at their logs there would be very small lulls in traffic. It is running 24 hours a day, 7 days a week, 52 weeks a year.
Facebook has over a billion users. They make a lot of their money with ads on their mobile applications. They release a new version of the mobile application once a week.
Microsoft's Azure has a stated uptime of 99.9% in their contracts. That is a 24/7/365 service, which for all intents and purposes, powers a large part of the internet. They still release during the day.
When starting out with this change it is important to categorize the code changes into two main buckets. In bucket one, you have a release which can be deployed during the day. The second bucket has a very complex release which must be deployed during off hours. Some of the key indicators of what can go into bucket one are:
- Are there any destructive database schema changes requiring a complex rollback script? If there are, then it should be deployed off hours.
- How many changes are being deployed, hundreds or a dozen? The more changes being deployed, the higher the risk. Pick some number and say that once a release has more stories than that, it moves to an off-hours release.
- Are there any new features being deployed? Are they feature flagged so the code can be deployed but only a select number of users can see it? Is this something which should be shown in the middle of the day? If not then it should be moved to off hours.
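As a rough sketch, those questions can be turned into a small check that runs as part of release planning. Everything below is made up for illustration: the threshold of 12 stories, the field names, and the simplification that any new feature not behind a feature flag goes to off hours.

```python
from dataclasses import dataclass

# Hypothetical threshold; the point is to pick a number and stick to it.
MAX_DAYTIME_STORIES = 12


@dataclass
class Release:
    story_count: int
    has_destructive_schema_change: bool
    has_new_features: bool
    new_features_flagged: bool  # True when new features sit behind feature flags


def release_bucket(release: Release) -> str:
    """Return "daytime" or "off-hours" based on the three questions above."""
    if release.has_destructive_schema_change:
        return "off-hours"
    if release.story_count > MAX_DAYTIME_STORIES:
        return "off-hours"
    if release.has_new_features and not release.new_features_flagged:
        return "off-hours"
    return "daytime"


print(release_bucket(Release(5, False, True, True)))    # daytime
print(release_bucket(Release(5, True, False, False)))   # off-hours
print(release_bucket(Release(40, False, False, False))) # off-hours
```

The value of writing the rules down, even this crudely, is that the decision stops being a debate on every release.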
Almost every developer and operations person prefers to deploy during business hours. Everyone is there to fix any problems. When a preferable option is presented people will do what it takes to make it happen.
Deploying code is easy, we can't deploy a database change in the middle of the day
You can deploy database changes in the middle of the day. Doing so requires code discipline and a fundamental understanding of how database changes are deployed. For example, adding a nullable column to a table only changes the table's metadata and takes only a few milliseconds to deploy. Adding a non-nullable column with a default value to a table can take a few seconds to several minutes depending on the size of the table.
A fundamental understanding of how database changes are deployed helps, but you also need to always ask: can multiple versions of the code handle this database change? Adding a nullable column is backward compatible; the old code can insert records without the new data and the new code will take advantage of the new column. But the new code should also be written in such a way that it can handle a null value from that new column. Another option is to make use of feature flags. Do not allow the new code to write to the new column until the code is deployed to all the servers and the feature flag is turned on.
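Here is a minimal sketch of that idea using an in-memory SQLite database. The table, the column, and the FLAGS dictionary are all hypothetical stand-ins; a real system would use its own schema and feature flag service.

```python
import sqlite3

# Hypothetical feature flag store; stays off until every server runs the new code.
FLAGS = {"write_nickname": False}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customer (name) VALUES ('Ada')")

# The database deployment: a nullable column, so existing rows simply hold NULL.
conn.execute("ALTER TABLE customer ADD COLUMN nickname TEXT")


def old_insert(name):
    """Old code: unaware of the new column, and it keeps working."""
    conn.execute("INSERT INTO customer (name) VALUES (?)", (name,))


def new_insert(name, nickname):
    """New code: writes the new column only when the flag is on."""
    value = nickname if FLAGS["write_nickname"] else None
    conn.execute(
        "INSERT INTO customer (name, nickname) VALUES (?, ?)", (name, value)
    )


def display_names():
    """New code reading: falls back to name when nickname is NULL."""
    rows = conn.execute("SELECT name, nickname FROM customer ORDER BY id")
    return [nickname or name for name, nickname in rows]


old_insert("Grace")              # an old server, mid-rollout
new_insert("Linus", "Tux")       # a new server, flag still off
FLAGS["write_nickname"] = True   # rollout finished, flag turned on
new_insert("Margaret", "Maggie")
print(display_names())           # ['Ada', 'Grace', 'Linus', 'Maggie']
```

Both versions of the code run against the same schema at the same time, which is exactly the situation in the middle of a daytime deployment.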
Moving a column from one table to another is a little trickier but not impossible. I found the best way to handle this is to add the column to the new table and keep the column on the old table for the first release. Then in a subsequent release remove the column from the old table. Doing this makes rolling back easier as well. The important thing to remember is to have a cleanup script run after the deployment is complete to finish moving data from one table to another. Discipline is needed in this case because it is very easy to forget to go back through and clean up the old column.
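The two-release approach might look something like the sketch below, again with SQLite and made-up table names. The version guard on the final step is only there because SQLite did not support dropping a column before version 3.35.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE contact (customer_id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO customer VALUES (1, 'Ada', '555-0100'), (2, 'Grace', NULL);
    INSERT INTO contact VALUES (1, 'ada@example.com'), (2, 'grace@example.com');
""")

# Release 1: add the column to the new table. The old column stays in place,
# so old code keeps working and a rollback needs no schema change.
conn.execute("ALTER TABLE contact ADD COLUMN phone TEXT")


def cleanup_copy_phone():
    """Post-deployment cleanup: copy values not yet moved. Safe to re-run."""
    conn.execute("""
        UPDATE contact
           SET phone = (SELECT phone FROM customer
                         WHERE customer.id = contact.customer_id)
         WHERE phone IS NULL
    """)


cleanup_copy_phone()
moved = conn.execute(
    "SELECT customer_id, phone FROM contact ORDER BY customer_id"
).fetchall()
print(moved)  # [(1, '555-0100'), (2, None)]

# Release 2 (a later deployment): remove the old column once nothing reads it.
if sqlite3.sqlite_version_info >= (3, 35, 0):
    conn.execute("ALTER TABLE customer DROP COLUMN phone")
```

Putting the final drop on the backlog as its own tracked item is one way to make sure the old column is not forgotten.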
If all else fails and none of the above techniques work then that deployment would fall into the off-hours release bucket. But that should be a last resort, not the go-to option.
We only know the code works in production after we deploy it and verify it
Ahh, the old chicken and egg conundrum.
To begin with, a lot of this risk can be mitigated by making use of tools such as Octopus Deploy to deploy the code, and Redgate's DLM Automation suite with Octopus Deploy to deploy database changes. Octopus Deploy forces you to follow the "build once, deploy everywhere" methodology that continuous integration, continuous delivery and DevOps are built on. By the time the code and database changes are deployed to production there should be no doubt as to what will happen because the deployment will have been tested many times in the lower environments.
With that being said, I would challenge what exactly is being verified in production. The changes that were just verified in the lower environments? If so, this is where I would argue automated tests should come into play. I am willing to bet the majority of the applications we write have at least one user, if not many users and/or many client tenants. In the case of a multi-tenant system, a test tenant should be set up in all environments to run tests against. The code can first be deployed to a "staging" area in production (take a server out of the load balancer and deploy to it), and the database change should be deployed to that test tenant database. Then automated tests should be run against that. This only works when each tenant has their own database. For systems where there is only one database, a backup and restore for testing will need to occur, and the "staged" code will need to be altered to point to that restored copy of the database. More work, but not impossible to solve for.
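To make the staging idea concrete, here is a small sketch of the loop: take a server out of the load balancer, deploy to it, run automated tests against it, and only put it back if they pass. The function and parameter names are hypothetical, and the deploy and smoke_test callables stand in for whatever your deployment tool and test suite actually provide.

```python
from typing import Callable, List, Set


def rolling_deploy(
    servers: List[str],
    in_rotation: Set[str],
    deploy: Callable[[str], None],
    smoke_test: Callable[[str], bool],
) -> List[str]:
    """Deploy to one server at a time; return the servers that were upgraded.

    Stops at the first failed smoke test, leaving that server out of rotation
    while the remaining servers keep serving the old version.
    """
    upgraded = []
    for server in servers:
        in_rotation.discard(server)   # take the server out of the load balancer
        deploy(server)                # install the new build on the staged server
        if not smoke_test(server):    # automated tests, e.g. against a test tenant
            break
        in_rotation.add(server)       # verified, so put it back in rotation
        upgraded.append(server)
    return upgraded


# Simulated run: the smoke test fails on server B.
pool = {"A", "B"}
done = rolling_deploy(
    ["A", "B"], pool, deploy=lambda s: None, smoke_test=lambda s: s != "B"
)
print(done, sorted(pool))  # ['A'] ['A']
```

In this simulated run the failing server never receives user traffic, which is the whole point of verifying before going back into rotation.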
Operations is afraid to let us deploy in the middle of the day
Look, I get it, operations' job is to maintain the health of the system. Keep in mind, most operations people have no idea how the applications work. In the past, developers have had the mindset to just throw the code over the wall and hope it works in production.
This is where you need DevOps. It is Developers and Operations folks working together. Developers need to teach operations how their apps function and operations need to teach developers what is required to deploy changes in the middle of the day. This collaboration must happen. It is the only way to build trust.
It is either that, or developers take over operations. When that happens I can guarantee the end result. Without an experienced operations staff monitoring and maintaining the servers, something bad will happen. The server could get so far behind on patches that it eventually gets hacked, or the code could place more and more resource demands on it until it can't keep up and starts randomly crashing.
The process was put into place because there needs to be an audit trail
Any process should have an audit trail. You always need to know when code was deployed, who deployed the code, what code was deployed and so on. In my previous experiences with an audit trail, it all came back to a person or persons manually entering into some sort of form when an action was performed, be it in a Microsoft Word document or an online form. They were asked to do this task after the fact. After a multi-hour deployment a person is tired, they just want to relax, the last thing they want to do is fill out some sort of form. After the multi-hour deployment they fill out the form as best they can, but unless they kept detailed notes, the time frames are guesses at best. And the majority of the people don't keep tabs on when code is finished deploying.
This is where tooling is so important. One of my absolute favorite tools is Octopus Deploy. Out of the box, it includes auditing. Every action is stored in the database. When it started, who initiated it, when it was completed, what version was deployed, who approved the deployment and so on. It does what computers do best, automate the capturing of minute details.
Even better is when Octopus Deploy is paired with Redgate's DLM Automation suite for database deployments. The source control component keeps track of who made a database schema change as well as when the change was made. In addition to that, the deployment tool provides a report detailing out exactly what changed during a deployment.
The point is, the majority of the tooling available to help with deployments includes auditing. The companies making those tools know an audit trail is important. Every one of their clients wants one.
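To show why this is a job for a computer rather than a tired person with a form, here is a toy sketch of automatic audit capture. It is not how Octopus Deploy or any other product implements auditing; the AUDIT_LOG list and the field names are assumptions for illustration only.

```python
from contextlib import contextmanager
from datetime import datetime, timezone

AUDIT_LOG = []  # stand-in for a database table


@contextmanager
def audited_deployment(version, environment, initiated_by):
    """Record who deployed what, where, and when, with no manual notes."""
    entry = {
        "version": version,
        "environment": environment,
        "initiated_by": initiated_by,
        "started_at": datetime.now(timezone.utc),
        "status": "running",
    }
    AUDIT_LOG.append(entry)
    try:
        yield entry
        entry["status"] = "succeeded"
    except Exception:
        entry["status"] = "failed"
        raise
    finally:
        # Runs on success and on failure, so the end time is never a guess.
        entry["completed_at"] = datetime.now(timezone.utc)


with audited_deployment("1.4.2", "QA", initiated_by="bob"):
    pass  # the actual deployment steps would run here

print(AUDIT_LOG[0]["status"], AUDIT_LOG[0]["version"])  # succeeded 1.4.2
```

The timestamps are captured at the moment the work happens, not reconstructed afterwards from memory.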
The tooling is too expensive
Both Octopus Deploy and Redgate's DLM Automation Suite cost money. A team of ten developers can expect to spend around $24,000 to get both those tools. That is a lot of money. That is the visible cost. Without the tooling, developers and/or system administrators are responsible for manually deploying the code and database changes.
Assume for a minute the capital cost for a developer is $50/hour/developer. $24,000 / 50 = 480 hours, or roughly 60 days of a single developer working 8 hours a day. But the tooling is designed to work in all environments, not just production. How often are developers, QA, DBAs and System Administrators deploying changes to the various environments throughout the day? How about the time spent trying to track down an issue because it worked in development and testing but not in production because a script was missed?
Time to do some different math. $24,000 / 50 = 480 hours. The license cost was for 10 developers. Time to divide that 480 hours by 10. Now it is down to 48 hours per developer, or a little over a single week's worth of work.
What if you estimated the cost of a developer at $100/hour/developer? That works out to 24 hours per person. Barely over half a week of work.
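The same arithmetic as a small function, in case you want to plug in your own numbers. The function name is made up; the inputs are the figures used above.

```python
def break_even_hours(license_cost, hourly_rate, developers):
    """Hours each developer must save before the tooling has paid for itself."""
    return license_cost / hourly_rate / developers


print(break_even_hours(24_000, 50, 1))    # 480.0 hours for a single developer
print(break_even_hours(24_000, 50, 10))   # 48.0 hours each across ten developers
print(break_even_hours(24_000, 100, 10))  # 24.0 hours each at $100/hour
```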
Before I started using Octopus Deploy and Redgate it took an average of 30 minutes to deploy code and the database to production. After a year of using the tool that time was cut down to less than 5 minutes. And that is just for a single environment.
The elephant in the room I have yet to address directly is code quality. Being able to deploy more easily will lead to higher code quality. A bug is found in production, and as soon as the fix is verified in the testing environments it can be moved to production. It exposes everyone in the process to automation, and once people see the benefit of automation they will start looking around at what else can be automated. I've seen many times where the next item in line is automating the verification process that occurs in production. Once that is solved for in production there is no reason why it can't be brought down to the testing environments (and it should probably be proved out there), which means less of a chance of lower quality code making it to production.
I believe most people are rational. When presented with a solid case backed with evidence their opinions can change. What I am trying to say is if you want to start making these changes tomorrow, don't go into the office and liberally quote from this article. To some people, you might sound like a crazy person. Put together a proof of concept and solve some of the major issues plaguing your team. Get all the decision makers in a room, demonstrate the proof of concept, and ask for feedback. Collaborate with operations to address their concerns. Talk with QA and see what tests can be automated. Make everyone feel like they are part of the process. List out all the goals and problems you are trying to solve. Come up with a realistic plan to achieve those goals and tell as well as show those plans to anyone who will listen. Most deployment processes are rather entrenched, so it will take a lot of persistence to enact change. It is not something which can be changed overnight.
The payoff is worth it in the end. Anything that prevents 2 AM deployments is worth the effort.