How to bring down a Production SQL Server with one line of code

A small release shouldn't cause a major outage. It certainly shouldn't take down a production database server. In the end, it was one line of code that exposed several flaws and brought down the database server.

Wednesday night my team and I deployed a release to production. The release consisted of both the 1.0 and 2.0 versions of the application my team supports. We are in the process of rewriting the application, and some functionality has been ported over to 2.0. 1.0 still contains a large portion of the functionality, so we have to support both. Both versions of the application use a web-based interface. The changes deployed and made available to the users for each application were minor: a few bug fixes, a style change, a process improvement. Nothing that would stand out as something that would cause major problems.

After each release the paranoid side of me takes over and I tend to watch the performance monitoring tools we use, such as New Relic. Everything looked pretty good when I got into the office around 7:30 Thursday morning. New Relic showed the usual uptick in traffic as people arrived in the morning and opened up the 2.0 version of the application.

For those of you not familiar with New Relic (first, shame on you), it uses an aggregate score called Apdex to give you the general health of an application. Multiple factors tie into the Apdex score: response time, throughput, and error rate are some of them. An Apdex score of 1 is the best. A score of .5 is okay, but indicates problems. Out of the box, warnings are set up when the Apdex score drops below .7.
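For reference, the Apdex score is essentially a ratio of acceptable requests to total requests: requests faster than a target threshold count as satisfied, slower ones count as tolerating at half weight, and the rest count as frustrated and not at all. Roughly:

Apdex = (satisfied + tolerating / 2) / total requests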

The Day Of

The day before the release the Apdex score for the 2.0 version of the application was around .9. Pretty good, but with plenty of room for improvement. When I got into the office everything looked good. At 8:00 AM I watched the Apdex score start to plummet, eventually bottoming out at .25. The throughput being reported on the REST service kept increasing, up to the point where it reached 1600% of the throughput for the same time the previous day and the previous week. At 8:22 AM multiple applications started reporting they couldn't connect to the database server. The entire support staff stopped on a dime and started researching the problem. By 8:52 AM they had found something. One of the production database architects asked me if we could turn off our application in production. I didn't quite grasp what he meant by off. Did he want a job stopped? No, they wanted the entire application shut down. Both version 1.0 and 2.0.

After we got the necessary approvals and sent out the necessary notifications, the web admins shut down the application. The database server started responding again. After a short while of monitoring the server I was given the go-ahead to turn the application back on. It didn't take five minutes before they noticed problems and asked the web admins to turn it back off. The good news was we knew which application was causing all the problems. The bad news was we didn't know which version was causing the problem, because we had brought both of them online at the same time. A company-wide email was sent out asking people not to use the application.

The highest priority was getting 1.0 back online. 2.0 is a much nicer interface than 1.0, but the vast majority of the work is still done in 1.0. We asked the web admins to turn only 1.0 back on and waited. The database server kept humming along. Now we knew specifically which app was causing the problem: 2.0. The next step was rolling it back to the previous version. Thankfully the deployment didn't change too much of the schema. It added three columns to a table, none of which are used by the UI. The previous version was deployed to the pre-prod environment, verified, and then pushed up to production. 2.0 was back online.

A company-wide email was sent out saying the issues had been resolved. I stuck around the support area to help monitor the application. After the rollback, the REST service throughput being reported by New Relic was still 400% higher than the previous day. How was this possible? The application had been rolled back. It took a few minutes, but I eventually realized the problem. Version 2.0 of the application is a Single Page Application. If users never refreshed the page after the rollback, they still had the bad version of the JavaScript files. Some users had shut down their browsers and started them back up when they got the email, but not all of them. New Relic reported 99% of the traffic was hitting a single endpoint on the REST service. Only one part of the application hit that endpoint. I looked through the code and saw the culprit: a single line had been changed from a timeout to an interval. Eventually we got everyone to reload the page and the problem was mitigated.

The Flaws

Gatling Gun Requests

On the homepage of version 2.0 of the application the user can toggle counts on and off. When a count is turned on, it queries the database for data in a specific state. A user can have between 0 and 12 counts enabled on the homepage. The counts should refresh every 30 seconds. We recently discovered a bug where they would refresh once and then stop. The problem was the code was using a JavaScript timeout instead of a JavaScript interval. When we found the bug we correctly changed it to use an interval, as shown in the sketch below.
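Here is a minimal sketch of that kind of change; the function name and delay are illustrative, not the actual application code:

```javascript
// Before: setTimeout fires once, so the counts refreshed a single time and then stopped.
setTimeout(refreshCounts, 30000);

// After: setInterval fires every 30 seconds for as long as the page stays open.
setInterval(refreshCounts, 30000);
```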

However, the problem lay in how the counts were retrieved from the REST service. The JavaScript code didn't queue up the requests when getting the counts. Instead, it would loop through the counts and fire off a request for each one. If the user had 12 counts enabled, 12 requests would be sent up to the REST service within a few milliseconds of each other (hence the 1600% increase in throughput).
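A hypothetical sketch of that fan-out (the endpoint, helper names, and use of fetch are my assumptions, not the actual code):

```javascript
// Illustrative only: one request per enabled count, fired back to back with no batching.
function refreshCounts(enabledCounts) {
  enabledCounts.forEach(function (count) {
    // With 12 counts enabled, 12 of these hit the REST service within milliseconds.
    fetch('/api/counts/' + count.id)
      .then(function (response) { return response.json(); })
      .then(function (result) { updateCountBadge(result); });
  });
}
```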

Expensive Queries and Deadlocks

Each count queries the same three fields, just with different parameter values. There was no index on these fields. Each query cost 500 milliseconds of CPU and had 94,000 reads. In addition to the expensive select statements, the table was being written to multiple times per minute. Remember, the code was sending up to 12 requests per user every 30 seconds. The database couldn't keep up and started throwing deadlock errors. Eventually the server was using all of its resources just to handle the expensive queries.
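To put some rough numbers on it (the user count here is hypothetical, just to illustrate the math): if 100 users each had 12 counts enabled, that is 1,200 queries every 30 seconds, or 40 queries per second. At 500 milliseconds of CPU per query, that works out to roughly 20 seconds of CPU time demanded every second of wall-clock time, before counting the writes hitting the same table.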

Unrealistic Load Test

All of the load tests we performed were done in a pre-production environment. The pre-production environment had the same number of records being queried; however, the data was not being updated at nearly the same rate as in production. The load tests only performed select statements, because that is all the homepage supports. The UI, at this time, doesn't support the creation, updating, or deletion of data in the same table the homepage selects from.

The Fixes

Hindsight being what it is, all of these should have been done a lot sooner. Some of them don't take long to do, while others will take a while (namely fixing up the code).

Indexes

The production DBAs added an index on those three fields to the table. The query cost dropped like a rock to 0 milliseconds of CPU time and 4 reads.

Realistic Load Tests

A job was created that randomly inserts data into the table the homepage selects from. When I ran a load test in the pre-production environment with that job enabled and the index on the table, at a higher throughput than what was seen Thursday morning, I was able to see deadlock errors getting thrown. This will allow us to establish a baseline as the code is fixed. Without any code changes, the index alone made it so we could push through 80% of the max throughput seen Thursday morning without any deadlock errors. As the code is changed, more load tests will be run to see if each change helps the problem or makes it worse.

Code Fixes

A lot of ideas have been thrown around as to how to get the code to query the database more efficiently. Nothing has been solidified yet, but I can say it will most likely involve caching and bundling of requests. Instead of 12 unique requests per user, it will be 1 request asking for 12 counts.
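As a rough sketch of the bundling idea (again, the endpoint and names are hypothetical, not a finalized design):

```javascript
// Illustrative only: one request asks for all enabled counts instead of one request per count.
function refreshCounts(enabledCounts) {
  var ids = enabledCounts.map(function (count) { return count.id; });
  fetch('/api/counts?ids=' + ids.join(','))
    .then(function (response) { return response.json(); })
    .then(function (results) {
      results.forEach(function (result) { updateCountBadge(result); });
    });
}
```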

Successes

The day was not without its successes. Don't get me wrong, bringing down a production database is probably one of the worst things code can do. But there were a number of successes throughout the day.

Continuous Delivery Practices - The Importance of Artifacts

In my previous posts I have talked about some practices to put into place for continuous delivery with the database. Thankfully, all of those practices have been in place for some time with my team. Every time code is checked in, a build kicks off. That build uses Red Gate tools to create a deployment script and a rollback script. In addition, it zips up all the SSIS packages and creates MSDeploy packages for the websites. The zip file of the SSIS packages, the MSDeploy packages, the deployment script, and the rollback script are placed in the artifact repository on the build server. When it came time to roll back the database, I just had to pull those artifacts out of the previous build that went to production.

In all fairness, the web admins had previous versions of the website stored as well, so they beat me to it. If they hadn't been able to find the previous version, I would have used the one in the artifact repository.

New Relic

New Relic was a huge help Thursday morning. It alerted us to the problem right away and helped track down the root cause. With its reporting we could see the huge throughput increase compared to the previous day and the same day the previous week. We could also see that 99% of all requests were hitting the same endpoint. That isolated the problem much more quickly than would have been possible had that reporting not been in place.

Support Staff

It would not have been possible to fix the problem as quickly as we did were it not for the support staff: the production DBAs fixing the database, the web admins deploying the code, and tech support fielding all the calls for us and keeping the rest of the company up to date. Each of them provided an invaluable service, and their professionalism in addressing the problem was second to none.