A few weeks ago, a key component of the loan origination system my team is responsible for at Farm Credit Services of America had a problem. The loan origination system's main REST API calls this component when a user performs a certain action. In terms of overall usage, the component is involved in maybe 0.5% of all REST calls to the API. It is a key component, though: if it is not working, loans cannot be booked and money cannot be handed out. When the component went down, the loan origination system dutifully started logging the error, and the logging solution then sent out error emails. I saw the emails come in at a very high and unexpected volume for 7:30 in the morning. I was able to raise the issue with the right people and get it resolved as soon as possible.
Looking back on it, what would have happened if I had been sick? Or late that day? Or on vacation? It is rather easy to have a logging solution send out a notification when an error occurs, but emails have a nasty habit of being ignored. Often the volume of noise is so high that people just throw a filter on their inbox, and all error emails are routed to a subfolder or the trash. Issues then slip through the cracks; minor problems become serious problems, and serious problems become outages. How can we get smarter about alerting the right people when important issues occur?
The good news is that we already had the tool in place to handle that: New Relic. All we needed was a little collaboration between a web operations person and a lead developer to get this problem solved. In other words, DevOps.
New Relic Normal Error Reporting
New Relic is a fantastic tool for gathering data about a web application. It has all kinds of charts, graphs, and functionality to help monitor applications. One of the key metrics New Relic automatically captures is errors. New Relic treats any 500 or higher response code from an API as an error. A lot of the time, an error is recorded when a user makes a request and then closes the browser. The loan origination system shows the user a series of counters that auto-refresh every 5 minutes. A user's browser could request an updated count, and a millisecond later the user could close the browser. The API still tries to send that data back to the user, which causes an error to be thrown. As stated before, this key component only accounted for 0.5% of all traffic to the API. The error percentage on the site was high, almost 1%, but it wasn't abnormally high to anyone just glancing at the dashboard.
When the web operations person, Sean Speck, and I started working on identifying this issue, he said the same thing to me: it looks higher than normal, but not abnormally high. Almost all the errors were coming from one endpoint, the one used to calculate the customer's risk. Checking the error rate on that endpoint confirmed our suspicions: it was well north of 25% during the timeframe of the problem.
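To see why a failing endpoint can hide inside a healthy-looking site-wide number, here is a rough sketch of the arithmetic in Python. The 0.5% traffic share comes from the situation described above; the 30% endpoint error rate and the 0.4% background error rate are illustrative assumptions, not measured values from our system.

```python
def overall_error_rate(segments):
    """Weighted error rate for a list of (traffic_share, error_rate) pairs.

    Traffic shares are fractions of all requests and should sum to 1.
    """
    return sum(share * rate for share, rate in segments)


# Illustrative assumptions only, not measured values.
key_endpoint = (0.005, 0.30)      # 0.5% of traffic, 30% of its calls failing
everything_else = (0.995, 0.004)  # remaining traffic, assumed background noise

site_rate = overall_error_rate([key_endpoint, everything_else])
print(f"Site-wide error rate: {site_rate:.2%}")
print(f"Key endpoint error rate: {key_endpoint[1]:.2%}")
```

Under these assumptions the site-wide rate stays well under 1% even though nearly a third of the calls to the key endpoint fail, which is why the dashboard alone did not raise any eyebrows.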
Knowing that, he showed me how to mark a specific transaction as a key transaction through the New Relic web UI. Just go to the Transactions tab, click on the endpoint, and in the right pane there is a link that says "Track as Key Transaction."
Once that is clicked, you can give it a friendly name.
Creating an Alert
With the key transaction in place, we can now create an alert. This alert will only fire when certain conditions are met for the key transaction. For the initial alert, Sean and I decided to send notifications when the error rate reaches 10%. It is possible to stack multiple conditions on top of one another: there could be one condition for 10% on endpoint A and another for 5% on endpoint B.
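The idea of stacked conditions can be modeled in a few lines of Python. This is only a sketch of the logic, not New Relic's API: the `ErrorRateCondition` class, the `breached_conditions` function, and the observed rates are all hypothetical names and numbers made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRateCondition:
    transaction: str  # hypothetical key transaction name
    threshold: float  # error rate as a fraction, e.g. 0.10 for 10%


def breached_conditions(conditions, observed_rates):
    """Return the conditions whose observed error rate is at or above the threshold."""
    return [
        c for c in conditions
        if observed_rates.get(c.transaction, 0.0) >= c.threshold
    ]


# One condition per endpoint, each with its own threshold.
conditions = [
    ErrorRateCondition("Endpoint A", 0.10),
    ErrorRateCondition("Endpoint B", 0.05),
]
# Made-up observations: endpoint A is failing badly, endpoint B is healthy.
observed = {"Endpoint A": 0.27, "Endpoint B": 0.01}

for condition in breached_conditions(conditions, observed):
    print(f"ALERT: {condition.transaction} error rate is at or above {condition.threshold:.0%}")
```

Each condition is evaluated on its own, so a noisy endpoint B would not mask or trigger the alert for endpoint A.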
It is very easy to create a new alert. This article will go through the steps necessary.
Click on the alerts button at the top of the screen.
Then click on Alert Policies.
Then click on new alert. From there select APM -> Key Transaction metrics.
After that, select the key transaction.
Then it is a matter of defining the thresholds. There are several options to choose from, such as error count, error percentage, and Apdex score, to name a few.
By default, it will send notifications to the email of the person creating the transaction.
Several notification channels are available, including email, Slack, and PagerDuty. The webhook option is awesome because it allows for additional types of notifications without relying on New Relic to provide integration with a particular company or technology.
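To give a feel for what the webhook option makes possible, here is a minimal sketch of a receiver written with Python's standard library. The payload fields `policy_name`, `severity`, and `details` are assumptions made for illustration; check the payload your New Relic webhook is actually configured to send before relying on any of them. The function and class names are hypothetical as well.

```python
import json
from http.server import BaseHTTPRequestHandler


def parse_body(body):
    """Decode a JSON request body into a dict, or return None if it is not usable."""
    try:
        payload = json.loads(body or b"{}")
    except ValueError:  # covers malformed JSON and undecodable bytes
        return None
    return payload if isinstance(payload, dict) else None


def summarize_alert(payload):
    """Build a one-line summary from an alert payload (field names are assumed)."""
    severity = str(payload.get("severity", "unknown")).upper()
    policy = payload.get("policy_name", "unknown policy")
    details = payload.get("details", "no details provided")
    return f"[{severity}] {policy}: {details}"


class AlertHandler(BaseHTTPRequestHandler):
    """Accepts POSTed alerts and prints a summary; swap print for any notifier."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_body(self.rfile.read(length))
        if payload is None:
            self.send_response(400)  # reject bodies that are not JSON objects
        else:
            print(summarize_alert(payload))
            self.send_response(204)  # accepted, nothing to return
        self.end_headers()
```

To try it locally, something like `HTTPServer(("", 8080), AlertHandler).serve_forever()` (with `HTTPServer` imported from `http.server`) would start the listener. From there the summary could be forwarded to a chat room, a ticketing system, or anything else that accepts a message.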
Critical vs. Warning Alerts
In talking it over with Sean, we decided it made more sense for the loan origination system to have two levels of alerts: Critical and Warning. Critical alerts notify both the application development team and the web operations team. Something has gone belly up and needs to be fixed right away. Warnings are only sent to the application development team. A warning is something the team needs to look at soon, before a minor problem becomes a major outage. For example, a couple of endpoints are exposed to other systems. If the response time on one of them goes above 200 ms for over five minutes, something strange is going on that we need to look at and fix in an upcoming release, before the response time goes over a second. An alert like this is not something web operations needs to know about, because they can't fix it.
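The two ideas in this section, a threshold that must be exceeded for a sustained period and routing by severity, can be sketched in Python as follows. This is a simplified model under my own assumptions, not how New Relic evaluates conditions internally: the function names, the sample format, and the one-minute sampling in the example are all hypothetical.

```python
def sustained_breach(samples, threshold_ms=200.0, window_seconds=300):
    """Return True if response time stays above the threshold for the whole window.

    samples: list of (timestamp_seconds, response_ms) pairs sorted by time.
    A single sample at or below the threshold resets the clock.
    """
    breach_started = None
    for timestamp, response_ms in samples:
        if response_ms > threshold_ms:
            if breach_started is None:
                breach_started = timestamp
            if timestamp - breach_started >= window_seconds:
                return True
        else:
            breach_started = None
    return False


# Who hears about what, as described above.
ROUTES = {
    "critical": ["application development", "web operations"],
    "warning": ["application development"],
}


def recipients(severity):
    """Look up the teams that should be notified for a given severity."""
    return ROUTES[severity]


# Made-up samples: one reading per minute, all at 250 ms, for five minutes.
slow_samples = [(minute * 60, 250.0) for minute in range(6)]
if sustained_breach(slow_samples):
    print("Warning goes to:", ", ".join(recipients("warning")))
```

Requiring the breach to last the full window keeps a single slow request from paging anyone, while the routing table keeps warnings away from people who cannot act on them.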
This type of collaboration, in a nutshell, is what DevOps is all about. Operations has a fantastic tool available, but lacks the application knowledge to get the full functionality out of it. The developers lack knowledge about the tool but have the necessary application knowledge. When Sean Speck and I worked together, we each brought something different to the table and came up with a great first step. With the additional knowledge Sean provided, I was able to create more alerts to help my team maintain the application. Sean now knows a little more about our application and can help support us going forward. And making use of New Relic in this way allows us to be proactive instead of reactive.