There is a magic sense of confidence you get when a status dashboard is all green. Suddenly, a highly complex system with dozens or hundreds of modules is saying ‘well, so far so good’. It is visually alive in front of you, showing how many users are connected, how many transactions are being generated and how many resources are being used to serve all that.
Otherwise, the only way you’ll find out that something is wrong is an angry customer calling to tell you that your system is down, and you didn’t know. Not anyone’s idea of a happy day.
Consider the following scenario: your bank’s webpage has been down for 5 hours. You and all the other rich clients cannot access your money to do business, on the last working day of the year.
How many people at the bank do you think are aware of the situation? How many different areas?
IT services are no longer exclusively an IT matter. Downtime affects the performance metrics of many people across the bank, and many managers are being affected by this IT-nerd-issue that is still unsolved.
Knowing this, it sounds very reasonable that these managers and their staff should be well informed about the health of the services that affect their relationship with the clients. The IT crowd is not likely to be the one receiving the end customers’ calls, is it?
System health dashboards display summary statistics for the components that make a system work, allowing for a very transparent and collaborative relationship between all areas. Business people should have metrics for the systems involved in their work.
User front ends – avoid that angry call
Whether it is a stock exchange transactional platform or an e-commerce site, there is a tight relationship between business metrics and server health. The customer support phone starts ringing because users intermittently cannot log in, for no apparent reason.
Do you measure the «Successful user login» metric?
If so, it’s a very good start. Then let’s put «Successful user login» up on the dashboard and add the average from a week ago and from a month ago, to see some trends, shall we?
What about setting an alert when the average rate of successful logins drops below the normal 150/min? And then an alarm if it drops further, to 75/min?
Simple rules like these allow the team behind the product (operations or developers) to react fast: to start investigating sooner and resume normal operations before major portions of the user base experience the problem. Maybe it is just someone running an unscheduled backup in the middle of the day and using almost all the disk I/O on the database server. Happens in the best families 🙂
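To make the rule concrete, here is a minimal sketch in Python. The function name, the sample values and the exact comparison at the boundaries are assumptions made for illustration; only the 150/min and 75/min levels come from the example above. In a real setup the monitoring tool would evaluate the rule for you.

```python
def classify_login_rate(logins_per_min, alert_below=150.0, alarm_at_or_below=75.0):
    """Return 'ok', 'alert' or 'alarm' for an average successful-login rate.

    The thresholds default to the illustrative values from the text:
    150/min is considered normal, 75/min is considered serious.
    """
    if logins_per_min <= alarm_at_or_below:
        return "alarm"
    if logins_per_min < alert_below:
        return "alert"
    return "ok"


def average(samples):
    """Plain arithmetic mean of a non-empty list of per-minute counts."""
    return sum(samples) / len(samples)


# Hypothetical per-minute counts of successful logins for the last 5 minutes.
last_five_minutes = [148, 120, 96, 90, 81]
status = classify_login_rate(average(last_five_minutes))
print(status)  # the average is 107/min, so this prints "alert"
```

Averaging over a few minutes before comparing keeps a single slow minute from waking anybody up.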
Backends – Not losing sales
Here in Chile, we have a very peculiar online payment issue: there is only one credit card operator, which is owned by the local banks. A natural monopoly.
As in every single-player market, quality of service (QoS) is not very good. You would like to know right away if your online store is losing sales because of a third-party service failure.
Payment rejection is a business-critical metric in e-commerce. You’ll want to know fairly precisely why your users are clicking the Pay button but your orders are not getting fulfilled.
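As a rough sketch of what measuring this could look like, the Python snippet below computes a rejection rate and groups the rejections by reason. The record layout and the reason codes (`gateway_timeout`, `insufficient_funds`) are hypothetical; a real payment operator will report its own codes.

```python
from collections import Counter


def rejection_summary(attempts):
    """Summarize payment attempts.

    attempts: list of (approved, reason) tuples, where reason is None
    for approved payments. Returns (rejection_rate, reasons), with
    reasons sorted from most to least frequent.
    """
    total = len(attempts)
    rejected = [reason for approved, reason in attempts if not approved]
    rate = len(rejected) / total if total else 0.0
    return rate, Counter(rejected).most_common()


# Hypothetical sample of payment attempts.
attempts = [
    (True, None),
    (False, "gateway_timeout"),
    (False, "gateway_timeout"),
    (False, "insufficient_funds"),
    (True, None),
]

rate, reasons = rejection_summary(attempts)
print(f"rejection rate: {rate:.0%}")
for reason, count in reasons:
    print(f"  {reason}: {count}")
```

Splitting the rejections by reason is what tells a third-party failure apart from customers who simply ran out of credit.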
That repetitive failure
In many production environments, there is that «special» service that crashes at regular intervals for no apparent reason. There are other priorities, so the operator on call restarts the service and off you go!
Correlating downtime events with server metrics is a very powerful tool for narrowing down the cause of a constantly failing service.
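A minimal sketch of that correlation, assuming you have a list of crash times and a time-sorted series of one server metric (here memory usage in percent, with times in hours; all numbers are invented for illustration). For each crash it looks up the last reading taken before the service went down.

```python
from bisect import bisect_right


def last_sample_before(samples, crash_time):
    """Return the value of the latest sample taken at or before crash_time.

    samples: list of (timestamp, value) tuples sorted by timestamp.
    Returns None if no sample precedes the crash.
    """
    times = [t for t, _ in samples]
    i = bisect_right(times, crash_time)
    return samples[i - 1][1] if i else None


# Hypothetical memory usage (%) sampled at the given hour marks.
memory_pct = [(0, 20), (12, 45), (24, 70), (36, 88), (47, 97),
              (49, 18), (72, 66), (95, 96)]
# Hypothetical crash times, in hours.
crashes = [48, 96]

for crash in crashes:
    print(f"crash at hour {crash}: memory was at "
          f"{last_sample_before(memory_pct, crash)}%")
```

If every crash lines up with memory close to 100%, a leak becomes a much stronger suspect than bad luck.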
Even if we are talking about a memory leak bug like this one, which kills your app every 48 hours, it is a very good option to automate the restart until the bug is found. Nobody likes to connect to that VPN during a Sunday family BBQ.
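The idea of an automated restart can be sketched in a few lines of Python. This is only an illustration, with a hypothetical function name and a stand-in command that always fails; in practice a process supervisor (for example systemd with `Restart=on-failure`) would usually do this job.

```python
import subprocess
import sys
import time


def run_with_restarts(cmd, max_restarts=3, delay_s=0.0):
    """Run cmd and restart it whenever it exits with a non-zero code.

    Stops when the command exits cleanly or after max_restarts
    restarts. Returns the number of restarts performed.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0 or restarts >= max_restarts:
            return restarts
        restarts += 1
        time.sleep(delay_s)  # small pause before bringing it back up


# Stand-in for the crashing service: a command that always exits with 1.
crashing = [sys.executable, "-c", "raise SystemExit(1)"]
print(run_with_restarts(crashing, max_restarts=2))
```

The restart cap matters: if the service dies immediately every time, you want a human to be told, not an endless loop.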
Extending traditional IT monitoring with full-stack metrics gives the organization a powerful information tool. It helps not only to discover a failure’s root causes, but also to create a more transparent and collaborative environment, giving the business teams information about the systems that affect their relationship with the clients.
Datadog helps teams across different areas of an organization gain insight into and control over the IT products and services they are involved with. Metrics & monitoring over OS + databases + services + apps + business; all on one TV screen.
Bithaus Software – Datadog partner from Chile.