I’m really passionate about applications crashing! JK, I mostly don’t care and find it super dull. But it is a thing that matters to customers (well, mostly that they don’t but same difference).
I had my first run-in with application monitoring in 2000 when I spent my nights watching cable TV and browsing the entire internet under the guise of actively monitoring hospital systems, shipping systems and other systems important enough that companies paid enough money to my employer that they paid me money to acquire really up-to-date knowledge of web-comics. We used some tool whose name I’ve forgotten, but it was really shit. It would turn from green to yellow or red when we needed to do something, except sometimes we were just supposed to ignore yellow or red because the system sucked.
Now, almost 20 years later, the situation is largely unchanged. Application monitoring still largely sucks. I’ve surveyed the market several times in the hope of finding this elusive non-sucky tool out there.
One class of tools is the enterprise-level tools. One such tool that I’m reasonably familiar with is ManageEngine’s Applications Manager. It is what Hitler would be if Hitler were a Java application for monitoring applications. I believe I even used that description in mails to management. It has monitors for a lot of standard components, and if that is what you want to monitor, it can give you the impression that you do. By default it displays a lot of metrics and charts, so you can feel in control without having any idea what you are monitoring and why. Every time I asked for something that wasn’t just monitoring free disk space on a Linux server, the answer was almost always that that would not be possible. I later found out that you could do it using custom scripting, but it did not have tools for very common and sane tasks like calling a SOAP or REST service periodically and monitoring the outputs. On top of that, it was buggy as hell. It was half a decade or more behind on Java support, so if you had something fancy like non-broken crypto or secure certificates, it would just shit itself. I had to set up monitoring of a database beyond just “does it live” to check that an application was not just up but also processing data, and found to my delight that the tool for doing this was broken (a single-line entry field for a complex query, the tool forgetting any alarm I had set up if I changed the query, etc.). The only redeeming feature was that the license was so exorbitantly expensive that we all shared a test account with administration privileges.
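The "call a REST service periodically and monitor the outputs" task mentioned above is small enough to sketch with nothing but the Python standard library. This is only an illustration under stated assumptions: the URL, the `queue_depth` field and the threshold are hypothetical, and a real check would also need scheduling and alerting around it.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def evaluate(payload, field, max_value):
    """Return 'green' or 'red' based on one numeric field in the reply."""
    value = payload.get(field) if isinstance(payload, dict) else None
    if not isinstance(value, (int, float)):
        return "red"  # missing or non-numeric output counts as a failure
    return "green" if value <= max_value else "red"


def check_once(url, field, max_value, fetch=fetch_json):
    """Run a single check; network and parse errors are reported as 'red'."""
    try:
        return evaluate(fetch(url), field, max_value)
    except (OSError, ValueError):
        # URLError/HTTPError are OSErrors; bad JSON raises a ValueError.
        return "red"


def monitor(url, field, max_value, interval_seconds=60, rounds=3):
    """Poll the service a fixed number of times and print each result."""
    for _ in range(rounds):
        print(time.strftime("%H:%M:%S"), check_once(url, field, max_value))
        time.sleep(interval_seconds)
```

Something like `monitor("https://example.invalid/status", "queue_depth", 100)` would then print one colour per minute; the point is that the check looks at what the service says, not just at whether it answers.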
Then there are the up-and-coming technologies that claim to be able to monitor things for you. If you’ve done anything computey in the past decade, you have heard of the ELK stack (Elasticsearch, Logstash, Kibana). It is a great way of storing all your logs in a database so slow that you need a small car’s worth of an AWS subscription to even run it, and even so it will take hours if you have to restart it. If you need to look something up from the logs, your best bet is to explain to the customer that this is simply impossible because of solar flares or whatever. Amazon has their own version of this called CloudTrail which is kind-of ok. Grafana also has a tool, in some level of beta stage, called Loki. It is also ok, but only by comparison to the ELK stack. I have reasonably high hopes for Loki, though; I like what Grafana does.
Log databases, while related to what we are interested in, do not really provide monitoring. Part of monitoring is collecting information about the performance of systems. The really old-fashioned way to do that is to gather this information from logs. CloudTrail has reasonable tools for extracting data from logs if you have the patience to set up data aggregation. The ELK stack can do some parsing of data, but good luck coercing Kibana to show you a chart that makes any notion of sense. You can use an ELG stack (is that a thing?) where we replace Kibana with Grafana to get meaningful data. Grafana is an excellent dashboarding application that makes it easy to set up dashboards and gain insight into your logged data. It is like a poor man’s version of Tableau or PowerBI, but more tailored to setting up pre-cooked dashboards than to interactive data analysis. In the past I’ve used Grafana together with some custom scripts for filtering logs and storing them in Elasticsearch to get very detailed insight into the operations of two customers’ entire environments, with a bit of custom logging of all incoming and outgoing calls to our ESB component.
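To give an idea of what such a log-filtering script boils down to, here is a minimal sketch that turns call logs into per-minute metrics. The log line format (`call=`, `duration_ms=`, `status=`) is made up for the example and is not the format of any particular ESB; storing the result in Elasticsearch or anywhere else is left out.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical format: "2019-11-03T12:00:01Z call=getOrder duration_ms=100 status=200"
LINE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}Z "
    r"call=(?P<call>\S+) duration_ms=(?P<ms>\d+) status=(?P<status>\d{3})$"
)


def aggregate(lines):
    """Group call count, mean duration and 5xx errors per (minute, call)."""
    durations = defaultdict(list)
    errors = defaultdict(int)
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a call log line
        key = (match["minute"], match["call"])
        durations[key].append(int(match["ms"]))
        if match["status"].startswith("5"):
            errors[key] += 1
    return {
        key: {"count": len(values), "mean_ms": mean(values), "errors": errors[key]}
        for key, values in durations.items()
    }
```

Each entry in the returned dictionary is one data point that a dashboard could plot as a time series per call.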
The hot new way of doing performance logging five years ago was using API management. The idea is that APIs need a lot of common functionality and there is no need to build that for each component. For example, most APIs need authentication and access control (using certificates, keys or other things), rate limiting and DoS protection, load balancing and HA services, tracking for payment, and performance monitoring. Especially with the advent of micro-services doing away with a centralized single-point-of-failure ESB, it makes sense to introduce a new service you use to route all your APIs thru, optionally adapting them as needed. Yes, an API manager is just a slightly different flavor of an ESB. One of the big problems IMO is that there never was a good open source API manager, so people never got used to playing with it for toy projects. Sure, there are products like Tyk which is free, but has a learning curve close to a vertical line and adds little for the hobby programmer. Other free alternatives either suffer from being unstable, unmanageable pieces of shit, like WSO2 API Gateway, or got shitcanned by their owners, like RedHat’s very promising Apiman (which I had really high hopes for), or lock all the neat features that could draw people in behind payment, like RedHat’s 3scale (the cloud-only solution that replaced the much nicer Apiman). Amazon also has an API gateway, which is probably fine, but I’ve not seen anybody really stanning for it. So, API management was a technology that never really caught on.
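As an illustration of the kind of common functionality listed above, here is a sketch of per-API-key rate limiting with a token bucket, sitting in front of an arbitrary backend function. Token bucket is just one common technique; nothing here is taken from Tyk, WSO2, Apiman or 3scale, and the class names and limits are invented for the example.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Gateway:
    """Hypothetical mini gateway: one bucket per API key, then call the backend."""

    def __init__(self, backend, rate_per_second=5, capacity=10, clock=time.monotonic):
        self.backend = backend
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def handle(self, api_key, request):
        if api_key not in self.buckets:
            self.buckets[api_key] = TokenBucket(self.rate, self.capacity, self.clock)
        if not self.buckets[api_key].allow():
            return 429, "rate limit exceeded"
        return 200, self.backend(request)
```

Because every call passes through `handle`, the same place is also where a gateway would naturally count requests and measure latency, which is why performance monitoring tends to come bundled with it.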
The modern replacement for the API manager is the service mesh in cloud native virtualization platforms like Kubernetes. The two big ones today are Istio and Knative. They offer pretty much the same functionality as API managers but in the cloud. They also offer a couple of extra features like canary deployments (a close cousin of blue/green deployments), where you roll out new functionality to part of your users (e.g., only disabling the Instagram like button in Canada or at random breaking some fundamental feature in any Google service for 20% of users). Istio is pretty much just a service mesh while Knative is a serverless implementation with most of the features of a service mesh. The market has consolidated somewhat, so we are pretty much down to just these two options. At the time of writing it is not clear which one everybody will standardise on, but I would probably look into Istio for new deployments these days.
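The "new functionality for part of your users" idea can be sketched in a few lines. This is not how Istio or Knative implement traffic splitting (they do it in the proxy layer, typically by weight per request); it is a minimal stand-in that hashes a user id so that each user consistently lands on the same version. The function names and the percentage are assumptions for the example.

```python
import hashlib


def bucket_for(user_id, buckets=100):
    """Map a user id to a stable bucket number in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def pick_version(user_id, new_version_percent):
    """Send roughly `new_version_percent` percent of users to the new version."""
    return "new" if bucket_for(user_id) < new_version_percent else "stable"
```

With `new_version_percent=20`, about one in five users gets the new version, and the same user gets the same answer on every request, so nobody flips back and forth between versions.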
Service meshes typically gather performance data (and other data) and store it in a database, allowing inspection either directly from the service mesh or from an external dashboarding application. In the open source world, the de-facto standard is to store the data in Prometheus and visualise it using Grafana. Grafana I’ve already praised. Prometheus is a bit of a strange beast; it pretty much started as a system monitoring tool that collects data applications publish about their own operation. Contrary to the enterprise monitoring system, where an application runs on one server and actively collects data from individual applications using various agents, Prometheus instead expects each application to expose its own metrics over HTTP, which Prometheus then scrapes (short-lived jobs can push via a separate gateway). That means Prometheus doesn’t need to worry about agents for all kinds of systems or whether their SQL editor works, but it also means that it is less usable for application monitoring out of the box; if an application is down, the data simply stops arriving, and if you want to actually be told about this, you need yet another component (Alertmanager), adding complexity. Prometheus stores all of the data in a time series database, and these days it is really better known as a time series database than as a monitoring system, which is why service meshes like Istio make use of it. Prometheus also has a Kubernetes operator which can set up scraping of data from applications into Prometheus, making it the de-facto standard for time series storage in a modern Kubernetes landscape, pretty much superseding previous contenders like InfluxDB, Graphite, and OpenTSDB from the time series database domain and tools like Hawkular from the monitoring domain.
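To make the application side of this concrete: in the usual setup the application publishes its numbers on an HTTP endpoint in Prometheus’s plain text exposition format, and Prometheus collects them from there. The sketch below does that with only the standard library. The metric names and the port are made up, and a real application would more likely use an official client library than hand-roll the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counters the application would increment as it works.
COUNTERS = {"myapp_requests_total": 0, "myapp_errors_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Note what this does not give you: if the process dies, the endpoint just goes away, and something else has to notice the silence.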
The standardisation in the cloud domain has also brought other niceties; Kubernetes supports automatic liveness testing of applications, and can restart applications automatically if they do not respond or do not respond in a timely manner. With hooks like Spring Boot’s actuators in particular, this liveness probe can be set up with minimal developer action, even checking relevant external systems your application depends on (such as connectivity to databases or LDAP servers).
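The idea behind such a health endpoint is simple enough to sketch: run one small check per dependency and fold the results into a single answer. The sketch below is loosely modelled on the shape of Spring Boot’s health output but is not its implementation, and the check names are hypothetical.

```python
def health(checks):
    """Run named dependency checks and return (http_status, report).

    `checks` maps a component name to a zero-argument callable that
    returns a truthy value when the dependency is reachable.
    """
    components = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that blows up counts as DOWN
        components[name] = {"status": "UP" if ok else "DOWN"}
    all_up = all(c["status"] == "UP" for c in components.values())
    report = {"status": "UP" if all_up else "DOWN", "components": components}
    return (200 if all_up else 503), report
```

An HTTP probe in Kubernetes treats any status code from 200 up to (but not including) 400 as success, so returning 503 when a dependency is down is enough for the probe to register a failure.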
Even with all of the much-needed consolidation of application monitoring in the cloud space, there is still a gap left over: checking operation from a user PoV, even in the case of a DC disconnect. It doesn’t help anything that an application can connect to its database and is sending performance statistics to its Prometheus instance if the entire datacenter is not reachable from the internet, and no amount of automatic restart is going to fix or detect this. There is still a need for enterprise monitoring the old-fashioned way. Luckily, there are open source alternatives in this space as well, many of them very mature, such as OpenNMS, Nagios, and Zabbix. These are very much like the enterprise systems in that they can display simple charts relatively easily and are an absolute nightmare for anything more complex. They trade exorbitant licensing fees for that open source feel.
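The distinction that matters for a check from the outside is between "the application answered and is unhappy" and "nothing could be reached at all". A minimal sketch of a probe that keeps the two apart, assuming it runs somewhere outside the datacenter; the three result labels are invented for the example.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Classify a URL as 'up', 'unhealthy' or 'unreachable'."""
    try:
        with opener(url, timeout=timeout) as response:
            return "up" if 200 <= response.status < 400 else "unhealthy"
    except urllib.error.HTTPError:
        # The server answered, but with a 4xx/5xx status.
        return "unhealthy"
    except OSError:
        # DNS failure, refused connection, timeout: nothing answered.
        return "unreachable"
```

An in-cluster liveness probe can only ever report the first two states; "unreachable" is the one that needs a vantage point outside.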
It has been years (15 or so) since I last played around with Nagios and OpenNMS, but I presume nothing has really changed since then 🙂 I’ve recently been playing with Zabbix. It is every bit as shit and as good as Applications Manager. It can monitor basic services, though documentation is really dodgy on certain specifics. For example, Zabbix requires an agent running on individual systems for a lot of things, and the configuration of this is a bit of a hassle. Even worse, for Java applications, it requires a special Java agent running on the system as well, and you cannot connect to more than one such agent from each server. Unless you introduce a special Zabbix proxy, which can talk to at most one Java agent as well, so now we have 3 monitoring processes running: the Zabbix agent, the Zabbix proxy, and the Zabbix Java agent. It is in principle possible to just run a single proxy and Java agent in each customer environment if they can have access to all systems, but that requires JMX to be open inside the network (and in regular Java style, setting up encryption is a nightmare). Don’t forget, we need a proxy for each Java agent, so we cannot just run a single proxy to aggregate data from multiple Java agents. Did I mention that the proxy needs a database as well? Also, the proper way to connect agent, Java agent and proxy together with the server is barely documented. Open Source!
If you get all of the Zabbix processes running on your servers, the next issue is that you cannot monitor multiple Java applications from that server easily. In Zabbix, everything is tied to hosts, and you can in principle add multiple agent interfaces to each host, so with your two agent processes and one proxy process, you can in principle add multiple JMX agent interfaces, one for each application. However, in Zabbix, everything is also tied to templates, and you cannot use the same template twice for a host. Even if you copy a template, Zabbix will not allow you to use the two different but identical templates on the same host, because it refuses to monitor the same thing twice. And templates do not tie to interfaces but to hosts. So while you can add multiple interfaces, you cannot use them without manually copying the JMX template and manually editing it using what is known in Zabbix communities as the “space workaround.” Yes, this is a known issue that has been around since fucking 2013 and is encountered often enough that the hack around it (tricking the application into accepting identical templates by adding spaces so it cannot detect that you are checking the same thing more than once) has gotten a name in the community. Open Source!
Zabbix also requires you to create hosts for fucking anything. Want to monitor a web endpoint? Create a host. Sure, you can create a dummy host, but you need to have at least one interface, so you need to uselessly monitor a dummy host for each web-endpoint (or group of web-endpoints). Monitoring a web-endpoint is also completely different from monitoring anything else for some reason. You also cannot just plop in a URL and have it monitored. No, you need to create a web-scenario, add a step to the scenario, and then add a trigger for your web-scenario to get any alerts. For the most common feature, I need to create and configure three artifacts (not counting the garbage host and dummy interface you have to set up). I cannot make hosts correspond to physical hosts, and applications cannot live without hosts in Zabbix, so the abstractions just no longer correspond to how we do computing in the modern world. Open Source!
Zabbix also doesn’t integrate with enterprise sign-on. If you use an LDAP server you can connect Zabbix to it, but anything more modern is a no-go. It has a weird web authentication mode that allows you to put an authentication proxy in front of it, assuming you have already created the user in Zabbix. So, to authenticate against an OAuth provider like everybody does nowadays because it is no longer the fricking 90s, you need a reverse proxy for HTTPS off-loading and an OIDC connector for OpenID Connect off-loading (because we use nginx we need an external application for this or this wonderful solution from RedHat). Somebody even made a merge request to support OIDC but it got discarded in favour of this “elegant” solution. Now we have multiple processes running to make Zabbix slightly less open than Goatse, and just have to accept that login is dodgy as all hell and routinely fails. Whether that is due to Safari, the Keycloak Gatekeeper, or Zabbix I frankly don’t know, and I have lost my will to live, er, investigate further. Open Source!
As much as I’ve harped on Zabbix, it is not too bad. I can make it mostly do what I want. It is extensible enough that I can add in support for sending text messages, functionality for checking certificate validity, and similar things you would expect from a base product but which don’t exist because Open Source!
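A certificate validity check of the kind mentioned above is a good example of how little code such an extension needs. The sketch below is not a Zabbix plugin (how the number gets into Zabbix is left out); it just connects to a host, reads the certificate and reports the days left before it expires. The warning threshold you would alert on is up to you.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_left(not_after, now=None):
    """Days until expiry, given the 'notAfter' string from getpeercert()."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


def certificate_days_left(host, port=443, timeout=5.0):
    """Connect over TLS and return the days left on the server certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return days_left(tls.getpeercert()["notAfter"])
```

A monitoring item would then call `certificate_days_left("example.org")` and raise a warning once the result drops below, say, 14. Note that with the default context an already expired or otherwise invalid certificate makes the handshake fail with an exception rather than return a negative number, which is worth treating as an alert in its own right.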
At the end of the day, what I really want is an integrated solution that is good. I want my logs stored and actually searchable. I want to be able to extract metrics from logs and to gather metrics using service meshing, API management or whatever the new name for an ESB is. I want proper active uptime monitoring in addition to my in-cluster liveness probe. It seems I am not the only one wanting that; everybody is adding these features to their stacks. CloudTrail is close to being there if you are on Amazon. I have not followed ELK too closely as I hate Kibana and dislike Elasticsearch. I am very intrigued by Grafana’s endeavours into log collection, because they have quite decent dashboarding. Grafana is so intimately tied to Prometheus, though, that I have a hard time seeing a proper unified solution from them. A lot of the other time series databases claim to support unified enterprise monitoring as well, but that is mostly lies.
But here, almost 20 years after my first dip into application monitoring, the tools still suck. I guess in part because it is a problem that is easy in theory but hard in practice. You need to support “all” applications out of the box, you need to have simple configuration for most common weird things, and you need to support extension to do anything. At the same time, you need to do this in a way that is simple to set up. I will not configure all the standard Java or Linux or MySQL or IMAP or whatever metrics; I want to easily add a web-check, both for user-facing pages and web-services; and I don’t want to do any of this using abstractions that are so general that I have to configure 5 artifacts to check a web-service (user-facing, that is; it is fine if there’s a general abstraction behind the scenes). Now, I’ve got Zabbix kind-of working so I won’t bother with another tool unless it can truly handle logging, performance monitoring and health monitoring in a single uniform interface. Are there any good stacks out there?
Time person of the year 2006, Nobel Peace Prize winner 2012.